Probabilistic Modeling for Whole Metagenome Profiling
Existing Markov model implementations struggle to handle large amounts of metagenomic data while maintaining comparable or better classification accuracy. To address this, we developed a new algorithm based on a pseudo-count supplemented standard Markov model (SMM), which leverages higher-order models to classify reads more robustly at different taxonomic levels. Assessment on simulated metagenomic datasets showed that, overall, SMM classified reads to their respective taxa more accurately than the interpolated methods at all ranks. Higher-order SMMs (9th order or greater) also outperformed BLAST alignments in assigning taxonomic labels to metagenomic reads at genus and higher ranks in tests that masked the species from which each read originated (that is, its genome model) in the database. Similar results were obtained when masking at other taxonomic ranks, which simulates plausible scenarios in which the source of a read is not represented in the genome database at a given taxonomic level. The performance gap became more pronounced at higher taxonomic levels. To eliminate contamination in datasets and to further improve our alignment-free approach, we developed a new framework based on a genome segmentation and clustering algorithm. This framework allowed removal of adapter sequences and contaminant DNA and generated clusters of similar segments, from which representative read fragments were sampled to constitute training datasets. The parameters of a logistic regression model were learnt from these training datasets using a Bayesian optimization procedure, which allowed us to establish thresholds for classifying metagenomic reads by SMM. This led to the development of POSMM (Python Optimized Standard Markov Model), a Python-based frontend that combines our SMM algorithm with the logistic regression optimization. POSMM provides a much-needed alternative to existing metagenome profiling programs. Because our algorithm builds genome models on the fly, it obviates the need to build a database; it complements alignment-based classification and can be used in concert with alignment-based classifiers to improve metagenome profiling.
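To make the idea of a pseudo-count supplemented, fixed-order Markov model concrete, the following is a minimal sketch and not the POSMM implementation. It assumes additive pseudo-counts with a value of 1.0, a four-letter DNA alphabet, and scoring by summed transition log-probabilities; the class and function names (`StandardMarkovModel`, `classify`) and the toy genomes are hypothetical, and reverse complements and the initial k-mer probability are ignored for brevity.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


class StandardMarkovModel:
    """Fixed-order Markov model over DNA with additive pseudo-counts (illustrative)."""

    def __init__(self, order, pseudocount=1.0):
        self.order = order              # k: number of preceding bases used as context
        self.pseudocount = pseudocount  # assumed value; added to every transition count
        self.counts = defaultdict(Counter)

    def fit(self, genome):
        """Count (k-mer context -> next base) transitions along one genome string."""
        k = self.order
        for i in range(len(genome) - k):
            context, nxt = genome[i:i + k], genome[i + k]
            # Skip windows containing ambiguous bases such as N.
            if nxt in ALPHABET and all(b in ALPHABET for b in context):
                self.counts[context][nxt] += 1
        return self

    def log_prob(self, context, nxt):
        """log P(nxt | context) with pseudo-counts; unseen contexts become uniform."""
        row = self.counts.get(context, {})
        total = sum(row.values())
        p = (row.get(nxt, 0) + self.pseudocount) / (
            total + len(ALPHABET) * self.pseudocount
        )
        return math.log(p)

    def score(self, read):
        """Sum of transition log-probabilities over the read (initial k-mer term omitted)."""
        k = self.order
        return sum(self.log_prob(read[i - k:i], read[i]) for i in range(k, len(read)))


def classify(read, models):
    """Return (label, score) for the model giving the read the highest log-likelihood."""
    scores = {label: model.score(read) for label, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy usage: two made-up "genomes" and a low order so the example stays tiny.
models = {
    "genome_A": StandardMarkovModel(order=2).fit("ACGT" * 50),
    "genome_B": StandardMarkovModel(order=2).fit("AATT" * 50),
}
label, score = classify("ACGTACGTACGT", models)
print(label, round(score, 3))
```

The pseudo-count keeps every transition probability non-zero, which matters at high orders (9th or greater) where most contexts are never observed in a given genome. This sketch simply picks the best-scoring model; the score thresholds that the abstract describes as learnt by logistic regression are not modelled here.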