
Reducing haystacks to needles – ViralClust: A Nextflow pipeline to cluster viral sequences

Abstract: The rapid accumulation of viral genome sequences presents major challenges for downstream analysis tools, including tools for multiple sequence alignment, phylogeny, and genome/alignment visualization, due to computational constraints and sampling biases caused by outbreak-driven over-representation. Selecting representative genomes through clustering offers a principled alternative to random subsampling, yet choosing an appropriate clustering strategy remains non-trivial and context-dependent. Here, we present ViralClust, a modular Nextflow pipeline for bias-aware representative selection from large viral genome datasets. ViralClust integrates five distinct clustering algorithms (CD-HIT-EST, SUMACLUST, VSEARCH, MMSeqs2, and HDBSCAN) within a unified workflow, enabling direct comparison of clustering outcomes and flexible adaptation to diverse biological questions while taking into account a balanced phylogenetic distribution of the selected sequences. We evaluated ViralClust on six RNA and DNA virus datasets ranging from 632 to 156,586 sequences and spanning genome lengths from 890 to 197,185 nucleotides. Across all datasets, clustering reduced dataset size by 95% or more while preserving genetic diversity across species, genera, and families, and effectively mitigating biases introduced by outbreaks, partial genomes, and sequence orientation artifacts. By supporting whole-genome clustering and scalable representative selection, ViralClust enables efficient and reproducible downstream analyses that would otherwise be computationally infeasible. Our framework provides a flexible foundation for large-scale viral genomics and supports future applications in comparative analysis and virus classification.
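To make the idea of cluster-based representative selection concrete, the following is a minimal toy sketch in Python. It is not ViralClust's implementation and does not call any of the five tools named in the abstract. It assumes a greedy, longest-first clustering scheme with a k-mer Jaccard similarity as a stand-in for sequence identity; the function names, the `threshold` and `k` parameters, and the example sequences are all hypothetical.

```python
from typing import Dict, List, Set


def kmer_set(seq: str, k: int = 4) -> Set[str]:
    """Return the set of overlapping k-mers of a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_representatives(
    seqs: Dict[str, str], threshold: float = 0.9, k: int = 4
) -> Dict[str, List[str]]:
    """Toy greedy clustering (an assumption, not the pipeline's method).

    Sequences are visited longest first. Each one joins the first existing
    representative whose k-mer Jaccard similarity is >= threshold; otherwise
    it becomes a new representative. Returns {representative: members}.
    """
    clusters: Dict[str, List[str]] = {}
    rep_kmers: Dict[str, Set[str]] = {}
    # Sort by decreasing length, then by name for a deterministic order.
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        kmers = kmer_set(seqs[name], k)
        for rep, rk in rep_kmers.items():
            if jaccard(kmers, rk) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No sufficiently similar representative was found.
            clusters[name] = [name]
            rep_kmers[name] = kmers
    return clusters


def reduction_percent(n_input: int, n_representatives: int) -> float:
    """Percentage by which representative selection shrinks the dataset."""
    return 100.0 * (1 - n_representatives / n_input)


if __name__ == "__main__":
    # Hypothetical toy sequences: three near-identical entries and one outlier.
    toy = {
        "a": "ACGTACGTACGTACGT",
        "b": "ACGTACGTACGTACG",
        "c": "TTTTGGGGCCCCAAAA",
        "d": "ACGTACGTACGTACGT",
    }
    result = greedy_representatives(toy)
    print(result)
    print(reduction_percent(len(toy), len(result)))
```

In this sketch an over-represented group of near-identical sequences collapses to a single representative, which is the intuition behind mitigating outbreak-driven sampling bias. The toy similarity ignores reverse-complemented input, so it would not handle the sequence orientation artifacts mentioned in the abstract.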
