Zygosity-Aware DNA Language Modeling Improves Ancestry and Gene Expression Prediction
Abstract
DNA language models (DNA-LMs) are transforming how genomic sequence information is represented and interpreted. Yet most current approaches treat DNA as a single sequence, overlooking the diploid structure and the zygosity information that distinguish the two parental copies of the genome. Here, we systematically evaluate explicit diploid, zygosity-aware representations in DNA-LMs on two downstream tasks: ancestry classification and gene expression prediction. For ancestry, we use HyenaDNA embeddings of the extended MHC region and show that concatenating maternal and paternal haplotype embeddings consistently improves predictive performance across five superpopulations compared with single-haplotype inputs. For gene expression, we compare convolutional neural networks (CNNs) trained from scratch with Nucleotide Transformer models, using reference-only, single-copy, and two-copy (zygosity-aware) sequence encodings. CNN performance improves when genetic variation and zygosity are incorporated via a simple additive genotype encoding, whereas naïvely injecting variation into pretrained Nucleotide Transformer models yields mixed effects, highlighting a mismatch between current pretraining objectives and variation-sensitive prediction. Together, our results demonstrate that zygosity-aware representations can capture biologically meaningful information beyond reference-only views, and they underscore the need for diploid- and population-aware pretraining strategies in future DNA-LMs for variant interpretation and precision medicine.
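The two zygosity-aware representations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "additive genotype encoding" means summing the one-hot encodings of two aligned, equal-length haplotype sequences (so indels are out of scope), and that haplotype embeddings are combined by plain vector concatenation. The function names `one_hot`, `additive_diploid_encoding`, and `concat_haplotype_embeddings` are hypothetical and chosen for illustration only.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order for the one-hot encoding


def one_hot(seq):
    """Return an (L, 4) one-hot matrix; non-ACGT characters give all-zero rows."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def additive_diploid_encoding(maternal, paternal):
    """Sum the one-hot encodings of two aligned haplotypes (assumed scheme).

    A homozygous position yields a single entry equal to 2; a heterozygous
    position yields two entries equal to 1, so zygosity is visible per base.
    """
    if len(maternal) != len(paternal):
        raise ValueError("haplotypes must be aligned and of equal length")
    return one_hot(maternal) + one_hot(paternal)


def concat_haplotype_embeddings(emb_maternal, emb_paternal):
    """Concatenate two per-haplotype embedding vectors along the feature axis."""
    return np.concatenate([emb_maternal, emb_paternal], axis=-1)


# Toy example: the two haplotypes differ only at the middle position (C vs T).
diploid = additive_diploid_encoding("ACG", "ATG")

# Placeholder vectors standing in for embeddings from a pretrained DNA-LM.
features = concat_haplotype_embeddings(np.ones(3), np.zeros(3))
```

In this toy case the first and last rows of `diploid` carry a single 2 (homozygous A and G), while the middle row carries a 1 in both the C and T channels (heterozygous). A single-haplotype or reference-only input would lose that distinction, which is the kind of information the abstract argues zygosity-aware encodings retain.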

