
Zygosity-Aware DNA Language Modeling Improves Ancestry and Gene Expression Prediction

Abstract: DNA language models (DNA-LMs) are transforming how genomic sequence information is represented and interpreted. Yet most current approaches treat DNA as a single sequence, overlooking the diploid structure and zygosity information that distinguish the two parental copies of the genome. Here, we systematically evaluate explicit diploid, zygosity-aware representations in DNA-LMs for two downstream tasks: ancestry classification and gene expression prediction. For ancestry, we use HyenaDNA embeddings of the extended MHC region and show that concatenating maternal and paternal haplotype embeddings consistently improves predictive performance across five superpopulations compared to single-haplotype inputs. For gene expression, we compare convolutional neural networks (CNNs) trained from scratch with Nucleotide Transformer models using reference-only, single-copy, and two-copy (zygosity-aware) sequence encodings. CNN performance improves when genetic variation and zygosity are incorporated via simple additive genotype encoding, whereas naïvely injecting variation into pretrained Nucleotide Transformer models yields mixed effects, highlighting a mismatch between current pretraining objectives and variation-sensitive prediction. Together, our results demonstrate that zygosity-aware representations can capture biologically meaningful information beyond reference-only views and underscore the need for diploid- and population-aware pretraining strategies in future DNA-LMs for variant interpretation and precision medicine.
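The two zygosity-aware representations named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: "additive genotype encoding" is read as the position-wise sum of the one-hot encodings of the two haplotypes, and the haplotype embeddings are stand-in lists of floats rather than real HyenaDNA outputs. The function names `one_hot`, `additive_encoding`, and `concat_embeddings` are hypothetical.

```python
BASES = "ACGT"

def one_hot(seq):
    """One 4-element row per base; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def additive_encoding(maternal, paternal):
    """Assumed reading of 'additive genotype encoding': sum the one-hot rows
    of two aligned haplotypes. A homozygous site yields a single 2; a
    heterozygous site yields two 1s."""
    if len(maternal) != len(paternal):
        raise ValueError("haplotypes must be aligned to the same length")
    return [
        [m + p for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(one_hot(maternal), one_hot(paternal))
    ]

def concat_embeddings(maternal_emb, paternal_emb):
    """Two-copy representation: join the per-haplotype embedding vectors."""
    return list(maternal_emb) + list(paternal_emb)

# Toy haplotypes that differ at the third position (G vs. T).
encoded = additive_encoding("ACGT", "ACTT")
# Placeholder embeddings standing in for per-haplotype model outputs.
diploid_emb = concat_embeddings([0.1, 0.2], [0.3, 0.4])
```

In this sketch the summed encoding does not depend on which haplotype is maternal, whereas the concatenated embedding does, so a downstream classifier sees the parental order only in the second case.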
