Seq2KING: An unsupervised internal transformer representation of global human heritages

Abstract

Determining the intricate tapestry of human genetic relationships is a central challenge in population genetics and precision medicine. We propose that the principles of lexical connectivity, in which words derive meaning from their contextual interactions, can be adapted to genetic data, enabling transformer models to reveal that individuals with higher genetic similarity form stronger latent connections. We explored this by transposing KING kinship-related matrices into the query, key, value (QKV) latent space within transformer models and determined that attention mechanisms can capture genetic relatedness in an unsupervised fashion. We found that individuals from the same continent had an attention weight connectivity of 85.34% (p<0.05), compared with individuals from different continents. Surprisingly, we found that some encoder layers required inversion of their latent representations for this connectivity to become obvious. Lastly, we used BERTViz to create human-readable hyper-dense connectivity patterns among individuals.

Our approach is based purely on attention, which yields a non-discrete spectrum of relatedness and thus uncovers patterns from first principles. Seq2KING addresses the significant challenge of discovering population structures to construct a global human relatedness map without relying on predefined labels. Our excavation of the latent space is a paradigm shift from legacy supervised genetic methodologies, and it presents a new way to understand the human pangenome and to discern population substructures for creating precision genetic medicines.

Non-Expert Description

Is it possible to build artificial intelligence (AI) to read the human genome as a first language? Why would one want such AI? We at Ecotone believe that such AI will provide the genetic coordinates needed to manufacture CRISPR medications to cure about 10,000 genetic diseases. How does one build such AI?

Our recently released model dnaSORA proposed a means to assign meaning to every single token (typically referred to as a base) of all 3 billion tokens in the human genome (Koreniuk & Njie, 2025). This builds the vocabulary for reading the human genome as a first language. For dnaSORA to work, it needs to know the heritages of the people who are in its model of our genetics. We mostly rely on country, culture and geography to determine our heritages, but this is too error-prone for dnaSORA. In our experience, legacy genetic approaches such as those used by 23andMe are also error-prone.

Our research here introduces Seq2KING, a new artificial intelligence method based on excavating the insides of transformers to uncover hidden patterns of genetic relatedness among people around the world, without needing any prior labels or categorizations. The key innovation of Seq2KING is applying the principles of lexical connectivity (how words derive meaning through their relationships to other words) to genetic data. Just as “dog” gains meaning through its connections to words like “pet,” “animal,” and “loyal,” we show that individuals’ genomes can be understood through their genetic connections to others.

We start by converting raw genetic data into a compact kinship matrix (using a tool called KING) that summarizes how closely everyone is related. We then feed these kinship values into a transformer model, the same kind of AI behind cutting-edge language tools like ChatGPT. Inside the transformer, special components called “attention heads” learn which individuals are most similar, strengthening links between people from the same region and showing subtler connections across continents. Unlike legacy approaches that rely on discrete pre-defined categories, Seq2KING provides continuous measures of relatedness, allowing us to visualize connections between any individual and all other humans.

Additionally, because Seq2KING operates directly within the transformer’s internal reasoning system, it can be integrated as a component within larger genome interpretation systems, essentially functioning like high-speed cache memory for heritage assignments and improving both efficiency and scalability. By examining these attention patterns, we can reconstruct familiar population groupings, such as European, African, and Asian heritage, entirely from the model’s internal logic. Finally, we use a visualization technique (BERTViz) to turn these dense connection maps into intuitive diagrams that highlight population connections between individuals.

Because our approach doesn’t rely on pre-assigned labels, it offers an unbiased way to explore human population structure. This could help scientists trace the migration routes that resulted in the peopling of the continents, find subtle subgroups within larger populations, and remove “background noise” in genetic studies of disease. Ultimately, Seq2KING paves the way for more precise genetic maps of all humans, revealing the natural “family trees” hidden in our DNA and bringing us one step closer to reading the human genome as a first language.
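
The idea of reading relatedness off attention weights can be illustrated with a small sketch. This is not the authors' implementation: the kinship matrix below is synthetic toy data, the function names `attention_from_kinship` and `within_group_share` are hypothetical, and each individual's kinship row is used directly as both query and key in scaled dot-product attention, with no learned projections. It only shows how attention mass could be compared within and across groups.

```python
import numpy as np

def attention_from_kinship(kinship: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention weights.

    Assumption: each individual's kinship row serves as both
    query and key; a trained transformer would instead apply
    learned projection matrices first.
    """
    d = kinship.shape[1]
    scores = kinship @ kinship.T / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def within_group_share(weights: np.ndarray, groups) -> float:
    """Mean fraction of attention (self excluded) that each
    individual places on members of its own group."""
    groups = np.asarray(groups)
    same = groups[:, None] == groups[None, :]
    off_diag = ~np.eye(len(groups), dtype=bool)
    w = weights * off_diag                      # drop self-attention
    w = w / w.sum(axis=1, keepdims=True)        # renormalise each row
    return float((w * same).sum(axis=1).mean())

# Synthetic toy data: 3 groups of 4 individuals (hypothetical values).
rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], 4)
same_group = groups[:, None] == groups[None, :]
kinship = np.where(same_group, 0.25, 0.0) + rng.normal(0.0, 0.01, same_group.shape)
kinship = (kinship + kinship.T) / 2             # keep the matrix symmetric
np.fill_diagonal(kinship, 0.5)                  # self-kinship

weights = attention_from_kinship(kinship)
share = within_group_share(weights, groups)
print(f"within-group attention share: {share:.3f}")
```

In this toy setup each individual has three same-group peers out of eleven others, so a share above 3/11 indicates that attention concentrates within groups. The group labels are used only to evaluate the weights afterwards, not to compute them.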

openRxiv

Bhavana Jonnalagadda, eMalick Njie

2025
