
DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations

Abstract: While biomedical relation extraction (bioRE) datasets have been instrumental in developing methods that support the biocuration of single variants from texts, no datasets are currently available for extracting digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared towards training tools that help curate the scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation by helping to identify the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene–variant–gene–variant, were extracted. The resulting text fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset of 8442 fragments, 794 of them positive instances, covering 95% of the original annotated articles. When applied to gene–variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model and underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
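The abstract does not say which query strategy was used to pick "the most informative" samples. As an illustration only, the sketch below shows one common AL strategy, uncertainty sampling for a binary classifier: the unlabelled fragments whose predicted positive-class probability lies closest to 0.5 are sent to the annotator first. The function name `select_most_uncertain` and the example scores are hypothetical and are not taken from ALAMBIC or the DUVEL paper.

```python
from typing import List, Sequence


def select_most_uncertain(probabilities: Sequence[float], batch_size: int) -> List[int]:
    """Return indices of the samples the model is least sure about.

    `probabilities` holds the predicted probability of the positive class
    for each unlabelled sample. Uncertainty is measured as the distance
    to 0.5 (assumption: a binary classifier; this is a generic strategy,
    not necessarily the one used to build DUVEL).
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical model scores for five unlabelled text fragments.
scores = [0.97, 0.52, 0.10, 0.45, 0.80]
to_annotate = select_most_uncertain(scores, batch_size=2)
print(to_annotate)  # [1, 3]
```

In an AL loop, the selected fragments would be labelled by an expert, added to the training set, and the model retrained before the next selection round.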
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571
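To make the idea of a 4-ary candidate concrete, the sketch below enumerates every combination of two distinct genes and two distinct variants from the entities recognised in one text fragment. It is a minimal illustration under stated assumptions: the function `candidate_combinations`, the placeholder entity names, and the choice to leave variants unassigned to specific genes are all hypothetical and do not reproduce the extraction pipeline used for DUVEL.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

Candidate = Tuple[str, str, str, str]  # (gene_1, gene_2, variant_1, variant_2)


def candidate_combinations(genes: Iterable[str], variants: Iterable[str]) -> List[Candidate]:
    """Enumerate candidate 4-ary relations for one text fragment.

    Each candidate pairs two distinct genes with two distinct variants.
    Duplicated mentions are collapsed, and entities are sorted so the
    output order is deterministic. Assumption: no attempt is made to
    decide which variant belongs to which gene.
    """
    unique_genes = sorted(set(genes))
    unique_variants = sorted(set(variants))
    return [
        (g1, g2, v1, v2)
        for g1, g2 in combinations(unique_genes, 2)
        for v1, v2 in combinations(unique_variants, 2)
    ]


# Placeholder entity mentions, standing in for tagger output.
fragment_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_A"]
fragment_variants = ["var1", "var2"]

candidates = candidate_combinations(fragment_genes, fragment_variants)
print(len(candidates))  # 3
print(candidates[0])    # ('GENE_A', 'GENE_B', 'var1', 'var2')
```

The number of candidates grows as the product of two binomial coefficients, which gives some intuition for why the pool of fragments to consider can become very large relative to the number of positive instances.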

Related Results

Pre-trained language models to curate oligogenic data
Work presented for the workshop on Text mining services to support scalable curation. By pre-annotating 85 full-text articles containing the releva...
Oligogenic combinations of rare variants influence specific phenotypes in complex disorders
ABSTRACT Genetic studies of complex disorders such as autism and intellectual disability (ID) are often based on enrichment of individual rare va...
Žanrovska analiza pomorskopravnih tekstova i ostvarenje prijevodnih univerzalija u njihovim prijevodima s engleskoga jezika
Genre implies formal and stylistic conventions of a particular text type, which inevitably affects the translation process. This „force of genre bias“ (Prieto Ramos, 2014) has been...
Updating and extending the concept annotations of the CRAFT corpus
With the ever-rising amount of biomedical literature, it is increasingly difficult for scientists to keep up with the published work in their fields of research, much less related ...
CREATING LEARNING MEDIA IN TEACHING ENGLISH AT SMP MUHAMMADIYAH 2 PAGELARAN ACADEMIC YEAR 2020/2021
The pandemic Covid-19 currently demands teachers to be able to use technology in teaching and learning process. But in reality there are still many teachers who have not been able ...
Concept-based and relation-based corpus navigation : applications of natural language processing in digital humanities
Navigation en corpus fondée sur les concepts et les relations : applications du traitement automatique des langues aux humanités numériques La recherche en Sciences...
