
DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations

Abstract: While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite reports in the literature that epistatic effects between combinations of variants in different loci (or genes) are important for understanding disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools that help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation by helping to identify the most informative subset of samples to label. Text fragments featuring potential digenic variant combinations, i.e. gene–variant–gene–variant, were extracted from 85 full-text articles that contain the relevant relations from the Oligogenic Diseases Database (OLIDA) and were pre-annotated with PubTator. The resulting text fragments were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene–variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model and underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE datasets relevant for biomedical curation applications.
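The abstract does not state which query strategy was used during annotation. As a rough illustration of how active learning can pick "the most informative" samples, the sketch below uses entropy-based uncertainty sampling, a common strategy chosen here as an assumption. The function names and the probability values are hypothetical and are not taken from ALAMBIC or from the paper.

```python
import math


def binary_entropy(p_positive):
    """Entropy (in bits) of a predicted positive-class probability."""
    if p_positive <= 0.0 or p_positive >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    q = 1.0 - p_positive
    return -(p_positive * math.log2(p_positive) + q * math.log2(q))


def select_most_uncertain(probabilities, batch_size):
    """Return indices of the `batch_size` unlabelled samples with highest entropy."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: binary_entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:batch_size]


# Hypothetical model outputs for five unlabelled text fragments.
pool_probs = [0.02, 0.48, 0.95, 0.61, 0.10]
picked = select_most_uncertain(pool_probs, batch_size=2)
print(picked)
```

In a loop of this kind, the selected fragments would be sent to an annotator, added to the labelled set, and the model retrained before the next selection round.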
DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571
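To make the idea of a 4-ary candidate concrete, the following minimal sketch enumerates every possible combination of two distinct genes and two distinct variants among the entity mentions of one text fragment. It is an assumption about how such candidates could be generated, not the authors' extraction procedure, and the entity names are placeholders.

```python
from itertools import combinations


def candidate_combinations(genes, variants):
    """Enumerate (gene, gene, variant, variant) candidates for one fragment.

    Duplicate mentions are collapsed and names are sorted so that each
    unordered combination appears exactly once.
    """
    gene_pairs = list(combinations(sorted(set(genes)), 2))
    variant_pairs = list(combinations(sorted(set(variants)), 2))
    return [
        (g1, g2, v1, v2)
        for g1, g2 in gene_pairs
        for v1, v2 in variant_pairs
    ]


# Placeholder mentions, as an entity tagger might return them.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_A"]
variants = ["var1", "var2"]
candidates = candidate_combinations(genes, variants)
print(len(candidates), candidates[0])
```

Because the number of candidates grows with the product of gene pairs and variant pairs, a handful of articles can yield a very large pool of fragments to consider, which is where selective annotation becomes useful.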

Oxford University Press (OUP)

Charlotte Nachtegael, Jacopo De Stefani, Anthony Cnudde, Tom Lenaerts

Database

2024

Work presented for the workshop on Text mining services to support scalable curation. By pre-annotating 85 full-text articles containing the releva...
