Pre-trained language models to curate oligogenic data

Work presented at the workshop on Text mining services to support scalable curation. Eighty-five full-text articles containing the relevant oligogenic relations from the Oligogenic Diseases Database (OLIDA) were pre-annotated with PubTator, and text fragments featuring potential digenic variant combinations, i.e. gene–variant–gene–variant, were extracted. These fragments were then annotated with ALAMBIC, an active learning (AL)-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, yielding a dataset of 8442 fragments, 794 of them positive instances, covering 95% of the originally annotated articles. When applied to gene–variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, a significant improvement over the non-fine-tuned model that underlines the relevance of the DUVEL dataset. DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is freely available for research on GitHub and Hugging Face.

F1000 Research Ltd

Charlotte Nachtegael, Jacopo De Stefani, Anthony Cnudde, Tom Lenaerts

2025
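
The candidate-extraction step described in the abstract, in which fragments featuring a potential gene–variant–gene–variant combination are pulled from pre-annotated text, can be illustrated with a minimal sketch. It is not the authors' pipeline and does not use the PubTator API: the `Mention` class, the `candidate_combinations` function, and the gene and variant strings are all hypothetical, and the sketch assumes entity mentions have already been extracted for one fragment.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """Hypothetical entity mention in one text fragment (not a PubTator type)."""
    kind: str  # "gene" or "variant"
    text: str


def candidate_combinations(mentions):
    """Enumerate candidate 4-ary combinations: two distinct genes, two distinct variants.

    Assumption: every pair of genes is combined with every pair of variants;
    deciding which candidates are true digenic combinations is left to the
    annotation and classification steps.
    """
    genes = sorted({m.text for m in mentions if m.kind == "gene"})
    variants = sorted({m.text for m in mentions if m.kind == "variant"})
    return [
        (g1, g2, v1, v2)
        for g1, g2 in combinations(genes, 2)
        for v1, v2 in combinations(variants, 2)
    ]


# Made-up fragment annotations, for illustration only.
fragment = [
    Mention("gene", "GENE_A"),
    Mention("gene", "GENE_B"),
    Mention("gene", "GENE_C"),
    Mention("variant", "c.100A>G"),
    Mention("variant", "c.200C>T"),
]
candidates = candidate_combinations(fragment)
```

With three genes and two variants, the sketch yields three candidates, one per gene pair. A fragment with fewer than two genes or fewer than two variants yields none, which is one way such fragments could be filtered out before annotation.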
