Pre-trained language models to curate oligogenic data
Work presented at the workshop on Text mining services to support scalable curation.
Eighty-five full-text articles containing the relevant oligogenic relations from the Oligogenic Diseases Database (OLIDA) were pre-annotated with PubTator, and text fragments featuring potential digenic variant combinations, i.e. gene–variant–gene–variant, were extracted. These fragments were then annotated with ALAMBIC, an active learning (AL)-based annotation platform. The resulting dataset, called DUVEL, was used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, yielding a final dataset of 8442 fragments, 794 of which are positive instances, covering 95% of the originally annotated articles. When applied to gene–variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, a significant improvement over the non-fine-tuned model that underlines the relevance of the DUVEL dataset. DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is freely available for research on GitHub and Hugging Face.
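To make the 4-ary relation and the reported metric more concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes a fragment has already been tagged with gene and variant mentions (as a tool such as PubTator would do) and enumerates every gene–variant–gene–variant candidate that could be sent for annotation. The `Mention` type, the function names and the placeholder entity names are hypothetical. The sketch also computes F1 from raw counts and the share of positive instances implied by the figures above (794 of 8442).

```python
from itertools import combinations, permutations
from typing import List, NamedTuple, Tuple


class Mention(NamedTuple):
    """Hypothetical container for one tagged entity in a text fragment."""
    kind: str  # "gene" or "variant"
    text: str


def candidate_combinations(mentions: List[Mention]) -> List[Tuple[str, str, str, str]]:
    """Enumerate (gene, variant, gene, variant) candidates in one fragment.

    Assumption: which variant belongs to which gene is unknown, so every
    assignment of two distinct variants to an unordered pair of distinct
    genes is kept as a separate candidate.
    """
    genes = sorted({m.text for m in mentions if m.kind == "gene"})
    variants = sorted({m.text for m in mentions if m.kind == "variant"})
    return [
        (g1, v1, g2, v2)
        for g1, g2 in combinations(genes, 2)
        for v1, v2 in permutations(variants, 2)
    ]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Share of positive instances in DUVEL, from the figures in the abstract.
positive_rate = 794 / 8442

# Placeholder entity names, for illustration only.
example = [
    Mention("gene", "GENE_A"),
    Mention("gene", "GENE_B"),
    Mention("variant", "VAR_1"),
    Mention("variant", "VAR_2"),
]
candidates = candidate_combinations(example)
```

With fewer than one in ten fragments positive, a classifier that always predicts "negative" would be highly accurate yet useless, which is one reason a metric such as F1 on the positive class is a natural choice here.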

