Assessing text embedding models to assign UniProt classes to scientific literature
Advances in biomedical sciences are increasingly dependent on knowledge encoded in curated biomedical databases. In particular, the Universal Protein Resource (UniProt) provides the scientific community with a comprehensive and accurately annotated protein sequence knowledgebase. The organization of UniProt protein entries and their associated publications into topics, such as expression, function, and interaction, helps users find the information of interest in the knowledgebase. The current topic classification approach for the computationally mapped bibliography, based solely on the underlying sources, is limited. We investigate the use of (semi-)automated classifiers to help UniProt classify the scientific biomedical literature according to 11 topics. Publications annotated by UniProt curators are labeled with one or more of these classes; as such, this is a multi-class, multi-label classification problem. Our algorithm works as follows: given a text passage, such as an article abstract, it provides a ranked list of classes together with the probabilities that the passage belongs to each of them. We use a text embedding model, Doc2Vec, to compute document similarity and compare several machine learning methods, including Naïve Bayes, kNN, logistic regression, and a multilayer perceptron (MLP), for assigning class probabilities. We use a collection of 100,000 documents classified by the UniProt team to train (99,000) and test (1,000) our algorithms; classifier parameters were validated on 5% of the training collection. We compare the text embedding models with a baseline bag-of-words model in which divergence from randomness is used to compute document similarity and kNN to assign classes. In general, our algorithms achieved high classification precision. The baseline model achieved a mean average precision (MAP) of 0.8270. Apart from the Naïve Bayes model (MAP 0.7163), all models based on the text embedding approach outperformed the baseline: logistic regression achieved a MAP of 0.8376 (p = .32), MLP a MAP of 0.8413 (p = .18), and kNN a MAP of 0.8485 (p = .04). We believe such classifiers could help improve productivity and reduce the cost of certain biocuration tasks. The next steps will be to apply the methodology to unclassified documents and to assess its effectiveness in assisting curators to judge the relevance of articles for further curation.
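The kNN ranking step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the toy three-dimensional vectors stand in for Doc2Vec embeddings, and scoring each class by the summed similarity of the k nearest labeled documents is one plausible way to turn document similarities into a ranked list of class probabilities.

```python
import math
from collections import defaultdict

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_classes(query_vec, train_vecs, train_labels, k=3):
    """Score each class by the summed similarity of the k nearest
    labeled documents, then normalize scores to pseudo-probabilities."""
    neighbours = sorted(
        ((cosine(query_vec, v), labels)
         for v, labels in zip(train_vecs, train_labels)),
        key=lambda t: t[0], reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, labels in neighbours:
        for label in labels:  # multi-label: a document votes for each of its classes
            scores[label] += sim
    total = sum(scores.values()) or 1.0
    ranked = sorted(scores.items(), key=lambda t: t[1], reverse=True)
    return [(label, s / total) for label, s in ranked]

# Toy 3-d "embeddings" standing in for inferred Doc2Vec vectors.
train_vecs = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
train_labels = [{"function"}, {"function", "expression"}, {"interaction"}]
print(rank_classes([1.0, 0.05, 0.05], train_vecs, train_labels, k=2))
```

In a full pipeline, the query and training vectors would come from a trained Doc2Vec model (e.g. gensim's `infer_vector`) rather than being hand-written.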
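Mean average precision (MAP), the metric reported above, can be computed for this multi-label setting as below. The rankings and curator labels here are hypothetical examples, not data from the study; the sketch only illustrates how per-document average precision over a ranked class list is averaged across documents.

```python
def average_precision(ranked_classes, relevant):
    """Average precision of one ranked class list against the set of true classes."""
    hits, precisions = 0, []
    for i, cls in enumerate(ranked_classes, start=1):
        if cls in relevant:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant rank
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, truths):
    # MAP: mean of per-document average precision.
    return sum(average_precision(r, t)
               for r, t in zip(rankings, truths)) / len(rankings)

# Two hypothetical test documents: predicted class rankings vs. curator labels.
rankings = [["function", "expression", "interaction"],
            ["interaction", "function", "expression"]]
truths = [{"function"}, {"function", "expression"}]
print(round(mean_average_precision(rankings, truths), 4))  # -> 0.7917
```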

