Assessing text embedding models to assign UniProt classes to scientific literature
Advances in biomedical sciences are increasingly dependent on knowledge encoded in curated biomedical databases. In particular, the Universal Protein Resource (UniProt) provides the scientific community with a comprehensive and accurately annotated protein sequence knowledgebase. The organization of UniProt protein entries and their associated publications into topics, such as expression, function, and interaction, helps users find the information of interest in the knowledgebase. The current topic classification approach for the computationally mapped bibliography, based solely on the underlying sources, is limited. We investigate the use of (semi-)automated classifiers to help UniProt classify the scientific biomedical literature according to 11 topics. Publications annotated by UniProt curators are labeled with one or more of these classes; as such, this is a multi-class, multi-label classification problem. Our algorithm works as follows: given a text passage, such as an article abstract, it provides a ranked list of classes together with the probabilities that the passage belongs to each of them. We use a text embedding model, Doc2Vec, to compute document similarity and compare several machine learning methods, including Naïve Bayes, kNN, logistic regression, and a multilayer perceptron (MLP), for assigning class probabilities. We use a collection of 100,000 documents classified by the UniProt team to train (99,000) and test (1,000) our algorithms; classifier parameters were validated on 5% of the training collection. We compare the text embedding models with a baseline bag-of-words model in which divergence from randomness is used to compute document similarity and kNN to assign classes. In general, our algorithms achieved high classification precision. The baseline model achieved a mean average precision (MAP) of 0.8270. Apart from the Naïve Bayes model (MAP 0.7163), all models based on the text embedding approach outperformed the baseline: logistic regression achieved a MAP of 0.8376 (p = .32), MLP a MAP of 0.8413 (p = .18), and kNN a MAP of 0.8485 (p = .04). We believe such classifiers could help improve productivity and reduce the cost of certain biocuration tasks. The next steps will be to apply the methodology to unclassified documents and to assess its effectiveness in assisting curators to judge the relevance of articles for further curation.
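The kNN ranking step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the toy three-dimensional vectors stand in for Doc2Vec embeddings, and scoring each class by the summed similarity of the k nearest labeled documents is one plausible way to turn document similarities into a ranked list of class probabilities.

```python
import math
from collections import defaultdict

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_classes(query_vec, train_vecs, train_labels, k=3):
    """Score each class by the summed similarity of the k nearest
    labeled documents, then normalize scores to pseudo-probabilities."""
    neighbours = sorted(
        ((cosine(query_vec, v), labels)
         for v, labels in zip(train_vecs, train_labels)),
        key=lambda t: t[0], reverse=True,
    )[:k]
    scores = defaultdict(float)
    for sim, labels in neighbours:
        for label in labels:  # multi-label: a document votes for each of its classes
            scores[label] += sim
    total = sum(scores.values()) or 1.0
    ranked = sorted(scores.items(), key=lambda t: t[1], reverse=True)
    return [(label, s / total) for label, s in ranked]

# Toy 3-d "embeddings" standing in for inferred Doc2Vec vectors.
train_vecs = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
train_labels = [{"function"}, {"function", "expression"}, {"interaction"}]
print(rank_classes([1.0, 0.05, 0.05], train_vecs, train_labels, k=2))
```

In a full pipeline, the query and training vectors would come from a trained Doc2Vec model (e.g. gensim's `infer_vector`) rather than being hand-written.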
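Mean average precision (MAP), the metric reported above, can be computed for this multi-label setting as below. The rankings and curator labels here are hypothetical examples, not data from the study; the sketch only illustrates how per-document average precision over a ranked class list is averaged across documents.

```python
def average_precision(ranked_classes, relevant):
    """Average precision of one ranked class list against the set of true classes."""
    hits, precisions = 0, []
    for i, cls in enumerate(ranked_classes, start=1):
        if cls in relevant:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant rank
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, truths):
    # MAP: mean of per-document average precision.
    return sum(average_precision(r, t)
               for r, t in zip(rankings, truths)) / len(rankings)

# Two hypothetical test documents: predicted class rankings vs. curator labels.
rankings = [["function", "expression", "interaction"],
            ["interaction", "function", "expression"]]
truths = [{"function"}, {"function", "expression"}]
print(round(mean_average_precision(rankings, truths), 4))  # -> 0.7917
```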

