Updating and extending the concept annotations of the CRAFT corpus
With the ever-rising amount of biomedical literature, it is increasingly difficult for scientists to keep up with the published work in their fields of research, much less related ones. The use of natural language processing (NLP) tools can make the literature more accessible by aiding concept recognition and information extraction. As NLP-based approaches have been increasingly used for biocuration, so too have biomedical ontologies, whose use enables semantic integration across disparate curated resources, and millions of biomedical entities have been annotated with them. Particularly important are the Open Biomedical Ontologies (OBOs), a set of open, orthogonal, interoperable ontologies formally representing knowledge over a wide range of biology, medicine, and related disciplines.
Manually annotated document corpora have become critical gold-standard resources for the training and testing of biomedical NLP systems. This was the motivation for the creation of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access journal articles from the biomedical literature. Within these articles, each mention of the concepts explicitly represented in eight prominent OBOs has been annotated, resulting in gold-standard markup of genes and gene products, chemicals and molecular entities, biomacromolecular sequence features, cells and cellular and extracellular components and locations, organisms, biological processes and molecular functionalities. With these ~100,000 concept annotations among the ~800,000 words in the 67 articles of the 1.0 release, it is one of the largest gold-standard biomedical semantically annotated corpora. In addition to this substantial conceptual markup, the corpus is fully annotated along a number of syntactic and other axes, notably by sentence segmentation, tokenization, part-of-speech tagging, syntactic parsing, text formatting, and document sectioning.
In the several years since its initial release, the CRAFT Corpus has supported efforts within our group and in collaboration with others, including the first comprehensive gold-standard evaluation of current prominent concept-recognition systems, and it has already been used in multiple external projects to drive development of higher-performing systems. Here we present our continuing work on the corpus along several fronts. First, to keep the corpus relevant, we are updating the concept annotations using newer versions of the ontologies already used to mark up the articles: removing annotations of obsoleted classes, editing previous annotations, and creating new annotations of newly added classes. Second, to extend the domain of annotated concept types, we are marking up mentions of concepts using the Molecular Process Ontology (for types of chemical processes) and the Uberon Anatomy Ontology (for anatomical components and life-cycle stages). Finally, to capture even more content, we are generating new annotations for roots of prefixed/suffixed words as well as annotations made with extension classes we have created. We will present updated annotation counts and interannotator agreement statistics for these continuing efforts, as well as future plans. All of this work is designed to further increase the potential of the CRAFT Corpus to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems.
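The first update step described above (dropping annotations of obsoleted classes and identifying newly added classes that may need new annotations) can be illustrated with a minimal Python sketch. This is not the CRAFT project's actual tooling: the `Annotation` record, the function names, and the `EX:` class identifiers are all hypothetical, and the sets of current and obsoleted class IDs are assumed to have been extracted from the old and new ontology versions beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record: one concept mention in one article."""
    doc_id: str
    start: int      # character offset where the mention begins
    end: int        # character offset where the mention ends
    class_id: str   # ontology class the mention is annotated with


def partition_annotations(annotations, current_classes, obsoleted_classes):
    """Split annotations into (kept, removed, needs_review).

    removed: class is marked obsolete in the newer ontology version.
    kept: class still exists and is not obsolete.
    needs_review: class is in neither set, so a curator should look at it.
    """
    kept, removed, needs_review = [], [], []
    for ann in annotations:
        if ann.class_id in obsoleted_classes:
            removed.append(ann)
        elif ann.class_id in current_classes:
            kept.append(ann)
        else:
            needs_review.append(ann)
    return kept, removed, needs_review


def newly_added_classes(old_classes, new_classes):
    """Classes present only in the newer version: candidates for new annotation."""
    return set(new_classes) - set(old_classes)


# Made-up identifiers, for illustration only.
old_version = {"EX:0000001", "EX:0000002"}
new_version = {"EX:0000001", "EX:0000002", "EX:0000003"}
obsoleted = {"EX:0000002"}
current = new_version - obsoleted

annotations = [
    Annotation("doc-a", 10, 14, "EX:0000001"),
    Annotation("doc-a", 30, 41, "EX:0000002"),
    Annotation("doc-b", 5, 9, "EX:0000099"),
]

kept, removed, needs_review = partition_annotations(annotations, current, obsoleted)
added = newly_added_classes(old_version, new_version)
```

Only the removal step is automated here. Editing existing annotations and creating annotations for newly added classes require a curator to read the text, so the sketch merely lists the candidates.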

