
A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi

This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian languages, Bengali and Hindi, using the Conditional Random Field (CRF) framework. The system uses several types of contextual information together with a variety of features that help predict the named entity (NE) classes. The feature set includes both language-independent and language-dependent components. We used annotated corpora of 122,467 tokens for Bengali and 502,974 tokens for Hindi, tagged with a set of twelve NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). We considered only the tags that denote person names, location names, organization names, number expressions, time expressions, and measurement expressions. A number of experiments were carried out to identify the most suitable features for NER in Bengali and Hindi. The system was tested on gold standard test sets of 35K tokens for Bengali and 50K tokens for Hindi. Evaluation on the test sets yields overall F-score values of 81.15% for Bengali and 78.29% for Hindi. Ten-fold cross-validation yields F-score values of 83.89% for Bengali and 80.93% for Hindi. An ANOVA test shows that the performance improvement due to the language-dependent features is statistically significant.
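The abstract mentions contextual information and per-token features without listing them. As a rough illustration only, the Python sketch below builds one feature dictionary per token, the kind of input many CRF toolkits accept. The specific features (affixes, a digit check, a two-token context window) and the names `token_features` and `sentence_features` are assumptions made for this example, not the feature set reported in the paper.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i].

    All features here are illustrative assumptions, not the paper's features.
    Affixes are sliced by Unicode code point, which is only a crude
    approximation for Bengali or Devanagari text.
    """
    word = tokens[i]
    feats: Dict[str, object] = {
        "word": word,
        "prefix3": word[:3],   # first three code points
        "suffix3": word[-3:],  # last three code points
        "has_digit": any(ch.isdigit() for ch in word),
        "length": len(word),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Contextual information: surrounding words within the window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j] if inside else "<PAD>"
    return feats


def sentence_features(tokens: List[str], window: int = 2) -> List[Dict[str, object]]:
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]


if __name__ == "__main__":
    # Transliterated Hindi tokens, used purely as a toy example.
    sentence = ["Ram", "Dilli", "mein", "rehta", "hai"]
    for feats in sentence_features(sentence):
        print(feats)
```

A language-dependent component could be added in the same way, for example a boolean feature that checks the token against a gazetteer of known location names; that extension is likewise hypothetical here.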

Related Results

Literary History of Bengal, 8th-19th century AD
The literary history of Bengal is characterized by a multilingual ecology that nurtured the development of Middle Bengali literature. It is around the turn of the second millennium...
Dynamics of Mutations in Patients with ET Treated with Imetelstat
Abstract Background: Imetelstat, a first in class specific telomerase inhibitor, induced hematologic responses in all patients (pts) with essential thrombocythemia (...
Rabindranath Tagore
Rabindranath Tagore (in Bengali: Rabīndranāth Ṭhākur; b. 1861–d. 1941) was born in Calcutta, the capital of British India at the time. From the late 18th century onward, his extrem...
Global Connections of Raw Silk Production in 18th and 19th-Century Bengal
Bengal was a major global raw silk market player between the early 17th and mid-19th centuries. During this period, Bengal supplied both Asian and European markets with raw silk. T...
MLNet: a multi-level multimodal named entity recognition architecture
In the field of human–computer interaction, accurate identification of talking objects can help robots to accomplish subsequent tasks such as decision-making or recommendation; the...
The Making of Modern Hindi
The Making of Modern Hindi examines the politics and processes of making Hindi modern at a formative moment in India’s history, when British imperialism was at its peak and anti-co...
Unsupervised entity linking using graph-based semantic similarity
Nowadays, the human textual data constitutes a great proportion of the shared information resources such as World Wide Web (WWW). Social networks, news and learning resources as we...
