A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi

This paper describes the development of Named Entity Recognition (NER) systems for two leading Indian languages, Bengali and Hindi, using the Conditional Random Field (CRF) framework. The system uses several types of contextual information together with a variety of features that help predict the different named entity (NE) classes. The feature set includes both language-independent and language-dependent components. We used annotated corpora of 122,467 tokens for Bengali and 502,974 tokens for Hindi, tagged with a tag set of twelve NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). We considered only the tags that denote person names, location names, organization names, number expressions, time expressions and measurement expressions. A number of experiments were carried out to identify the most suitable features for NER in Bengali and Hindi. The system was tested on gold standard test sets of 35K tokens for Bengali and 50K tokens for Hindi. Evaluation on the test sets yields overall F-score values of 81.15% for Bengali and 78.29% for Hindi. 10-fold cross-validation tests yield F-score values of 83.89% for Bengali and 80.93% for Hindi. An analysis of variance (ANOVA) is performed to show that the performance improvement due to the use of language-dependent features is statistically significant.
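
The abstract does not list the individual features, so the following is only a minimal sketch of what language-independent contextual features for a CRF tagger might look like. The window size, affix length, feature names and the `<PAD>` marker are all illustrative assumptions, not the authors' actual feature set.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int,
                   window: int = 2, affix_len: int = 3) -> Dict[str, object]:
    """Build a feature dict for tokens[i] (hypothetical feature set).

    Uses only surface information, so it is language independent:
    the word itself, fixed-length prefix and suffix, a digit flag,
    and the neighbouring words inside a symmetric context window.
    """
    word = tokens[i]
    feats: Dict[str, object] = {
        "word": word,
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        "has_digit": any(ch.isdigit() for ch in word),
        "is_first": i == 0,
    }
    # Contextual information: words at offsets -window..+window, excluding 0.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        # Positions outside the sentence get a padding marker.
        feats[f"word[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Language-dependent components, such as gazetteer lookups or part-of-speech information, would be added as further keys in the same dictionary. A per-token list of dictionaries like this is a common input shape for CRF sequence labellers, though the toolkit used in the paper is not stated in the abstract.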

University of Colorado at Boulder

Asif Ekbal Sivaji Bandyopadhyay

Linguistic Issues in Language Technology

2021


