Calculating Semantic Frequency of GSL Words Using a BERT Model in Large Corpora
There has always been a pressing need to provide semantic information for words in high-frequency word lists, but technical limitations have hindered this goal. This study addresses the challenge by leveraging a large language model, BERT, to semantically annotate large corpora and identify the high-frequency senses of headwords from the General Service List (GSL). We explore three key questions: (1) Can BERT automatically annotate large corpora and accurately calculate sense frequencies? (2) What are the high-frequency senses of GSL words? (3) Can this approach be verified? Using a BERT-based framework, we annotated 1,891 GSL headwords (10,925 senses) in the 100-million-word British National Corpus (BNC), representing each sense with a 1,024-dimensional vector. From this, we identified 3,695 high-frequency senses for the GSL words. Three main conclusions are drawn from this study. First, BERT demonstrates high accuracy in sense annotation, achieving 92% precision when disambiguating the senses of GSL words. Second, a relatively small number of high-frequency senses accounts for a large portion of corpus coverage: these high-frequency senses (33.8% of the total) cover approximately 60% of all GSL word occurrences in the BNC. Third, the high-frequency senses selected via this method can be verified by their consistent coverage across different corpora. This study illustrates a pioneering method for semantic annotation in large corpora, one that can readily be applied to calculate semantic frequencies for other word lists.
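The core of the procedure described above can be sketched in a few lines: each sense of a headword is represented by a vector (the study uses 1,024-dimensional BERT embeddings; tiny made-up vectors stand in here), each corpus occurrence of the word is embedded the same way, and the occurrence is assigned to the sense with the highest cosine similarity. Sense frequencies and coverage then follow from simple counting. All names and data below are illustrative assumptions, not the authors' code or data.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_sense(occurrence_vec, sense_vecs):
    """Label an occurrence with the sense whose vector is most similar."""
    return max(sense_vecs, key=lambda s: cosine(occurrence_vec, sense_vecs[s]))

def sense_frequencies(occurrence_vecs, sense_vecs):
    """Count how often each sense is the nearest neighbour of an occurrence."""
    return Counter(assign_sense(v, sense_vecs) for v in occurrence_vecs)

# Toy sense inventory for the headword "bank" (vectors are invented).
senses = {
    "bank/finance": [1.0, 0.1, 0.0],
    "bank/river":   [0.0, 0.2, 1.0],
}

# Toy contextual embeddings for five corpus occurrences of "bank".
occurrences = [
    [0.9, 0.2, 0.1],  # financial context
    [0.8, 0.0, 0.2],  # financial context
    [0.1, 0.3, 0.9],  # river context
    [1.0, 0.1, 0.1],  # financial context
    [0.2, 0.1, 0.8],  # river context
]

freqs = sense_frequencies(occurrences, senses)
coverage = freqs.most_common(1)[0][1] / len(occurrences)
print(freqs)               # per-sense frequency counts
print(f"{coverage:.0%}")   # prints "60%": share covered by the top sense
```

In the study itself, the same counting step is what yields the headline result: ranking senses by frequency and measuring what share of all GSL occurrences the top senses cover.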