Calculating Semantic Frequency of GSL Words Using a BERT Model in Large Corpora
There has always been a pressing need to provide semantic information for words in high-frequency word lists, but technical limitations have hindered this goal. This study addresses the challenge by leveraging a large language model, BERT, to semantically annotate large corpora and identify the high-frequency senses of headwords from the General Service List (GSL). We explore three key questions: (1) Can BERT automatically annotate large corpora and accurately calculate sense frequencies? (2) What are the high-frequency senses of GSL words? (3) Can this approach be verified? Using a BERT-based framework, we annotated 1,891 GSL headwords (10,925 senses) in the 100-million-word British National Corpus (BNC), representing each sense with a 1,024-dimensional vector. From this, we identified 3,695 high-frequency senses for the GSL words. Three main conclusions are drawn from this study. First, BERT demonstrates high accuracy in sense annotation, achieving 92% precision when disambiguating the senses of GSL words. Second, a relatively small number of high-frequency senses accounts for a large portion of corpus coverage: these high-frequency senses (33.8% of the total) cover approximately 60% of all GSL word occurrences in the BNC. Third, the high-frequency senses selected via this method can be verified by their consistent coverage across different corpora. This study illustrates a pioneering method for semantic annotation in large corpora, one that can readily be applied to calculate semantic frequencies for other word lists.
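The core of the procedure described above can be sketched in a few lines: each sense of a headword is represented by a vector (the study uses 1,024-dimensional BERT embeddings; tiny made-up vectors stand in here), each corpus occurrence of the word is embedded the same way, and the occurrence is assigned to the sense with the highest cosine similarity. Sense frequencies and coverage then follow from simple counting. All names and data below are illustrative assumptions, not the authors' code or data.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def assign_sense(occurrence_vec, sense_vecs):
    """Label an occurrence with the sense whose vector is most similar."""
    return max(sense_vecs, key=lambda s: cosine(occurrence_vec, sense_vecs[s]))

def sense_frequencies(occurrence_vecs, sense_vecs):
    """Count how often each sense is the nearest neighbour of an occurrence."""
    return Counter(assign_sense(v, sense_vecs) for v in occurrence_vecs)

# Toy sense inventory for the headword "bank" (vectors are invented).
senses = {
    "bank/finance": [1.0, 0.1, 0.0],
    "bank/river":   [0.0, 0.2, 1.0],
}

# Toy contextual embeddings for five corpus occurrences of "bank".
occurrences = [
    [0.9, 0.2, 0.1],  # financial context
    [0.8, 0.0, 0.2],  # financial context
    [0.1, 0.3, 0.9],  # river context
    [1.0, 0.1, 0.1],  # financial context
    [0.2, 0.1, 0.8],  # river context
]

freqs = sense_frequencies(occurrences, senses)
coverage = freqs.most_common(1)[0][1] / len(occurrences)
print(freqs)               # per-sense frequency counts
print(f"{coverage:.0%}")   # prints "60%": share covered by the top sense
```

In the study itself, the same counting step is what yields the headline result: ranking senses by frequency and measuring what share of all GSL occurrences the top senses cover.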