Performance evaluation of text augmentation methods with BERT on imbalanced datasets

[EMBARGOED UNTIL 6/1/2023] Recently, deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text datasets are often small and class-imbalanced because human annotation is costly, which limits the performance of deep learning classifiers. This study therefore examines the effectiveness of Word2Vec and WordNet augmentation methods with BERT fine-tuning on datasets of various sizes (e.g., 500, 1,000, and 5,000 training documents) and imbalance ratios (e.g., 4:1 and 9:1). It compares them with other methods for imbalanced data, including boosting, SMOTE, and simple oversampling, combined with widely used machine learning models, including logistic regression, fully connected neural networks, and LSTM. Experimental results show that Word2Vec augmentation improves the performance of BERT in detecting the minority class. The improvement is most significant (a 9 to 30 percent recall increase over the base model and an 11 to 12 percent recall increase over the model with the oversampling method) when the dataset is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). As the data size increases or the imbalance ratio decreases, the improvement from Word2Vec augmentation becomes smaller or insignificant. Moreover, Word2Vec augmentation plus BERT achieves the best performance among the models and methods compared, making it a promising solution for small, highly imbalanced text classification tasks.
University of Missouri Libraries
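
The augmentation idea described in the abstract can be illustrated with a minimal sketch. This is not the study's implementation: the `NEIGHBORS` table is a hypothetical stand-in for Word2Vec nearest neighbors or WordNet synonyms, the function names and the replacement probability `p` are assumptions made for illustration, and only two classes are handled. The sketch generates word-substituted copies of minority-class documents until a toy 9:1 dataset is balanced.

```python
import random

# Hypothetical neighbor table standing in for Word2Vec nearest neighbors
# or WordNet synonyms; a real pipeline would query a trained model or lexicon.
NEIGHBORS = {
    "late": ["delayed", "tardy"],
    "broken": ["damaged", "faulty"],
    "refund": ["reimbursement"],
    "package": ["parcel", "shipment"],
}

def augment(text, rng, p=0.5):
    """Replace each word that has known neighbors with probability p."""
    out = []
    for word in text.split():
        options = NEIGHBORS.get(word)
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def balance_with_augmentation(docs, labels, minority, seed=0):
    """Append augmented minority documents until the two classes are equal in size."""
    rng = random.Random(seed)
    minority_docs = [d for d, y in zip(docs, labels) if y == minority]
    if not minority_docs:
        raise ValueError("no minority-class documents to augment")
    majority_count = len(docs) - len(minority_docs)
    deficit = majority_count - len(minority_docs)
    new_docs, new_labels = list(docs), list(labels)
    for i in range(max(deficit, 0)):
        # Cycle through the minority documents as augmentation sources.
        source = minority_docs[i % len(minority_docs)]
        new_docs.append(augment(source, rng))
        new_labels.append(minority)
    return new_docs, new_labels

# Toy 9:1 dataset: nine majority documents, one minority document.
docs = ["order arrived fine"] * 9 + ["package late and broken want refund"]
labels = [0] * 9 + [1]
aug_docs, aug_labels = balance_with_augmentation(docs, labels, minority=1)
```

Unlike simple oversampling, which would append exact copies of the minority document, each added document here may differ in wording. The balanced set could then be passed to any classifier, including a fine-tuned BERT model.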

Related Results

Sleep Habits and Occurrence of Lowback Pain among Craftsmen
Enhancing Non-Formal Learning Certificate Classification with Text Augmentation: A Comparison of Character, Token, and Semantic Approaches
Aim/Purpose: The purpose of this paper is to address the gap in the recognition of prior learning (RPL) by automating the classification of non-formal learning certificates using d...
Bounds on the sum of broadcast domination number and strong metric dimension of graphs
Let [Formula: see text] be a connected graph of order at least two with vertex set [Formula: see text]. For [Formula: see text], let [Formula: see text] denote the length of an [Fo...
ANALYSIS OF READING MATERIALS IN TEXTBOOK FOR GRADE XI SENIOR HIGH SCHOOL
This study aims to find out the GI and LD level, the text which has the highest GI and LD and what make the text has the highest GI and LD of Advanced Learning English 2 textbook. ...
A Pre-Training Technique to Localize Medical BERT and to Enhance Biomedical BERT
Abstract Background: Pre-training large-scale neural language models on raw texts has been shown to make a significant contribution to a strategy for transfer learning in n...
Non-Recommended Publishing Lists: Strategies for Detecting Deceitful Journals
Abstract The rapid growth of open access publishing (OAP) has significantly improved the accessibility and dissemination of scientific knowledge. However, this expansion has also c...
