Performance evaluation of text augmentation methods with BERT on imbalanced datasets

[EMBARGOED UNTIL 6/1/2023] Recently, deep learning methods have achieved great success in understanding and analyzing text messages. In real-world applications, however, labeled text datasets are often small and class-imbalanced because human annotation is costly, which limits the performance of deep learning classifiers. This study therefore examines the effectiveness of Word2Vec and WordNet augmentation methods with BERT fine-tuning on datasets of various sizes (e.g., 500, 1,000, and 5,000 training documents) and imbalance ratios (e.g., 4:1 and 9:1). It compares them with other methods for imbalanced data, including boosting, SMOTE, and simple oversampling, combined with widely used machine learning models, including logistic regression, a fully connected neural network, and LSTM. Experimental results show that Word2Vec augmentation improves the performance of BERT in detecting the minority class, and the improvement is most significant (a 9 to 30 percent recall increase compared to the base model and an 11 to 12 percent recall increase compared to the model with the oversampling method) when the dataset is small (e.g., 500 training documents) and highly imbalanced (e.g., 9:1). When the data size increases or the imbalance ratio decreases, the improvement from Word2Vec augmentation becomes smaller or insignificant. Moreover, Word2Vec augmentation plus BERT achieves the best performance among the models and methods compared, making it a promising solution for small, highly imbalanced text classification tasks.
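As a rough illustration of the idea behind augmentation-based balancing, the sketch below grows the minority class of a toy 9:1 dataset by replacing words with similar words. It is not the study's implementation: the `NEIGHBOURS` table is a hypothetical stand-in for Word2Vec nearest neighbours, the function names and the replacement probability `p` are invented for the example, and balancing to 1:1 is an assumption that the abstract does not state.

```python
import random
from collections import Counter

# Hypothetical stand-in for Word2Vec nearest neighbours; a real setup would
# query a trained embedding model instead of this hand-written table.
NEIGHBOURS = {
    "bad": ["poor", "awful"],
    "slow": ["sluggish", "laggy"],
    "service": ["support", "staff"],
}


def augment(text, rng, p=0.5):
    """Replace each word that has known neighbours with probability p."""
    out = []
    for word in text.split():
        options = NEIGHBOURS.get(word)
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def balance_with_augmentation(texts, labels, minority, seed=0):
    """Append augmented minority documents until the minority class
    is as large as the largest class (assumed target: 1:1)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_count = max(counts.values())
    minority_texts = [t for t, y in zip(texts, labels) if y == minority]
    new_texts, new_labels = list(texts), list(labels)
    for _ in range(majority_count - counts[minority]):
        new_texts.append(augment(rng.choice(minority_texts), rng))
        new_labels.append(minority)
    return new_texts, new_labels


# Toy 9:1 dataset: nine majority documents, one minority document.
texts = ["great product"] * 9 + ["bad slow service"]
labels = [0] * 9 + [1]
aug_texts, aug_labels = balance_with_augmentation(texts, labels, minority=1)
print(Counter(aug_labels))
```

Simple oversampling corresponds to the same loop with `augment` replaced by an exact copy of the sampled document; the augmented variant adds lexical variety to the minority class instead of repeating identical text.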

University of Missouri Libraries

Lingshu Hu

2022

Sleep Habits and Occurrence of Lowback Pain among Craftsmen

Enhancing Non-Formal Learning Certificate Classification with Text Augmentation: A Comparison of Character, Token, and Semantic Approaches

Aim/Purpose: The purpose of this paper is to address the gap in the recognition of prior learning (RPL) by automating the classification of non-formal learning certificates using d...

Over-Sampling Effect in Pre-Training for Bidirectional Encoder Representations from Transformers (BERT) to Localize Medical BERT and Enhance Biomedical BERT (Preprint)

BACKGROUND Pre-training large-scale neural language models on raw texts has made a significant contribution to improving transfer learning in natural langua...

Bounds on the sum of broadcast domination number and strong metric dimension of graphs

Let [Formula: see text] be a connected graph of order at least two with vertex set [Formula: see text]. For [Formula: see text], let [Formula: see text] denote the length of an [Fo...

ANALYSIS OF READING MATERIALS IN TEXTBOOK FOR GRADE XI SENIOR HIGH SCHOOL

This study aims to find out the GI and LD level, the text which has the highest GI and LD and what make the text has the highest GI and LD of Advanced Learning English 2 textbook. ...

A Pre-Training Technique to Localize Medical BERT and to Enhance Biomedical BERT

Abstract Background: Pre-training large-scale neural language models on raw texts has been shown to make a significant contribution to a strategy for transfer learning in n...

Abstract The rapid growth of open access publishing (OAP) has significantly improved the accessibility and dissemination of scientific knowledge. However, this expansion has also c...

