Javascript must be enabled to continue!
Exploring Word Embeddings for Text Classification: A Comparative Analysis
View through CrossRef
For language tasks like text classification and sequence labeling, word embeddings are essential for providing input characteristics in deep models. There have been many word embedding techniques put out in the past ten years, which can be broadly divided into classic and context-based embeddings. In this study, two encoders—CNN and BiLSTM—are used in a downstream network architecture to analyze both forms of embeddings in the context of text classification. Four benchmarking classification datasets with single-label and multi-label tasks and a range of average sample lengths are selected in order to evaluate the effects of word embeddings on various datasets. CNN routinely beats BiLSTM, especially on datasets that don't take document context into account, according to the evaluation results with confidence intervals. CNN is therefore advised above BiLSTM for datasets involving document categorization where context is less predictive of class membership. Concatenating numerous classic embeddings or growing their size for word embeddings doesn't greatly increase performance, while there are few instances when there are marginal gains. Contrarily, context-based embeddings like ELMo and BERT are investigated, with BERT showing better overall performance, particularly for longer document datasets. On short datasets, both context-based embeddings perform better, but on longer datasets, no significant improvement is seen.In conclusion, this study emphasizes the significance of word embeddings and their impact on downstream tasks, highlighting the advantages of BERT over ELMo, especially for lengthier documents, and CNN over BiLSTM for certain scenarios involving document classification.
Title: Exploring Word Embeddings for Text Classification: A Comparative Analysis
Description:
For language tasks like text classification and sequence labeling, word embeddings are essential for providing input characteristics in deep models.
There have been many word embedding techniques put out in the past ten years, which can be broadly divided into classic and context-based embeddings.
In this study, two encoders—CNN and BiLSTM—are used in a downstream network architecture to analyze both forms of embeddings in the context of text classification.
Four benchmarking classification datasets with single-label and multi-label tasks and a range of average sample lengths are selected in order to evaluate the effects of word embeddings on various datasets.
CNN routinely beats BiLSTM, especially on datasets that don't take document context into account, according to the evaluation results with confidence intervals.
CNN is therefore advised above BiLSTM for datasets involving document categorization where context is less predictive of class membership.
Concatenating numerous classic embeddings or growing their size for word embeddings doesn't greatly increase performance, while there are few instances when there are marginal gains.
Contrarily, context-based embeddings like ELMo and BERT are investigated, with BERT showing better overall performance, particularly for longer document datasets.
On short datasets, both context-based embeddings perform better, but on longer datasets, no significant improvement is seen.
In conclusion, this study emphasizes the significance of word embeddings and their impact on downstream tasks, highlighting the advantages of BERT over ELMo, especially for lengthier documents, and CNN over BiLSTM for certain scenarios involving document classification.
Related Results
Sleep Habits and Occurrence of Lowback Pain among Craftsmen
Sleep Habits and Occurrence of Lowback Pain among Craftsmen
<span style="color: #000000; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 10px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; ...
Sleep Habits and Occurrence of Lowback Pain among Craftsmen
Sleep Habits and Occurrence of Lowback Pain among Craftsmen
<span style="color: #000000; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 10px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; ...
Primerjalna književnost na prelomu tisočletja
Primerjalna književnost na prelomu tisočletja
In a comprehensive and at times critical manner, this volume seeks to shed light on the development of events in Western (i.e., European and North American) comparative literature ...
Bounds on the sum of broadcast domination number and strong metric dimension of graphs
Bounds on the sum of broadcast domination number and strong metric dimension of graphs
Let [Formula: see text] be a connected graph of order at least two with vertex set [Formula: see text]. For [Formula: see text], let [Formula: see text] denote the length of an [Fo...
Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
Learned Text Representation for Amharic Information Retrieval and Natural Language Processing
Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations fo...
ANALYSIS OF READING MATERIALS IN TEXTBOOK FOR GRADE XI SENIOR HIGH SCHOOL
ANALYSIS OF READING MATERIALS IN TEXTBOOK FOR GRADE XI SENIOR HIGH SCHOOL
This study aims to find out the GI and LD level, the text which has the highest GI and LD and what make the text has the highest GI and LD of Advanced Learning English 2 textbook. ...
E-Press and Oppress
E-Press and Oppress
From elephants to ABBA fans, silicon to hormone, the following discussion uses a new research method to look at printed text, motion pictures and a te...
Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study (Preprint)
Exploring the Privacy-Preserving Properties of Word Embeddings: Algorithmic Validation Study (Preprint)
BACKGROUND
Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embe...

