The many faces of a text : applications and enhancements of multi-label text classification algorithms
Multi-Label Text Classification (MLTC) is a challenging yet vital component of analyzing large text collections. The aim of MLTC is to assign to a text one or more labels, such as topics, emotions, or medical codes. This raises several challenges, including imbalanced label sets, domain-specific terminology, label space complexity, and the increasing complexity of models. Hence, this dissertation both explores state-of-the-art techniques to counter these issues and critically evaluates existing methods.
Part I of the dissertation tackles the issue of tracking vaccine hesitancy arguments using MLTC models. First, we describe the development of a vaccine hesitancy monitor (Vaccinpraat) and the task of detecting vaccine hesitancy arguments in X posts and Facebook comments. Additionally, we introduce CoNTACT, a Dutch Large Language Model (LLM) adapted to the language use in COVID-19 X posts. Compared to base models, CoNTACT yields improvements for both vaccine hesitancy detection and multi-label vaccine hesitancy argument classification. Finally, we augment the Vaccinpraat dataset with LLM-generated vaccine-hesitant X posts annotated with multi-label vaccine hesitancy arguments. We find that adding this data, despite its prototypical nature, further improves the performance of multiple models on argument classification.
Part II of the dissertation expands the scope of the research by investigating data scarcity, label space complexity, and computational efficiency across multiple domains. We compare the performance of generative LLMs with that of fine-tuned LLMs for topic classification of news articles related to Corporate Social Responsibility (CSR). To further enhance the performance of fine-tuned LLMs, we train them with additional training objectives and augment the training data with LLM-generated paraphrases. We observe that fine-tuned LLMs outperform generative LLMs.
To address the issue of label space complexity, we model label hierarchies by fine-tuning LLMs with hierarchy-aware loss functions. We explore two geometric spaces in which to calculate the similarity measures for these loss functions, namely Euclidean space and hyperbolic space. We find that both spaces yield equal results for both loss functions. Finally, we investigate a computationally efficient classification method that leverages the semantic similarity between texts and labels. We efficiently optimize label-specific thresholds, an approach that consistently outperforms existing thresholding methods on multiple datasets. In sum, this dissertation offers insight into the complexities of multi-label text classification by tackling several core issues, evaluating existing approaches to these issues, and proposing novel potential solutions.
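The abstract does not spell out how the label-specific thresholds are optimized, so the following is only a minimal sketch of the general idea under stated assumptions: texts and labels are embedded as vectors, each text-label pair is scored by cosine similarity, and each label gets its own cut-off chosen by a plain grid search that maximizes F1 on held-out data. The function names, the grid, and the choice of F1 as the objective are all hypothetical and are not taken from the dissertation.

```python
import numpy as np


def cosine_scores(text_emb, label_emb):
    """Cosine similarity between every text and every label embedding."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    l = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return t @ l.T  # shape: (n_texts, n_labels)


def f1_binary(y_true, y_pred):
    """F1 for one label; both inputs are boolean vectors."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_label_thresholds(scores, y_true, grid=None):
    """Pick, per label, the grid value that maximizes F1 (assumed objective)."""
    if grid is None:
        grid = np.linspace(-1.0, 1.0, 41)  # cosine similarity lies in [-1, 1]
    y_true = np.asarray(y_true).astype(bool)
    thresholds = np.zeros(scores.shape[1])
    for j in range(scores.shape[1]):
        f1s = [f1_binary(y_true[:, j], scores[:, j] >= t) for t in grid]
        thresholds[j] = grid[int(np.argmax(f1s))]  # first best value on ties
    return thresholds


def predict(scores, thresholds):
    """Assign label j to a text when its score reaches label j's threshold."""
    return scores >= thresholds
```

Because each label is thresholded independently, a rare label with systematically low similarity scores can still be predicted, which a single global cut-off would tend to suppress.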