The many faces of a text : applications and enhancements of multi-label text classification algorithms

Multi-Label Text Classification (MLTC) is a challenging yet vital component of analyzing large text collections. The aim of MLTC is to assign one or more labels to a text; these labels can be topics, emotions, or medical codes. This raises several challenges, including imbalanced label sets, domain-specific terminology, label space complexity, and the increasing complexity of models. Hence, this dissertation both explores state-of-the-art techniques to counter these issues and critically evaluates existing methods.

Part I of the dissertation tackles the issue of tracking vaccine hesitancy arguments using MLTC models. First, we describe the development of a vaccine hesitancy monitor (Vaccinpraat) and the task of detecting vaccine hesitancy arguments from X posts and Facebook comments. Additionally, we introduce CoNTACT, a Dutch Large Language Model (LLM) adapted to the language use in COVID-19 X posts. Compared to base models, CoNTACT yields improvements for both vaccine hesitancy detection and multi-label vaccine hesitancy argument classification. Finally, we augment the Vaccinpraat dataset with LLM-generated vaccine-hesitant X posts annotated with multi-label vaccine hesitancy arguments. We find that adding this data, despite its prototypical nature, further improves the performance of multiple models on argument classification.

Part II of the dissertation expands the scope of the research by investigating data scarcity, label space complexity, and computational efficiency for multiple domains. We compare the performance of generative LLMs with fine-tuned LLMs for topic classification of news articles related to Corporate Social Responsibility (CSR). To further enhance the performance of fine-tuned LLMs, we train them with additional training objectives and augment the training data with LLM-generated paraphrases of the training data. We observe that fine-tuned LLMs outperform generative LLMs.
To address the issue of label space complexity, we model label hierarchies by fine-tuning LLMs with hierarchy-aware loss functions. We explore two geometric spaces to calculate the similarity measures for these loss functions, namely the Euclidean space and the hyperbolic space. We find that both spaces yield equal results for both loss functions. Finally, we investigate a computationally efficient classification method that leverages the semantic similarity between texts and labels. We efficiently optimize label-specific thresholds, an approach that consistently outperforms existing thresholding methods on multiple datasets. In sum, this dissertation offers insight into the complexities of multi-label text classification by tackling several core issues, evaluating existing approaches to these issues, and proposing novel potential solutions.
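
To make the idea of similarity-based classification with label-specific thresholds more concrete, the following is a minimal sketch and not the dissertation's actual method. It assumes that texts and labels have already been embedded as vectors, scores each text-label pair by cosine similarity, and picks one threshold per label with a plain grid search that maximizes F1 on held-out data. The function names, the grid, and the choice of F1 as the selection criterion are all illustrative assumptions.

```python
import numpy as np


def cosine_scores(text_emb, label_emb):
    """Cosine similarity between every text and every label embedding.

    text_emb: (n_texts, dim), label_emb: (n_labels, dim).
    Returns an (n_texts, n_labels) score matrix.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    l = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return t @ l.T


def f1_binary(y_true, y_pred):
    """F1 score for one label; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def fit_label_thresholds(scores, labels, grid=None):
    """Pick one threshold per label by grid search (hypothetical helper).

    scores and labels are (n_texts, n_labels); labels contain 0/1.
    The first grid value reaching the best F1 is kept for each label.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 21)  # assumed grid, step 0.05
    thresholds = np.zeros(scores.shape[1])
    for j in range(scores.shape[1]):
        best_f1, best_t = -1.0, grid[0]
        for t in grid:
            f1 = f1_binary(labels[:, j], (scores[:, j] >= t).astype(int))
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[j] = best_t
    return thresholds


def predict(scores, thresholds):
    """Assign label j to a text when its score reaches threshold j."""
    return (scores >= thresholds).astype(int)


# Toy validation data: 4 texts, 2 labels (made-up numbers for illustration).
val_scores = np.array([[0.9, 0.2], [0.8, 0.6], [0.3, 0.7], [0.1, 0.4]])
val_labels = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
thresholds = fit_label_thresholds(val_scores, val_labels)
val_pred = predict(val_scores, thresholds)
```

Because each label gets its own cut-off, a rare label whose similarity scores are systematically lower can still be predicted, which a single global threshold would tend to suppress.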

University of Antwerp

Jens Van Nooten

2025
