The many faces of a text : applications and enhancements of multi-label text classification algorithms

Multi-Label Text Classification (MLTC) is a challenging yet vital component of analyzing large text collections. The aim of MLTC is to assign one or more labels to a text; these labels can be, for example, topics, emotions, or medical codes. This raises several challenges, including imbalanced label sets, domain-specific terminology, label space complexity, and the increasing complexity of models. Hence, this dissertation explores state-of-the-art techniques to counter these issues and critically evaluates existing methods.

Part I of the dissertation tackles the issue of tracking vaccine hesitancy arguments using MLTC models. First, we describe the development of a vaccine hesitancy monitor (Vaccinpraat) and the task of detecting vaccine hesitancy arguments in X posts and Facebook comments. Additionally, we introduce CoNTACT, a Dutch Large Language Model (LLM) adapted to the language use in COVID-19 X posts. Compared to base models, CoNTACT yields improvements for both vaccine hesitancy detection and multi-label vaccine hesitancy argument classification. Finally, we augment the Vaccinpraat dataset with LLM-generated vaccine-hesitant X posts annotated with multi-label vaccine hesitancy arguments. We find that adding this data, despite its prototypical nature, further improves the performance of multiple models on argument classification.

Part II of the dissertation expands the scope of the research by investigating data scarcity, label space complexity, and computational efficiency for multiple domains. We compare the performance of generative LLMs with fine-tuned LLMs for topic classification of news articles related to Corporate Social Responsibility (CSR). To further enhance the performance of fine-tuned LLMs, we train them with additional training objectives and augment the training data with LLM-generated paraphrases of the training data. We observe that fine-tuned LLMs outperform generative LLMs. To address the issue of label space complexity, we model label hierarchies by fine-tuning LLMs with hierarchy-aware loss functions. We explore two geometric spaces to calculate the similarity measures for these loss functions, namely the Euclidean space and the hyperbolic space. We find that both spaces yield equal results for both loss functions. Finally, we investigate a computationally efficient classification method that leverages the semantic similarity between texts and labels. We efficiently optimize label-specific thresholds, an approach that consistently outperforms existing thresholding methods on multiple datasets.

In sum, this dissertation offers insight into the complexities of multi-label text classification by tackling several core issues, evaluating existing approaches to these issues, and proposing novel potential solutions.
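The idea of label-specific thresholds on text-label similarity scores can be illustrated with a small sketch. This is not the method developed in the dissertation, whose details the abstract does not give. It assumes that similarity scores between each text and each label are already available, for example as cosine similarities of embeddings, and it uses a plain exhaustive search that picks, for each label, the observed score that maximizes F1 on a development set. All function and variable names are hypothetical.

```python
from typing import List, Sequence


def f1_for_label(scores: Sequence[float], gold: Sequence[int], threshold: float) -> float:
    """F1 of one label when predicting 1 wherever score >= threshold."""
    tp = fp = fn = 0
    for score, is_gold in zip(scores, gold):
        predicted = score >= threshold
        if predicted and is_gold:
            tp += 1
        elif predicted and not is_gold:
            fp += 1
        elif not predicted and is_gold:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_label_thresholds(
    similarities: Sequence[Sequence[float]],
    gold_labels: Sequence[Sequence[int]],
) -> List[float]:
    """Pick one threshold per label (assumed criterion: best dev-set F1).

    similarities[i][j] is the similarity between text i and label j;
    gold_labels[i][j] is 1 if text i carries label j, else 0.
    """
    n_labels = len(similarities[0])
    thresholds = []
    for j in range(n_labels):
        col_scores = [row[j] for row in similarities]
        col_gold = [row[j] for row in gold_labels]
        best_t, best_f1 = 0.5, -1.0  # 0.5 is only a fallback default
        # Candidate thresholds: the scores actually observed for this label.
        for t in sorted(set(col_scores)):
            f1 = f1_for_label(col_scores, col_gold, t)
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        thresholds.append(best_t)
    return thresholds


def predict(similarities: Sequence[Sequence[float]], thresholds: Sequence[float]) -> List[List[int]]:
    """Assign every label whose similarity reaches that label's threshold."""
    return [[int(s >= t) for s, t in zip(row, thresholds)] for row in similarities]


# Toy development set: 4 texts, 2 labels (made-up numbers for illustration).
dev_sims = [[0.9, 0.2], [0.8, 0.6], [0.3, 0.7], [0.1, 0.1]]
dev_gold = [[1, 0], [1, 1], [0, 1], [0, 0]]
thresholds = fit_label_thresholds(dev_sims, dev_gold)
print(thresholds, predict(dev_sims, thresholds))
```

Because each label gets its own cut-off, a rare label whose similarity scores are systematically lower is not penalized by a single global threshold. The exhaustive search shown here is only one simple way to choose those cut-offs.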

University of Antwerp

Jens Van Nooten

2025
