Comparative Analysis of Topic Modeling Algorithms for Short Texts in Persian Tweets

Abstract: Topic modeling is a popular natural language processing technique for uncovering hidden patterns and topics in large text collections. However, few comprehensive studies focus on applying topic modeling algorithms to short texts, particularly those from social media platforms, and fewer still compare topic modeling algorithms for low-resource languages such as Persian. Our study addresses this gap by investigating topic modeling algorithms and evaluation metrics suited to short texts, specifically Persian tweets. We collected and preprocessed a substantial dataset of Persian tweets, and we developed a dedicated tool that supports reproducibility and facilitates the evaluation of topic modeling algorithms on this dataset. Our comparative analysis covers Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Gibbs Sampling Dirichlet Mixture Model (GSDMM), and Correlated Topic Model (CTM). To measure performance, we used three well-established metrics: Purity, Normalized Mutual Information (NMI), and Coherence. Our experimental results indicate that GSDMM and CTM+BERT outperform the other algorithms in terms of purity and NMI on the Persian short-text topic modeling dataset. In addition, CTM+BERT achieves coherence competitive with that of GSDMM. These results offer insight into the effectiveness of different topic modeling approaches for short texts and can help researchers select the most appropriate algorithm for their use case.
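
As a rough illustration of two of the metrics named in the abstract, the sketch below computes Purity and NMI for a set of documents with known topic labels and predicted cluster assignments. It is not the authors' evaluation tool: the function names and the toy labels are hypothetical, and the NMI normalization (arithmetic mean of the two entropies) is an assumption, since the abstract does not say which variant was used.

```python
from collections import Counter
from math import log


def purity(true_labels, cluster_ids):
    """Fraction of documents that carry the majority true label of their cluster."""
    by_cluster = {}
    for label, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def entropy(assignments):
    """Shannon entropy (natural log) of a list of discrete assignments."""
    n = len(assignments)
    return -sum((c / n) * log(c / n) for c in Counter(assignments).values())


def nmi(true_labels, cluster_ids):
    """Mutual information normalized by the mean of the two entropies (assumed variant)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    label_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_ids)
    mutual_info = sum(
        (c / n) * log(c * n / (label_counts[label] * cluster_counts[cluster]))
        for (label, cluster), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(cluster_ids)) / 2
    # If both partitions are a single group, treat them as identical.
    return mutual_info / denom if denom > 0 else 1.0


# Hypothetical toy data: four tweets, two gold topics, two predicted clusters.
gold = ["sport", "sport", "politics", "politics"]
predicted = [0, 0, 1, 1]
print(purity(gold, predicted), nmi(gold, predicted))
```

Both scores lie between 0 and 1, with higher values indicating closer agreement between predicted clusters and gold topics. Coherence is omitted here because it depends on word co-occurrence statistics from a reference corpus rather than on gold labels.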

Research Square Platform LLC

Amir Hossein Karimi, Masoud Akbari, Mohammad Akbari

2023


