
Comparative Analysis of Topic Modeling Algorithms for Short Texts in Persian Tweets

Abstract: Topic modeling is a popular natural language processing technique to uncover hidden patterns and topics in extensive text collections. However, there is a lack of comprehensive studies that focus specifically on applying topic modeling algorithms to short texts, particularly from social media platforms. Even fewer studies have explored comparing different topic modeling algorithms for low-resource languages such as Persian. Our study aims to address this gap by thoroughly investigating topic modeling algorithms and metrics tailored for short texts, particularly Persian tweets. We collected and preprocessed a substantial dataset of Persian tweets. We also developed a dedicated tool that enables reproducibility and facilitates the evaluation of various topic modeling algorithms using the provided dataset. Our comparative analysis included Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Gibbs Sampling Dirichlet Mixture Model (GSDMM), and Correlated Topic Model (CTM). To measure their performance, we employed well-established metrics, namely Purity, Normalized Mutual Information (NMI), and Coherence. Our experimental results indicate that GSDMM and CTM+BERT exhibit superior performance compared to other algorithms in terms of purity and NMI on the Persian short-text topic modeling dataset. Additionally, CTM+BERT demonstrates competitive coherence performance compared to GSDMM. Our study provides valuable insights into the effectiveness of different topic modeling approaches for short texts and can help researchers select the most appropriate algorithm for their specific use case.
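The abstract names Purity and NMI as evaluation metrics but gives no formulas. As a rough illustration only, the sketch below computes both from gold topic labels and predicted cluster assignments using their commonly used definitions. It is not the authors' tool: the function names are hypothetical, and normalizing NMI by the arithmetic mean of the two entropies is an assumption, since the abstract does not say which normalization was used.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Share of documents that belong to the majority gold class of their cluster."""
    clusters = {}
    for gold, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[gold] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies (assumed normalization)."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    mi = sum(
        (c / n) * math.log(c * n / (true_counts[g] * pred_counts[k]))
        for (g, k), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    # Both assignments are constant: treat them as trivially identical.
    return mi / denom if denom > 0 else 1.0


# Toy example with made-up labels: six short texts, three gold topics.
gold = ["sports", "sports", "politics", "politics", "tech", "tech"]
pred = [0, 0, 1, 1, 1, 2]
print("purity:", purity(gold, pred))
print("nmi:", nmi(gold, pred))
```

In the toy example, five of the six texts sit in the majority class of their cluster, so purity is 5/6. Purity rewards many small clusters, which is one reason it is usually reported alongside NMI.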

Related Results

Primerjalna književnost na prelomu tisočletja
In a comprehensive and at times critical manner, this volume seeks to shed light on the development of events in Western (i.e., European and North American) comparative literature ...
Evaluation of Medical Confidentiality Breaches on Twitter Among Anesthesiology and Intensive Care Health Care Workers
BACKGROUND: With the generalization of social network use by health care workers, we observe the emergence of breaches in medical confidentiality. Our objective was to ...
Faith Tweets: Ambient Religious Communication and Microblogging Rituals
There’s no reason to think that Jesus wouldn’t have Facebooked or twittered if he came into the world now. Can you imagine his killer status updates? Reverend Schenck, New York, Al...
Using Social Media to Predict Food Deserts in the United States: Infodemiology Study of Tweets
Background The issue of food insecurity is becoming increasingly important to public health practitioners because of the adverse health outcomes and underlying ...
Using Social Media to Predict Food Deserts in the United States: Infodemiology Study of Tweets (Preprint)
BACKGROUND The issue of food insecurity is becoming increasingly important to public health practitioners because of the adverse health outcomes and underly...
Sentiment Analysis of Tweets on Soda Taxes
Context: As a primary source of added sugars, sugar-sweetened beverage (SSB) consumption may contribute to the obesity epidemic. A soda tax is an excise tax cha...
Sentiment Analysis of Russia-Ukraine Conflict Tweets Using RoBERTa
[Objective] The moment Russia officially invaded Ukraine, the world experienced a period of tension and uncertainty. As a social release valve, digital communication channels incre...
