Comparative Analysis of Topic Modeling Algorithms for Short Texts in Persian Tweets
Abstract
Topic modeling is a popular natural language processing technique for uncovering hidden patterns and topics in large text collections. However, few comprehensive studies focus specifically on applying topic modeling algorithms to short texts, particularly those from social media platforms. Even fewer compare different topic modeling algorithms for low-resource languages such as Persian. Our study addresses this gap by investigating topic modeling algorithms and evaluation metrics suited to short texts, specifically Persian tweets. We collected and preprocessed a substantial dataset of Persian tweets. We also developed a dedicated tool that supports reproducibility and facilitates the evaluation of various topic modeling algorithms on the provided dataset. Our comparative analysis included Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Gibbs Sampling Dirichlet Mixture Model (GSDMM), and Correlated Topic Model (CTM). To measure their performance, we employed three well-established metrics: Purity, Normalized Mutual Information (NMI), and Coherence. Our experimental results indicate that GSDMM and CTM+BERT outperform the other algorithms in terms of purity and NMI on the Persian short-text topic modeling dataset. Additionally, CTM+BERT achieves coherence competitive with that of GSDMM. Our study provides insight into the effectiveness of different topic modeling approaches for short texts and can help researchers select the most appropriate algorithm for their specific use case.
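To make two of the evaluation metrics concrete, the following is a minimal sketch of how Purity and NMI can be computed from gold topic labels and the cluster assignments a model produces. It is not the authors' tool: the function names, the toy labels, and the choice to normalize mutual information by the arithmetic mean of the two entropies are all assumptions made for illustration, since the abstract does not state which NMI variant was used.

```python
from collections import Counter
from math import log


def purity(true_labels, cluster_ids):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        per_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(true_labels, cluster_ids):
    """Mutual information normalized by the mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants (min, max,
    geometric mean) exist and give different values.
    """
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_ids))
    class_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_ids)
    mi = sum(
        (c / n) * log(c * n / (class_counts[t] * cluster_counts[k]))
        for (t, k), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(cluster_ids)) / 2
    # Both labelings constant: treat as perfect agreement by convention.
    return mi / denom if denom > 0 else 1.0


# Hypothetical toy data: six tweets with gold topics and model clusters.
gold = ["sports", "sports", "sports", "politics", "politics", "tech"]
pred = [0, 0, 1, 1, 1, 2]
print(f"purity = {purity(gold, pred):.3f}")
print(f"nmi    = {nmi(gold, pred):.3f}")
```

Both scores reach 1.0 only when every cluster contains a single gold topic and every topic maps to a single cluster. Purity alone can be inflated by producing many small clusters, which is one reason it is usually reported alongside NMI.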

