An Integrated Clustering and BERT Framework for Improved Topic Modeling

Abstract: Topic modeling is a machine learning technique widely used in Natural Language Processing (NLP) applications to infer topics from unstructured text. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling techniques and can automatically detect topics in large collections of text documents. However, LDA-based topic models alone do not always give satisfactory results. Clustering is an effective family of unsupervised machine learning algorithms, used extensively for tasks such as information extraction from unstructured text and topic modeling. This study examines in detail a hybrid of Bidirectional Encoder Representations from Transformers (BERT) and LDA for topic modeling with clustering. Because clustering algorithms are computationally expensive and their cost grows with the number of features, dimensionality reduction with PCA, t-SNE, and UMAP is also applied. Finally, a unified clustering-based framework using BERT and LDA is proposed for mining meaningful topics from massive text corpora. Experiments that simulate user input on benchmark datasets demonstrate the effectiveness of this cluster-informed topic modeling framework. The results show that clustering helps infer more coherent topics, and hence that the unified clustering and BERT-LDA approach can be used effectively to build topic modeling applications.
Research Square Platform LLC
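The abstract does not say how the BERT and LDA representations are combined or which clustering algorithm is used. The following is only a minimal sketch of the general idea, under these assumptions: the two vectors are concatenated, with a hypothetical weight `gamma` on the LDA part; plain k-means does the clustering; and the most frequent words per cluster serve as a crude topic description. The `lda_vecs` and `emb_vecs` values are hand-made toy numbers standing in for real LDA topic probabilities and BERT sentence embeddings, and the dimensionality reduction step is omitted.

```python
import math
import random
from collections import Counter


def combine(lda_vec, emb_vec, gamma=1.0):
    """Concatenate an LDA topic vector (scaled by gamma) with an embedding.

    Concatenation and the gamma weight are assumptions of this sketch.
    """
    return [gamma * x for x in lda_vec] + list(emb_vec)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_words(docs, labels, n=3):
    """Most frequent tokens per cluster, as a crude topic description."""
    counts = {}
    for doc, lab in zip(docs, labels):
        counts.setdefault(lab, Counter()).update(doc.lower().split())
    return {lab: [w for w, _ in c.most_common(n)] for lab, c in counts.items()}


# Toy corpus; the vectors below are illustrative placeholders, not model output.
docs = [
    "stocks market trading",
    "market bonds stocks",
    "football match goal",
    "goal football league",
]
lda_vecs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
emb_vecs = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]

features = [combine(l, e, gamma=1.0) for l, e in zip(lda_vecs, emb_vecs)]
labels = kmeans(features, k=2)
topics = top_words(docs, labels)
print(labels, topics)
```

In a real pipeline the placeholder vectors would come from a trained LDA model and a BERT encoder, and the combined features would typically be reduced (for example with PCA or UMAP, as the abstract mentions) before clustering.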

Related Results

The Kernel Rough K-Means Algorithm
Background: Clustering is one of the most important data mining methods. The k-means (c-means) and its derivative methods are the hotspot in the field of clustering research in re...
A Pre-Training Technique to Localize Medical BERT and to Enhance Biomedical BERT
Background: Pre-training large-scale neural language models on raw texts has been shown to make a significant contribution to a strategy for transfer learning in n...
Topics (Automated Content Analysis)
Topics describe the main issue discussed in an article, for example: Does an article deal with politics, economics or sports? Field of application/theoretical foundation: In the co...
The Performance of BERT as Data Representation of Text Clustering
Text clustering is the task of grouping a set of texts so that texts in the same group are more similar to each other than to those in a different group. The process of groupi...
Improving Chinese hate speech detection with BERT-fastText fusion and BERT-BiLSTM fusion
Hate speech detection is an essential technique in the online environment, especially on social media platforms. This technique helps to create a safer space and reduce the risk of...
Comment text clustering algorithm based on improved DEC
Aiming at the problem that the initial number of clusters and cluster centers obtained by the clustering layer in the original deep embedding clustering (DEC) algorithm are highly ...
Image clustering using exponential discriminant analysis
Local learning based image clustering models are usually employed to deal with images sampled from the non‐linear manifold. Recently, linear discriminant analysis (LDA) based vario...