Search engine for discovering works of Art, research articles, and books related to Art and Culture

Semantic clustering method using integration of advanced LDA algorithm and BERT algorithm

Description:
The subject of the study is an in-depth semantic data analysis based on a modification of the Latent Dirichlet Allocation (LDA) methodology and its integration with Bidirectional Encoder Representations from Transformers (BERT).
Relevance.
Latent Dirichlet Allocation (LDA) is a fundamental topic modeling technique that is widely used in a variety of text analysis applications.
Although its usefulness is widely recognized, traditional LDA models often face limitations such as rigid topic distributions and an inadequate representation of the semantic nuances inherent in natural language.
The purpose and main idea of the study is to improve the adequacy and accuracy of semantic analysis by extending the basic LDA mechanism to integrate adaptive Dirichlet priors and exploit the deep semantic capacity of BERT embeddings.
Research methods: 1) selection of textual datasets; 2) data preprocessing; 3) improvement of the LDA algorithm; 4) integration with BERT embeddings; 5) comparative analysis.
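No implementation accompanies the abstract; the following minimal sketch only illustrates what such a pipeline could look like, assuming gensim for the LDA component (its learned asymmetric prior, alpha="auto", standing in for the adaptive Dirichlet priors described above), sentence-transformers for the BERT embeddings, and scikit-learn for the clustering step. The corpus, model checkpoint, and cluster counts are placeholders, not the authors' choices.

```python
# Illustrative sketch: combine LDA topic vectors with BERT embeddings and
# cluster the joint representation. Library choices are assumptions.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

docs = [
    "sample document about topic modelling and text clustering",
    "another short document about latent dirichlet allocation",
    "a document mentioning transformers and contextual embeddings",
    "a final document about evaluating clusters with silhouette scores",
]  # placeholder corpus; a real experiment would use a full preprocessed dataset
tokenized = [d.lower().split() for d in docs]  # simplified preprocessing step

# LDA with a learned asymmetric Dirichlet prior (alpha="auto") as a stand-in
# for the adaptive priors described in the abstract.
dictionary = Dictionary(tokenized)
bows = [dictionary.doc2bow(toks) for toks in tokenized]
lda = LdaModel(corpus=bows, id2word=dictionary, num_topics=10,
               alpha="auto", passes=10, random_state=0)
topic_vecs = np.array([
    [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]
    for bow in bows
])

# BERT-family sentence embeddings capturing contextual semantics.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
bert_vecs = encoder.encode(docs)

# Joint representation: normalize both views and concatenate them.
joint = np.hstack([normalize(topic_vecs), normalize(bert_vecs)])

# Final clustering of the joint semantic space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(joint)
```

Concatenating normalized topic proportions with normalized contextual embeddings is one common way to combine the two views; the integration scheme actually used in the paper may differ.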
Research objectives: 1) theoretical substantiation of the LDA modification; 2) implementation of the integration with BERT; 3) evaluation of the method's efficiency; 4) comparative analysis; 5) development of an architectural solution.
The results of the research: first, the theoretical foundations of both the standard and the modified LDA model are outlined, and the extended formulation is presented in detail.
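The extended formulation itself appears in the paper rather than here; as a reference point, the generative process of standard LDA, which the modification builds on, can be stated as follows (an adaptive-prior variant would replace the fixed hyperparameter alpha with document-adapted values, which is an assumption based on the description above).

```latex
% Standard LDA generative process (the baseline being extended).
\begin{align*}
  \theta_d &\sim \mathrm{Dirichlet}(\alpha)
    && \text{per-document topic proportions}\\
  \varphi_k &\sim \mathrm{Dirichlet}(\beta)
    && \text{per-topic word distribution}\\
  z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d)
    && \text{topic of word } n \text{ in document } d\\
  w_{d,n} \mid z_{d,n} &\sim \mathrm{Categorical}\!\left(\varphi_{z_{d,n}}\right)
    && \text{observed word}
\end{align*}
```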
Through a series of experiments on text datasets characterized by different emotional states, we highlight the key advantages of the proposed approach.
Based on a comparative analysis of indicators such as intra- and inter-cluster distances and the silhouette coefficient, we demonstrate the improved coherence, interpretability, and adaptability of the modified LDA model.
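The short sketch below shows how such indicators are commonly computed, using scikit-learn's silhouette_score and straightforward centroid-based definitions of the intra- and inter-cluster distances; it reuses the joint representation and labels from the pipeline sketch above and is not the paper's evaluation code.

```python
# Illustrative evaluation of a clustering with common definitions of the
# indicators named above; not the authors' evaluation code.
import numpy as np
from sklearn.metrics import silhouette_score

def cluster_quality(X, labels):
    """Return mean intra-cluster distance, mean inter-centroid distance,
    and the silhouette coefficient for a labelled clustering."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Intra-cluster: average distance of each point to its own centroid.
    intra = np.mean([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    # Inter-cluster: average pairwise distance between cluster centroids.
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    inter = np.mean([np.linalg.norm(centroids[i] - centroids[j]) for i, j in pairs])
    return intra, inter, silhouette_score(X, labels)

# Example: evaluate the clustering produced in the pipeline sketch above.
intra, inter, sil = cluster_quality(joint, labels)
print(f"intra={intra:.3f}  inter={inter:.3f}  silhouette={sil:.3f}")
```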
An architectural solution for implementing the method is proposed.
Conclusions.
The empirical results indicate a significant improvement in the detection of subtle complexities and thematic structures in textual data, a step forward in the evolution of topic modeling methodologies.
In addition, the results of the research not only open up the possibility of applying LDA to more complex linguistic scenarios, but also outline ways to improve it further for unsupervised text analysis.

Related Results

A METHOD OF SEMANTIC DATA ANALYSIS FOR DETERMINING MARKER WORDS WHEN PROCESSING VISITOR EVALUATION RESULTS IN INTERACTIVE ART
The subject of the study is an in-depth semantic data analysis based on the integration of the Latent Dirichlet Allocation (LDA) methodology and the bidirectional encoder representation...
Image clustering using exponential discriminant analysis
Local learning based image clustering models are usually employed to deal with images sampled from the non‐linear manifold. Recently, linear discriminant analysis (LDA) based vario...
The Kernel Rough K-Means Algorithm
Background: Clustering is one of the most important data mining methods. The k-means (c-means) and its derivative methods are the hotspot in the field of clustering research in re...
A Semantic Orthogonal Mapping Method Through Deep-Learning for Semantic Computing
In order to realize an artificial intelligent system, a basic mechanism should be provided for expressing and processing the semantic. We have presented semantic computing models i...
A Pre-Training Technique to Localize Medical BERT and to Enhance Biomedical BERT
Abstract Background: Pre-training large-scale neural language models on raw texts has been shown to make a significant contribution to a strategy for transfer learning in n...
Parallel density clustering algorithm based on MapReduce and optimized cuckoo algorithm
In the process of parallel density clustering, the boundary points of clusters with different densities are blurred and there is data noise, which affects the clustering performanc...
Comment text clustering algorithm based on improved DEC
Aiming at the problem that the initial number of clusters and cluster centers obtained by the clustering layer in the original deep embedding clustering (DEC) algorithm are highly ...
