Topic modeling in software engineering research
Abstract: Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims to describe how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modeled most, (3) data pre-processing and modeling parameters vary considerably and are often vaguely reported, and (4) manual topic naming (such as deducing names based on frequent words in a topic) is common.
Springer Science and Business Media LLC
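The workflow the abstract refers to (pre-process text, fit an LDA model, then inspect the most frequent words per topic to name it) can be illustrated with a minimal sketch. This is not the method of any surveyed paper: it is a small collapsed Gibbs sampler for LDA in plain Python, and the stop-word list, the four example documents, the function names, and the parameter values (`alpha`, `beta`, `iterations`, `seed`) are all assumptions chosen for illustration.

```python
import random
import re
from collections import Counter

# Hypothetical stop-word list; real studies use larger, task-specific lists.
STOP_WORDS = {"the", "a", "an", "in", "on", "to", "of", "and", "is", "when", "for", "with", "how"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (topic_word counts, doc_topic counts)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """Most frequent words per topic: the usual starting point for manual naming."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Invented example corpus (two bug-report-like and two Q&A-like texts).
raw_docs = [
    "App crashes with a null pointer exception when saving the file",
    "Null pointer exception thrown on startup crash",
    "How to parse JSON in Python with the requests library",
    "Python library for parsing JSON responses",
]
docs = [preprocess(text) for text in raw_docs]
topic_word, doc_topic = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(f"topic {k}: {', '.join(words)}")
```

On a corpus this small the resulting word clusters are not meaningful; the sketch only shows where pre-processing choices and modeling parameters enter the pipeline, which is why reporting them matters for reproducibility.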
Related Results
Software industry awareness on sustainable software engineering: a Brazilian perspective
Sustainable computing is a rapidly growing research area spanning several areas of computer science. In the software engineering field, the topic has received increasing attention ...
Software Assurance
Confidence in software quality is a rare commodity throughout all industries. Software publishers, users, and system integrators are highly distrustful of anyone...
Comparative Analysis of Topic Modeling Algorithms for Short Texts in Persian Tweets
Topic modeling is a popular natural language processing technique to uncover hidden patterns and topics in extensive text collections. However, there is a lack of ...
Performance simulation methodologies for hardware/software co-designed processors
Recently the community started looking into Hardware/Software (HW/SW) co-designed processors as potential solutions to move towards the less power consuming and the less complex de...
ELIXIR Europe on the Road to Sustainable Research Software
ELIXIR (ELIXIR Europe 2019a) is an intergovernmental organization that brings together life science resources across Europe. These resources include databases, software tools, trai...
Modeling Techniques for Software-Intensive Systems
Software has become the driving force in the evolution of many systems, such as embedded systems (especially automotive applications), telecommunication systems, and large scale he...
Pengaruh Kadar Air dan Kadar Abu terhadap Nilai Kalori Batubara Berdasarkan Analisis Rergesi Linier Berganda
Abstract. Coal contains moisture in the air, ash, volatiles, and fixed carbon. Proximate analysis was conducted to determine these contents, and the calorific value of the coal was...
Ethics: What is the Brazilian Software Engineering Research Scenario?
Background: Ethics is the theory or science of the moral behavior of humans in society. Traditionally, we value “unethical” actions that go against determining morality in a specif...


