Test smells 20 years later: detectability, validity, and reliability
Abstract
Test smells aim to capture design issues in test code that reduce its maintainability. These have been extensively studied and generally found quite prevalent in both human-written and automatically generated test cases. However, most evidence of prevalence is based on specific static detection rules. Although those are based on the original, conceptual definitions of the various test smells, recent empirical studies indicate that developers perceive warnings raised by detection tools as overly strict and non-representative of the maintainability and quality of test suites. This leads us to re-assess the accuracy of test smell detection tools and to investigate the prevalence and detectability of test smells more broadly. Specifically, we construct a hand-annotated dataset spanning hundreds of test suites, both written by developers and generated by two test generation tools (EvoSuite and JTExpert), and perform a multi-stage, cross-validated manual analysis to identify the presence of six types of test smells in them. We then use this manual labeling to benchmark the performance and external validity of two test smell detection tools: one widely used in prior work and one recently introduced with the express goal of matching developer perceptions of test smells. Our results primarily show that the current vocabulary of test smells is highly mismatched to real concerns: multiple smells were ubiquitous in developer-written tests but virtually never correlated with semantic or maintainability flaws; machine-generated tests often scored better but, in reality, suffered from a host of problems not well captured by current test smells. Current test smell detection strategies poorly characterized the issues in these automatically generated test suites; in particular, the older tool's detection strategies misclassified over 70% of test smells, both missing real instances (false negatives) and marking many smell-free tests as smelly (false positives).
We identify common patterns in these tests that can be used to improve the tools, to refine and update the definitions of certain test smells, and to highlight as yet uncharacterized issues. Our findings suggest the need for (i) more appropriate metrics that match development practice and (ii) more accurate detection strategies, to be evaluated primarily in industrial contexts.
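The benchmarking step described in the abstract, comparing a detection tool's verdicts against manual labels, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Confusion` class, the `compare` function, the test identifiers, and the labels are hypothetical, and the metric definitions are the textbook ones, not necessarily those used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for one smell type; names are illustrative, not from the paper."""
    tp: int  # tool and annotators both report the smell
    fp: int  # tool reports a smell the annotators did not find
    fn: int  # annotators found a smell the tool missed
    tn: int  # neither reports the smell

    @property
    def precision(self) -> float:
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual = self.tp + self.fn
        return self.tp / actual if actual else 0.0

    @property
    def misclassification_rate(self) -> float:
        # Share of all tests on which tool and annotators disagree.
        total = self.tp + self.fp + self.fn + self.tn
        return (self.fp + self.fn) / total if total else 0.0


def compare(manual: dict[str, bool], tool: dict[str, bool]) -> Confusion:
    """Compare tool verdicts with manual labels for a single smell type."""
    tp = fp = fn = tn = 0
    for test_id, is_smelly in manual.items():
        flagged = tool.get(test_id, False)  # a missing entry means "not flagged"
        if flagged and is_smelly:
            tp += 1
        elif flagged and not is_smelly:
            fp += 1
        elif not flagged and is_smelly:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


# Made-up labels for five hypothetical tests (True = smell present).
manual_labels = {"t1": True, "t2": False, "t3": False, "t4": True, "t5": False}
tool_output = {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False}

result = compare(manual_labels, tool_output)
print(result.precision, result.recall, result.misclassification_rate)
```

In this toy data the tool raises two false positives and one false negative, so both error types mentioned in the abstract appear. A real evaluation would compute such counts separately for each smell type and each tool.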
Springer Science and Business Media LLC

