Stylometry for real-world expert coders: a zero-shot approach
Code stylometry is the application of stylometry techniques to determine the authorship of software source code snippets. It is used in industry for use cases such as plagiarism detection, code audits, and code review assignment. Most work in the code stylometry literature uses machine learning techniques and (1) relies on training datasets drawn from in vitro coding competitions, and (2) only attempts to recognize authors present in the training dataset (in-distribution authors). In this work we take a fresh look at code stylometry and challenge both of these assumptions: (1) we recognize expert authors who contribute to real-world open-source projects, and (2) we show how to accurately recognize authors not present in the training set (out-distribution authors). We assemble a novel open dataset for code stylometry tasks consisting of 114,400 code snippets written by 104 authors, each of whom contributed 1,100 snippets. We develop a k-nearest neighbors (k-NN) classifier for the code stylometry task and train it on the dataset. Our system achieves a top accuracy of 69% among five randomly selected in-distribution authors, improving on the state of the art by more than 20%. We also show that when moving from in-distribution to out-distribution authors, the classification performance of the k-NN classifier does not degrade, achieving a top accuracy of 71% among five randomly selected out-distribution authors.
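To make the k-NN idea concrete, here is a minimal sketch of nearest-neighbor authorship attribution, not the paper's implementation. The abstract does not describe the feature set, the distance metric, or the value of k, so the four character-density features in `features`, the Euclidean distance, and `k=3` are all assumptions chosen for illustration, and the author names and snippets are invented. The sketch uses two candidate authors rather than five for brevity. Because prediction only needs a handful of labeled reference snippets per candidate, the candidates can be authors who were never seen before, which is the property the out-distribution setting relies on.

```python
import math
from collections import Counter


def features(snippet):
    """Map a snippet to a small style vector.

    Hypothetical features for illustration only; the abstract does not
    say which features the paper uses.
    """
    n_chars = max(len(snippet), 1)
    return [
        snippet.count(" ") / n_chars,                  # space density
        snippet.count("\t") / n_chars,                 # tab density (tab indentation)
        snippet.count("_") / n_chars,                  # underscore density (snake_case hint)
        sum(c.isupper() for c in snippet) / n_chars,   # uppercase density (camelCase hint)
    ]


def knn_predict(query, references, k=3):
    """Return the majority author among the k reference snippets nearest to query.

    references is a list of (author, snippet) pairs.
    """
    q = features(query)
    ranked = sorted(references, key=lambda ref: math.dist(q, features(ref[1])))
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented reference snippets for two hypothetical candidate authors.
references = [
    ("alice", "int getValue() {\n\treturn theValue;\n}"),
    ("alice", "void setValue(int newValue) {\n\ttheValue = newValue;\n}"),
    ("bob", "def get_value(self):\n    return self.the_value"),
    ("bob", "def set_value(self, new_value):\n    self.the_value = new_value"),
]

query = "def add_item(self, new_item):\n    self.all_items.append(new_item)"
print(knn_predict(query, references, k=3))
```

The features are left unscaled because all four are densities in the range 0 to 1; with features on different scales, normalization would be needed before computing distances.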
Related Results
Code stylometry vs formatting and minification
The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to impr...
A Comparative Analysis of Financial Tariff Coding Accuracy Between NonClinical Coders and Orthopaedic Surgeons in A Large District Hospital–A Prospective Controlled Study
Accurate coding for trauma and orthopaedic surgical patients is crucial for reliable data collection, influencing income generation, national statistical analysis, and clinical per...
Automatic Acquisition Method and Empirical Research of Shot Length in Chinese Films based on Machine Vision
The measurement of shot length is an essential index for the evaluation of cinematographic research. Given the limitations of existing measurement tools, which req...
Evaluation of Prompting Strategies for Cyberbullying Detection Using Various Large Language Models
Sentiment analysis detects toxic language for safer online spaces and helps businesses refine strategies through customer feedback analysis [1, 2]. Advancements in Large Language M...
Does Coding Style Really Survive Compilation? Stylometry of Executable Code Revisited
This paper describes a replication study of influential recent work on binary-level code stylometry by Caliskan et al. [8]. Using the Google Code Jam (GCJ) dataset that the origina...
Enhancing Self-Navigated Interleaved Spiral with ESPIRiT (eSNAILS)
Motivation: Current methods for estimation of shot-to-shot phase variations in multi-shot DWI may not fully exploit the correlations in data. Goal(s): To propose a method which eff...
Visualization of Brushing in “Shibo” Production of Hand-Made Japanese Paper
We deal with “Danshi” that has a special wrinkle called “Shibo” on the surface structure of the Japanese paper. Three sheets of wet papers are superimposed to make the Shibo. Wet p...
The article is dedicated to the anniversary of Boris Vasilyevich Markov, the famous philosopher of Saint Petersburg, Russia. The author of the article, basing on many years of personal experience and professional communication with the hero of the day, pr
The article examines the epistemological parameters of the phenomenon of expert examination as well as the social and cognitive features of using scientific knowledge to substantia...

