
Stylometry for real-world expert coders: a zero-shot approach

Code stylometry is the application of stylometry techniques to determine the authorship of software source code snippets. It is used in industry for use cases such as plagiarism detection, code audits, and code review assignment. Most work in the code stylometry literature uses machine learning techniques and (1) relies on datasets drawn from in vitro coding competitions for training, and (2) only attempts to recognize authors present in the training dataset (in-distribution authors). In this work we take a fresh look at code stylometry and challenge both assumptions: (1) we recognize expert authors who contribute to real-world open-source projects, and (2) we show how to accurately recognize authors not present in the training set (out-of-distribution authors). We assemble a novel open dataset for code stylometry tasks consisting of 114,400 code snippets written by 104 authors, each of whom contributed 1,100 snippets. We develop a k-nearest neighbors (k-NN) classifier for the code stylometry task and train it on the dataset. Our system achieves a top accuracy of 69% among five randomly selected in-distribution authors, improving on the state of the art by more than 20%. We also show that when moving from in-distribution to out-of-distribution authors, the classification performance of the k-NN classifier remains essentially the same, reaching a top accuracy of 71% among five randomly selected out-of-distribution authors.
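To make the k-NN idea concrete, here is a minimal sketch of nearest-neighbor authorship attribution among a small set of candidate authors. It is not the paper's system: the hand-picked features in `style_features`, the Euclidean distance, and the names `knn_predict`, `alice`, and `bob` are all assumptions made for illustration. The sketch only shows why a k-NN classifier can handle authors it has never seen: it compares a query against labeled reference snippets supplied at prediction time, so adding a new author needs no retraining.

```python
import math
from collections import Counter


def style_features(snippet: str) -> list[float]:
    """Map a snippet to a tiny, hypothetical style vector (illustration only)."""
    lines = snippet.splitlines() or [""]
    n_chars = max(len(snippet), 1)
    return [
        sum(len(line) for line in lines) / len(lines),  # mean line length
        sum(c.isspace() for c in snippet) / n_chars,    # whitespace ratio
        snippet.count("_") / n_chars,                   # underscore ratio
        sum(c.isupper() for c in snippet) / n_chars,    # uppercase ratio
    ]


def knn_predict(query: str, references: list[tuple[str, str]], k: int = 3) -> str:
    """Return the majority author among the k reference snippets nearest to query.

    `references` is a list of (author, snippet) pairs; the authors may be
    ones that were never used to design or tune the features.
    """
    q = style_features(query)
    ranked = sorted(references, key=lambda ref: math.dist(q, style_features(ref[1])))
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference snippets for two candidate authors.
references = [
    ("alice", "def get_value(x):\n    return x_map[x]\n"),
    ("alice", "def set_value(x, v):\n    x_map[x] = v\n"),
    ("bob", "public int GetValue(int X) { return XMap[X]; }\n"),
    ("bob", "public void SetValue(int X, int V) { XMap[X] = V; }\n"),
]

query = "def del_value(x):\n    del x_map[x]\n"
predicted = knn_predict(query, references, k=3)
print(predicted)
```

In this toy setup the unscaled mean line length dominates the distance. A realistic system would normalize the features or, more likely, use learned representations of code.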
