Stylometry for real-world expert coders: a zero-shot approach
Code stylometry is the application of stylometry techniques to determine the authorship of source code snippets. It is used in industry for use cases such as plagiarism detection, code audits, and code review assignment. Most work in the code stylometry literature uses machine learning techniques and (1) relies on datasets from in vitro coding competitions for training, and (2) only attempts to recognize authors present in the training dataset (in-distribution authors). In this work we take a fresh look at code stylometry and challenge both assumptions: (1) we recognize expert authors who contribute to real-world open-source projects, and (2) we show how to accurately recognize authors not present in the training set (out-distribution authors). We assemble a novel open dataset for code stylometry tasks consisting of 114,400 code snippets written by 104 authors, each of whom contributed 1,100 snippets. We develop a k-nearest neighbors (k-NN) classifier for the code stylometry task and train it on the dataset. Our system achieves a top accuracy of 69% among five randomly selected in-distribution authors, improving on the state of the art by more than 20%. We also show that when moving from in-distribution to out-distribution authors, the classification performance of the k-NN classifier remains the same, with a top accuracy of 71% among five randomly selected out-distribution authors.
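To make the k-NN idea concrete, the following is a minimal sketch in Python, not the authors' system. The abstract does not say which features or distance the classifier uses, so the hand-picked layout features in `style_features`, the Euclidean distance, the value of `k`, and the toy snippets by the made-up authors `alice` and `bob` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def style_features(snippet):
    """Map a snippet to a small style vector (hypothetical feature choice)."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()] or [""]
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", snippet)
    n_chars = max(len(snippet), 1)
    n_names = max(len(names), 1)
    return [
        sum(len(ln) for ln in lines) / len(lines) / 80.0,  # mean line length, scaled
        sum(ln.startswith("\t") for ln in lines) / len(lines),  # tab-indented lines
        snippet.count(" ") / n_chars,  # density of space characters
        sum("_" in name for name in names) / n_names,  # share of snake_case names
        sum(ln.rstrip().endswith("{") for ln in lines) / len(lines),  # lines ending in "{"
    ]


def knn_predict(query, references, k=3):
    """Return the majority author among the k references nearest to the query.

    `references` is a list of (snippet, author) pairs. Nothing is fitted, so
    supporting a new author only requires adding labeled snippets to the list.
    """
    q = style_features(query)
    ranked = sorted(references, key=lambda ref: math.dist(q, style_features(ref[0])))
    votes = Counter(author for _, author in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy reference set with two invented authors and deliberately different styles.
REFERENCES = [
    ("int add_two(int first_value) {\n    return first_value + 2;\n}", "alice"),
    ("int sum_all(int *item_list, int item_count) {\n"
     "    int running_total = 0;\n    return running_total;\n}", "alice"),
    ("int addTwo(int firstValue)\n{\n\treturn firstValue+2;\n}", "bob"),
    ("int sumAll(int *itemList,int itemCount)\n{\n"
     "\tint runningTotal=0;\n\treturn runningTotal;\n}", "bob"),
]
```

Because the classifier only compares a query against labeled reference snippets, authors who were never seen before can in principle be handled by extending `REFERENCES`, which is one plausible reading of why k-NN suits the out-distribution setting. Real snippets would need far richer features than the five used here.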

