
Stylometry for real-world expert coders: a zero-shot approach

Code stylometry is the application of stylometry techniques to determine the authorship of software source code snippets. It is used in industry to address use cases such as plagiarism detection, code audits, and code review assignment. Most works in the code stylometry literature use machine learning techniques and (1) rely on datasets from in vitro coding competitions for training, and (2) only attempt to recognize authors present in the training dataset (in-distribution authors). In this work we take a fresh look at code stylometry and challenge both assumptions: (1) we recognize expert authors who contribute to real-world open-source projects, and (2) we show how to accurately recognize authors not present in the training set (out-distribution authors). We assemble a novel open dataset for code stylometry tasks consisting of 114,400 code snippets written by 104 authors, each of whom contributed 1,100 snippets. We develop a k-nearest neighbors (k-NN) classifier for the code stylometry task and train it on the dataset. Our system achieves a top accuracy of 69% among five randomly selected in-distribution authors, improving on the state of the art by more than 20%. We also show that when moving from in-distribution to out-distribution authors, the classification performance of the k-NN classifier does not degrade, achieving a top accuracy of 71% among five randomly selected out-distribution authors.
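As a rough illustration of the k-NN idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each snippet has already been turned into a numeric style vector; the three-dimensional vectors, the cosine similarity measure, the author labels, and the function names are all hypothetical choices made for this example, since the abstract does not specify how snippets are represented or compared.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict_author(query, references, k=3):
    """Return the majority author among the k reference snippets most similar to query.

    references is a list of (style_vector, author_label) pairs.
    """
    ranked = sorted(
        references,
        key=lambda ref: cosine_similarity(query, ref[0]),
        reverse=True,
    )
    votes = Counter(author for _, author in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, hand-made style vectors: NOT data from the paper.
references = [
    ([0.9, 0.1, 0.2], "author_a"),
    ([0.8, 0.2, 0.1], "author_a"),
    ([0.1, 0.9, 0.3], "author_b"),
    ([0.2, 0.8, 0.4], "author_b"),
]

query = [0.85, 0.15, 0.15]
predicted = knn_predict_author(query, references, k=3)
print(predicted)

# Sanity check on the dataset size quoted in the abstract.
total_snippets = 104 * 1100
```

One property of nearest-neighbour classification that makes it a natural fit for unseen authors is that the reference set can be swapped for labelled snippets from new authors without refitting any parameters. Whether the paper exploits this in exactly that way is not stated in the abstract.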

PeerJ

Andrea Gurioli, Maurizio Gabbrielli, Stefano Zacchiroli

PeerJ Computer Science

2024


