
Code stylometry vs formatting and minification

The automatic identification of code authors based on their programming styles, known as authorship attribution or code stylometry, has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based stylometry (51%→68%); (2) code formatting makes a significant dent of 15 percentage points in code stylometry accuracy (68%→53%), with minification subtracting a further 3 percentage points (68%→50% overall). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers unrecognizable via code stylometry.
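The gap between CST-based and AST-based accuracy is easier to picture with a small experiment. The sketch below is not the study's code2vec pipeline; it is a standard-library illustration in which Python's `ast` module stands in for the AST view and the `tokenize` token stream stands in, roughly, for the extra detail a CST keeps. The two snippets `STYLE_A` and `STYLE_B` and both helper functions are hypothetical names made up for this example.

```python
import ast
import io
import tokenize

# Two hypothetical snippets: the same program written with different layout habits.
STYLE_A = "def add(a,b):\n    return a+b   # sum\n"
STYLE_B = "def add( a, b ):\n\n    return (a + b)\n"


def ast_fingerprint(source: str) -> str:
    """Dump the abstract syntax tree; spacing, comments and redundant
    parentheses are not represented in it."""
    return ast.dump(ast.parse(source))


def token_fingerprint(source: str) -> list:
    """List (token type, token text) pairs; comments, blank lines and
    parentheses survive here, as they would in a concrete syntax tree."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tokenize.tok_name[tok.type], tok.string) for tok in tokens]


# Assumption for illustration only: equality of these fingerprints is a crude
# proxy for "how much author style is visible", not a stylometry classifier.
same_ast = ast_fingerprint(STYLE_A) == ast_fingerprint(STYLE_B)
same_tokens = token_fingerprint(STYLE_A) == token_fingerprint(STYLE_B)
print(same_ast, same_tokens)
```

Here the two snippets yield identical ASTs but different token streams, which suggests why a representation closer to the concrete syntax gives a classifier more stylistic signal, and why a formatter that normalizes layout would remove part of that signal.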
