Code stylometry vs formatting and minification

The automatic identification of code authors based on their programming styles, known as authorship attribution or code stylometry, has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51%→68%), (2) code formatting makes a significant dent (15 percentage points) in code stylometry accuracy (68%→53%), with minification subtracting a further 3 percentage points (68%→50% overall). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
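The CST/AST distinction in the abstract can be illustrated with a minimal sketch using only the Python standard library. This is not the paper's code2vec pipeline: the two snippets are hypothetical, and the token stream from `tokenize` is used here only as a rough stand-in for a concrete syntax tree, on the assumption that it keeps some of the surface detail (comments, blank lines, redundant parentheses) that an AST discards.

```python
import ast
import io
import tokenize

# Two hypothetical snippets: the same program written with different layout habits.
AUTHOR_A = "def add(a,b):\n    return a+b   # sum\n"
AUTHOR_B = "def add( a, b ):\n\n    return (a + b)\n"


def ast_fingerprint(source):
    """Abstract view: comments, blank lines and redundant parentheses are gone."""
    return ast.dump(ast.parse(source))


def token_fingerprint(source):
    """Concrete-ish view (a stand-in for a CST): keeps comments, blank-line
    tokens and parentheses as (token type, token text) pairs."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


if __name__ == "__main__":
    same_ast = ast_fingerprint(AUTHOR_A) == ast_fingerprint(AUTHOR_B)
    same_tokens = token_fingerprint(AUTHOR_A) == token_fingerprint(AUTHOR_B)
    print("identical AST:", same_ast)
    print("identical token stream:", same_tokens)
```

In this toy case the two snippets share one AST but differ in their token streams, which is consistent with the reported gap between AST-based and CST-based accuracy. A formatter or minifier would be expected to remove part of that surface difference, though the sketch does not call either tool.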

PeerJ

Stefano Balla, Maurizio Gabbrielli, Stefano Zacchiroli

PeerJ Computer Science

2024
