
Does Coding Style Really Survive Compilation? Stylometry of Executable Code Revisited

This paper describes a replication study of influential recent work on binary-level code stylometry by Caliskan et al. [8]. Using the Google Code Jam (GCJ) dataset that the original work used, but with possible differences in authors and tasks, we obtain accuracy results that are significantly lower than those originally reported. An analysis of the features that contribute most to author classification decisions indicates that such features may, in many cases, be accidental artifacts (e.g., due to erroneous disassembly of data bytes embedded in the binary) and have little to do with programming style. Our results suggest that binary-level code stylometry (1) is more sensitive to code characteristics than previously suspected; (2) can be significantly less accurate than previously reported (for 100 authors, we achieved approximately 63% accuracy, compared to the 96% reported in the original work); and (3) deserves careful attention to accidental artifacts arising from the compilation and stylometry toolchains. We found that 29 of the 33 top ndisasm-based features resulted from erroneous disassembly. Our analysis revealed that this might cause the model to pick up spurious features such as the original file name, because the g++ compiler embeds the filename of the source CPP file into the binary, which might inadvertently inflate the results.
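The filename artifact mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's tooling: the helper names `printable_strings` and `embedded_source_names`, the list of extensions, and the toy byte blob are all hypothetical and chosen only for illustration. The sketch scans raw bytes for printable ASCII runs (similar in spirit to the Unix `strings` utility) and reports any run that ends in a C++ source extension, which is the kind of data a byte-level feature extractor could mistake for stylistic signal.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII bytes that are at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def embedded_source_names(
    data: bytes, extensions: tuple[str, ...] = (".cpp", ".cc", ".cxx")
) -> list[str]:
    """Return printable runs that look like C++ source file names."""
    return [s for s in printable_strings(data) if s.lower().endswith(extensions)]


# Toy byte blob standing in for a compiled binary (an assumption, not a real ELF file):
# a header-like prefix, a NUL-terminated file name, a few code bytes, a compiler tag.
blob = (
    b"\x7fELF\x02\x01\x01\x00"
    + b"\x00" * 8
    + b"alice_taskA.cpp\x00"
    + b"\x55\x48\x89\xe5"
    + b"GCC: (GNU) 9.3.0\x00"
)

print(embedded_source_names(blob))
```

If the file name encodes the author or the task, any feature derived from these bytes leaks label information, so a check of this kind before feature extraction is one plausible safeguard.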

Related Results

On Flores Island, do "ape-men" still exist? https://www.sapiens.org/biology/flores-island-ape-men/
Crescimento de feijoeiro sob influência de carvão vegetal e esterco bovino
Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas
Code stylometry vs formatting and minification
The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to impr...
Stylometry for real-world expert coders: a zero-shot approach
Code stylometry is the application of stylometry techniques to determine the authorship of software source code snippets. It is used in the industry to address use cases like plagi...
Even Star Decomposition of Complete Bipartite Graphs
The Annual Performance Review As A Positive Source For Employee Motivation?
