Anonymous programmers can be identified by analyzing coding style

GuiA · on Jan 22, 2015

Interesting to see this formalized. When I was in grad school and graded undergrad homework/exams, I could most definitely recognize the students by their coding styles after just a few assignments. Every student ends up developing their own habits, and they're quite easy to spot in something as repetitive as code.

I remember teaching a Matlab class for engineers and scientists that was about 50/50 male/female, and the women tended to have much neater code. Code written by males often had comments all jumbled up, inconsistent number of spaces between braces/operators/etc, incoherent variable names, worse names for functions, and so on.

digi_owl · on Jan 22, 2015

Reminds me of that story about why switchboard operators were women.

During the telegraph era, boys often hung around outside the office, as they could earn a bit of coin delivering telegrams.

With the introduction of the phone, they tried using those boys as switchboard operators. Problem was that they would prank the callers by either cross wiring calls or unplugging them mid-call.

So instead women were hired, as they didn't do that. Instead they would listen in on calls, and so became a source of gossip.

This was seen as an acceptable trade-off by the companies, as at least the calls were properly and reliably routed.

okasaki · on Jan 22, 2015

It's funny to think that if you reversed the genders the comment would be at the bottom of the page rather than the top.

vertex-four · on Jan 22, 2015

On HN? Definitely not.

jrs99 · on Jan 22, 2015

could you spot plagiarism?

joelgrus · on Jan 22, 2015

My Code Jam solutions always shared a lot of code -- all the boilerplate for reading inputs, parsing integers, iterating over test cases, and writing out results.

Because of that, it seems like Code Jam is an artificially easy test case for this sort of identification -- I'm pretty sure a human could look at my solutions and conclude they were obviously all written by the same person.

lotophage · on Jan 22, 2015

Not adhering to style guides is now a privacy issue.

busterarm · on Jan 22, 2015

Is it odd that I've been thinking about this one for a while now? I don't really talk to blackhat folks on the regular anymore, but given how important OpSec is, it's really sad how easily identifiable most of their code ends up being. Malware authors brag like crazy ("oh, let me just go ahead and encrypt this payload with my handle as the key...") and I wouldn't be surprised if you found some of the same from the state-sponsored folks.

I'd really love to see someone like grugq weigh in with some thoughts here. The idea of having tools to parse and rewrite my code to be as generic as possible came to me years ago.

ObviousScience · on Jan 22, 2015

> The idea of having tools to parse and rewrite my code to be as generic as possible came to me years ago.

You basically want an obfuscator that replaces all the names of things with generics, and then randomly permutes blocks of code without changing the code paths possible in the final binary. (Perhaps some optimized-for-performance version of this, but that might identify the tool you use.)

It sounds relatively easy to write if you stick to certain coding guidelines (like using techniques amenable to static analysis).

However, this still won't work in some cases, because you'd need more advanced tools to handle profiling of what sized functions and such you ended up writing.

It would be interesting to try and write a tool which defeated any analysis of author patterns in the code, but would require understanding the program across the boundary of function calls, which is a difficult problem. (You probably couldn't write Turing complete code, for example.)
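A minimal sketch of just the "replace all the names with generics" half of that idea, assuming Python source and the standard `ast` module (`ast.unparse` needs Python 3.9+). The names `GenericRenamer` and `anonymize` are made up for illustration. It does not permute blocks, and it will break code that uses imports, attributes, or keyword arguments, so treat it as a toy rather than a real obfuscator.

```python
import ast
import builtins


class GenericRenamer(ast.NodeTransformer):
    """Rename functions, arguments, and variables to v0, v1, ... (toy example)."""

    def __init__(self):
        self.mapping = {}

    def _generic(self, name):
        # Leave builtins such as print/len alone so the result still runs.
        if hasattr(builtins, name):
            return name
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._generic(node.name)
        self.generic_visit(node)  # descend into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._generic(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._generic(node.id)
        return node


def anonymize(source):
    """Parse, rename, and re-emit the source; original formatting is discarded too."""
    tree = GenericRenamer().visit(ast.parse(source))
    return ast.unparse(tree)


print(anonymize("def area(width, height):\n    total = width * height\n    return total\n"))
```

Note that round-tripping through the AST also throws away comments and spacing, so the layout signal goes with it. The structure of the code (how big the functions are, which constructs get used) survives untouched, which is the part the comment above says is hard.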

_ofdw · on Jan 22, 2015

>The idea of having tools to parse and rewrite my code to be as generic as possible came to me years ago.

I feel like awk could do this quite handily, as far as homogenizing spacing, cases, underscoring, etc.

benten10 · on Feb 4, 2015

I'm two weeks late, but this is the topic of my research! Not with code, but with text that follows a given style guide. Even after the standardization of text, it's actually quite remarkable how much you can distinguish authors.

jszymborski · on Jan 22, 2015

Reminds me of how telegraph receivers used to be able to identify transmitters of the telegraph by their "fist" (cadence or rhythm with which they signaled).

Here's a Schneier post about a Concordia University study about identifying e-mail authors. https://www.schneier.com/blog/archives/2011/08/identifying_p...

joemaller1 · on Jan 22, 2015

But can they tell us if TJ Holowaychuk is really only one person?

Sanddancer · on Jan 22, 2015

Probably. Most code collaborations will have people working on different sections of code, be it functions, etc. A program like this would probably break things down via function, and be able to figure out that there are at least two people involved in its making.

pcthrowaway · on Jan 22, 2015

some deeper digging (besides that one Quora post) seems to indicate that he is. I don't have links right now because it took me some hours to find them last time, but it should be said that he is definitely the person shown in his public profile pictures, and if his contributions were supplemented by shadow entities it seems likely that the progression of his work and direction it has taken would be more scattered.

TeMPOraL · on Jan 22, 2015

Could someone provide a comprehensive description of what on earth is TJ Holowaychuk? I did Google a bit, and I'm still confused. Is it the new __why?

juliangregorian · on Jan 22, 2015

No, he's just ridiculously prolific.

towelguy · on Jan 22, 2015

Can they tell us if Satoshi Nakamoto is actually only one person?

patrickmay · on Jan 22, 2015

It's interesting that they only analyze the abstract syntax tree and ignore formatting. I would suspect that brace placement, tabs vs spaces, etc. would provide a useful fingerprint as well.

breck · on Jan 22, 2015

They used both:

> We used a combination of lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., grammatical structure of source code)

The AST stuff is super interesting. The other signals are somewhat superficial. But comparing ASTs? That is deep.
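To make the three feature families concrete, here is a rough sketch of what extracting one or two of each might look like. It assumes Python source and the standard `ast` module as a stand-in; the function name `style_features` and the particular features are my own picks for illustration, not the feature set from the paper.

```python
import ast
from collections import Counter


def style_features(source):
    """Return a tiny, illustrative feature dict for one Python source string."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # Lexical: what the author names things.
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    avg_name_len = sum(map(len, names)) / len(names) if names else 0.0

    # Layout: how the text is arranged on the page.
    indented = [ln for ln in lines if ln[:1] in (" ", "\t")]
    tab_ratio = (
        sum(ln.startswith("\t") for ln in indented) / len(indented)
        if indented else 0.0
    )
    blank_ratio = sum(not ln.strip() for ln in lines) / len(lines) if lines else 0.0

    # Syntactic: which constructs appear in the AST, regardless of formatting.
    node_counts = Counter(type(n).__name__ for n in ast.walk(tree))

    return {
        "avg_name_len": avg_name_len,
        "tab_ratio": tab_ratio,
        "blank_ratio": blank_ratio,
        "node_counts": node_counts,
    }


spaces = style_features("for i in range(3):\n    print(i)\n")
tabs = style_features("for i in range(3):\n\tprint(i)\n")
print(spaces["tab_ratio"], tabs["tab_ratio"])
print(spaces["node_counts"] == tabs["node_counts"])
```

The two snippets differ only in indentation, so the layout feature changes while the AST counts come out identical. That is roughly why the syntactic signal is the one a reformatter can't wash out.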

benten10 · on Feb 4, 2015

One would think that, but with AST's generated from 'normal' text (not code), they're actually quite noisy, and lexical and syntactic features have been more useful[1]. In traditional authorship attribution, AST's are about a decade-old technology. But then, this is code, so for all I know, very different.

PS: [1] shows that using AST's does not get you THAT much entropy gain compared to other features.

1: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6234420

busterarm · on Jan 22, 2015

As they said, it's only the beginning...

logn · on Jan 22, 2015

Auto-formatters will make that less useful.

fpp · on Jan 22, 2015

They gave a presentation on this at 31C3 a few days ago, presenting their findings in more detail & Q/E (http://media.ccc.de/browse/congress/2014/31c3_-_6173_-_en_-_... ).

What I understood from that - it worked quite well with code bases like from the Google Code Jam (large LoC, no style guides etc), but not that well with smaller amounts of code and I'm looking forward to some additional results e.g. with a codebase from a corporate development environment.

cryptos · on Jan 22, 2015

I think that kind of analysis wouldn't be possible with Go (http://golang.org/), since it is very strict/limited and uniform.

jaimebuelta · on Jan 22, 2015

I think that one of the most important parts of this analysis is the naming of variables, so I guess it would still be possible (though probably more difficult).

acomjean · on Jan 22, 2015

This doesn't surprise me. When I was in college we had a daily paper, and I worked in the photo dept. We had a box of "feature photos". They were kind of filler (campus life, people playing hacky sack, feeding ducks, setting up for events, etc.). I figured out one day that I could look at the photos and tell who took them by the style (the photographer's name is on the back).

At one job I had we called uncommented, poorly formatted code "curtis code" for some reason.....

lisa_henderson · on Jan 22, 2015

Of these 3, at least 2 would depend on the language in use:

"We used a combination of lexical features (e.g., variable name choices), layout features (e.g., spacing), and syntactic features (i.e., grammatical structure of source code)"

In particular, "layout features" is a huge issue in some languages, and not at all in others. For instance, a language like Javascript or PHP gives great flexibility about layout, so in those languages I can see each developer having a unique style (and I have been involved in style debates regarding those languages). However, a language like Python has a fairly fixed layout, since the whitespace is significant. And also, in Clojure, I think most programmers use Emacs and accept the Emacs clojure-mode indenting as the default.

Variable name choices is another where some environments encourage similarity, and others allow for unpredictability and unique styles. Within the Ruby On Rails framework, for instance, there are norms about the creation of variable names.

I would guess that syntactic features is perhaps the one characteristic that shows a great deal of uniqueness in every language. I am often surprised at the choices my fellow co-workers make, when it comes to how to solve a problem.

TheLoneWolfling · on Jan 22, 2015

Python does not have anywhere near a fixed layout.

Off the top of my head:

* When / if single-line if / while / etc statements are used
* How many blank lines are used between functions
* How often blank lines are used in functions
* How much indentation is used for initializing lists / etc.
* If multiline strings are used.

Etc.

Some of these are covered by PEPs, yes, but enough people don't follow PEPs religiously that even those offer some information.
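A couple of the habits listed above are easy to measure mechanically. A rough sketch, assuming the standard `ast` module; the function name `layout_habits` is made up, and it only looks at single-line compound statements and the blank lines above top-level functions:

```python
import ast


def layout_habits(source):
    """Count two layout habits in Python source (illustrative, not exhaustive)."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # Single-line compound statements, e.g. "if x: return y".
    compound = [n for n in ast.walk(tree) if isinstance(n, (ast.If, ast.While, ast.For))]
    one_liners = sum(n.body[0].lineno == n.lineno for n in compound)

    # Blank lines immediately above each top-level "def".
    def_gaps = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            i = node.lineno - 2  # index of the line just above the "def"
            gap = 0
            while i >= 0 and not lines[i].strip():
                gap += 1
                i -= 1
            def_gaps.append(gap)

    return {"one_liners": one_liners, "compound": len(compound), "def_gaps": def_gaps}


sample = (
    "def f(x):\n"
    "    if x: return 1\n"
    "    return 0\n"
    "\n"
    "\n"
    "def g(x):\n"
    "    while x:\n"
    "        x -= 1\n"
    "    return x\n"
)
print(layout_habits(sample))
```

Even two numbers like these vary a fair bit between people, and they survive in any codebase that isn't run through a formatter.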

dschiptsov · on Jan 22, 2015

Could be identified in some cases of amateur code, like PHP or Javascript or Clojure.

Good practice is to follow a very explicit coding style which makes code written by different developers indistinguishable - the more the better.

Go ahead, identify which developer wrote which part of Linux kernel or, god forbid, jdk/src/

amirmc · on Jan 22, 2015

Something that would be interesting is to follow code styles across people who've pair-programmed. Kinda like the apprenticeship model, I wonder if you could detect specific styles that get adopted and evolve over time.

click170 · on Jan 22, 2015

This is fascinating, my mind is immediately drawn to simple obfuscation programs that would turn tabs into spaces and change the formatting and so on, while still leaving it syntactically correct. Not obfuscation such as to hide the purpose of the code, just the identity of the author.

Does anyone know of any such projects, or what they might be described as? None of the queries I've tried produce the intended results.

You could even take it one step further, if you can identify the author of source code, can you not then forge that signature to make it look like they wrote something they didn't?

userbinator · on Jan 22, 2015

> Does anyone know of any such projects, or what they might be described as? None of the queries I've tried produce the intended results.

I guess it's because you're looking for "obfuscators" while such programs are usually known as "automatic code formatters"... and any decent IDE is going to have the functionality to do this.

For something standalone, look at http://en.wikipedia.org/wiki/Indent_(Unix)

palunon · on Jan 22, 2015

Or astyle for more versatile standalone formatter (and it's frequently the one used by IDEs)...

marak830 · on Jan 22, 2015

I have a question. Now, I've never decompiled anything, but I was under the impression it would come out in machine code, so you wouldn't get programmers' notes, tabs, etc. Can someone explain how they're doing this? I know I should know this Haha, be gentle :-p

geofft · on Jan 22, 2015

Do you test your inputs as soon as the function starts, or where you need them? Or not at all? Do you write lots of little functions, or few big ones? Do you like passing by value or reference when either would work? Do you prefer this wrapper class or that one? Do you use this language feature or that mostly-equivalent one? What are your externally-visible symbols (like API functions) named? How do you report errors, and what sort of grammar do you use? How much do you log? Do you allocate medium-sized structures on the stack or the heap? Do your for loops count up or down? Do your error-checks on UNIX functions test for == -1 or for < 0? Do your own functions return -1, 0, or something else on error? etc.

AlyssaRowan · on Jan 22, 2015

You still get what they wrote, and most of how they wrote it.

It's hard to describe, but you kind of get the knack of spotting trends after a while. The accuracy's kind of poor, though. There are a lot of programmers in the world; if someone uses and reuses very distinctive routines, fine, but otherwise you're only really going to get a general impression.

You probably won't find Satoshi this way.

jetti · on Jan 22, 2015

You wouldn't get any of that information unless you compiled with gcc -g. However, the article uses the source code and doesn't decompile anything.

marak830 · on Jan 22, 2015

Ahh thanks guys I must have missed that part.

newaccountfool · on Jan 22, 2015

You're correct, this would only be useful with scripting languages such as Python, or with code for which you had the source.

ryan-c · on Jan 22, 2015

I'm pretty sure something similar could be done with shell history logs.

bfe · on Jan 22, 2015

This stylometry analysis is 95% of a stylometry obfuscator/homogenizer.

avodonosov · on Jan 22, 2015

The same way authors of text posts online can be identified.

geetarista · on Jan 22, 2015

gofmt ftw ;)

Sanddancer · on Jan 22, 2015

Formatting along those lines is just one vector that's used. This gives a lot of weight to the underlying AST of the code, so in order to obfuscate, you'd have to have a program that scrambles variable names, placement and content of control blocks, etc. Basically, you'd need less an autoformatter and more an obfuscator, probably coupled with a deobfuscator to make the code as generic as possible.

xkarga00 · on Jan 22, 2015

Exactly my thoughts

avodonosov · on Jan 22, 2015

Can it help to find the real author of bitcoin?

gwern · on Jan 22, 2015

Hard to say. This is analogous to drug testing or terrorist hunting: even if you have a highly accurate test, you're going to want to apply it to thousands of programmers, and suddenly, when you do the Bayesian calculation, your high accuracy turns out to still be a low probability of having correctly identified the true author.

And then you have to justify your closed-world assumption: how do you know Satoshi (under his real name) was even in your dataset? Maybe after Bitcoin he went back to closed-source work or commercial projects, and none of his source code other than Bitcoin appears in your dataset. Then the guy your analysis picked out isn't 'Satoshi' so much as 'the guy who looks the most like Satoshi (but actually isn't)'.
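To put rough numbers on the Bayesian point above, here is a small sketch. The figures are entirely hypothetical (10,000 candidates, a test that flags the true author 95% of the time and wrongly flags 1% of everyone else), and it assumes a uniform prior and that the true author really is among the candidates, which is exactly the closed-world assumption in question.

```python
def posterior_true_author(n_candidates, sensitivity, false_positive_rate):
    """P(a flagged candidate is the true author), via Bayes' rule with a uniform prior."""
    prior = 1 / n_candidates
    # Total probability that a randomly chosen candidate gets flagged.
    p_flag = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_flag


# Hypothetical numbers, chosen only to illustrate the base-rate effect.
print(round(posterior_true_author(10_000, 0.95, 0.01), 4))  # 0.0094
```

So under those made-up numbers a "match" is the real author less than 1% of the time, because about 100 innocent candidates get flagged alongside the one true author.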

avodonosov · on Jan 23, 2015

We can find a number of suspects, and then analyze other facts about them.

Iv · on Jan 22, 2015

"Prose authorship attribution that utilizes parse trees have been able to identify an anonymous text from 100,000 candidate authors 20% of the time."

Color me unimpressed