Sunday, April 27, 2014

Exploring Expressions of Emotions in GitHub Commit Messages

Anger:
Let's start with anger, an emotion one would expect to be expressed rather frequently in commit messages, as things often go wrong and the process of tracking down bugs and fixing them can be pretty annoying. What stands out in the anger chart compared to the rest is the prominent gap between the "most angry language" VimL and the other languages. Any theories about why this is so?

Regular Expression
(?i)\b(a+rgh|angry|annoyed|annoying|appalled|bitter|cranky|hate|hating|mad)\b'

Joy / Elation:

For similar reasons one can expect anger one can also expect joy. Commits often solve problems, which should make developers happy. While I'm at it, I decided to omit the word "happy" itself. It is used frequently, but often in phrases like "make X happy" or "X is happy", e. g. add readme to make github happy or in negations like tar is not happy on linux - grr, not really indicating a joyful experience.

Regular Expression
(?i)\b(yes|yay|hallelujah|hurray|bingo|amused|cheerful|excited|glad|proud)\b

Amusement:
To detect amusement I heavily relied on Internet slang expressions like "lol" or "rofl" and onomatopoeia like "haha" or "hehe". Looking at the numbers programming cannot be very amusing, but there are probably better ways recognize this emotion.

Regular Expression
(?i)\b(ha(ha)+|he(he)+|lol|rofl|lmfao|lulz|lolz|rotfl|lawl|hilarious)\b

Surprise:
My attempt to detect surprise, was the least satisfying regarding the low numbers, except for sadness which I left out for that reason. There would have been more surprise if I included "wow", but also more "World of Warcraft". Still, I'm a bit surprised of the result, not because Perl is the winner, but because PHP does not seem to surprise people that often, which does not reflect my experience with this language at all.

Regular Expression
(?i)\b(yikes|gosh|baffled|stumped|surprised|shocked)\b

Issues / Bugs:
Every developer knows that bugs creep into code all the time, so it's no surprise that messages about them are by far the most frequent ones compared to everything else in this analysis. Something that becomes even more apparent considering the few words I looked for. A fun fact I have to add: 48.4% of all messages containing "IE", "InternetExplorer" or "Internet Explorer" also satisfy the issue detecting regex below. It's probably so few because I omitted the word problem.

Regular Expression
(?i)\b(bug|fix|issue)|corrected

Swearing (NSFW):
Now to the final and most hilarious part of this analysis: exploring occurrences of swear words, which likely indicate some kind of issue with the code, language, framework or whatever. The chart looks pretty balanced compared to the previous ones and VimL wins again. Assembling a list of swear words is pretty hard. You easily end up with hundreds or even thousands of words and variations, so you somehow need to limit them. One group of words I left out are those referring to genitals, because there is such a wealth of expressions and running some queries indicated that they are used rarely in commit messages.

Regular Expression
(?i)\b(wtf|wth|omfg|hell|ass|bitch|bullshit|bloody|fucking?|shit+y?|crap+y?)\b|\b(fuck|damn|piss|screw|suck)e?d?\b

- More Here

No comments: