I've twice gotten into heated debates, (and I remember the count since I still find it weird that I've gotten into arguments about this), that there is no universal letter frequency probability values for the English language. Both times I stumbled into this argument by answering the question, "So, do passwords match letter frequency analysis of the English language as a whole", by replying, "Well, it depends on what training set you use to calculate the probabilities of the English language, but for the most part, yes". In reality I should have just said, "For the most part yes..."
Letter frequency analysis, and its big brother Markov models, depend entirely on their training sets. This means while different training sets may share common characteristics, there is no one LFA, or Markov set of probabilities that perfectly model the English, (or any other), language. But what about those tables Wikipedia posted? Well, they are based on a study in 1948 by Herbert Zim, who analyzed several American books like Moby Dick, along with a collection of poetry. It was a fairly impressive effort considering he didn't use a computer so I really feel sorry for his graduate students at the time. Reading Moby Dick is hard enough without having to keep count of the number of times the letter "i" is used.
For whatever reason, Moby Dick is still one of the most popular training sets of the English language today. For examples of that, check out these intro to cryptography lecture slides from both Stanford and Colombia University. The question is, does Moby Dick represent the English language as a whole? Well if you compare it to Twitter posts, you are probably going to notice some differences. I may be out on a limb here, but I'm going to guess that Melville didn't use as many emoticons and tinyurl's as your average Twitter user ;)
The next response then is to collect as many examples of English literature as possible and analyze the entire set as a whole. Poetry, novels, wikipedia posts, Webster's dictionary, web pages - they all go into the pot. The problem then is that the target set you are really interested in might get buried under all other chaff. On a side note concerning that last link: the ladders must be an attractive site to hackers. A huge collection of personal information, including passwords, resumes, job history, e-mail addresses, etc, of people who are earning 100k, and are slightly gullible. No, that's not worth anything....
So finally to the reason of this post. The same caveat that there is no universal best set of probabilities to use also applies to password cracking. This means there is no .chr probability file for John the Ripper's incremental mode, (or .stat file for Markov mode), that is the best for all situations. That's why it's great that Minga released a custom .chr file trained on the RockYou password set. The original post he made on the John the Ripper mailing list can be found here, and the most updated copy of the .chr file can be downloaded from his website here. John the Ripper's default incremental .chr file was originally trained on mostly Unix passwords. What Minga's custom .chr file does is provide us with a probability set that is targeted against web site users. The question then becomes, how effective is using the RockYou .chr file against other website passwords? Oh, boy it's time to make some new graphs ;)
Test 2.1: Testing various .chr files using John the Ripper's Incremental mode against a subset of the RockYou password list
Full disclaimer: This test is not fair! You almost never want to train/test against the same dataset. It's like stealing the teacher's answer key the night before a big test. That being said, I was interested in just how much of an improvement an attacker would receive by training and testing against the same set. Also, Minga actually created two .chr files. The first one was not trained on duplicate passwords, (aka '123456' was only counted once). The second set, (which is the one now available for download), did count duplicate passwords, (so '123456' appeared several thousand times in the training set). That can have a dramatic impact on the resulting probabilities. I figured this test might demonstrate which approach is better and if so, by how much.
As far as the actual test construction goes, I selected a random subset, (using the GNU shuf command), of one million passwords. This was to add some randomness into the test set so it wasn't an exact duplicate of the full RockYou list. I then ran the original non-duplicate Rockyou list, (Minga_RockYou1), the second RockYou list trained on duplicates, (Minga_RockYou2), John the Ripper's default All incremental mode, (Default_All), and a training set based on passwords from the Phpbb list, (Phpbb) for one billion guesses each to see how many passwords they cracked. I selected the 'All' subset for the default attack even though "Alnum" would perform better over the first billion guesses simply because the RockYou and Phpbb .chr files included uppercase, and special characters. If I find the time, I might create some special .chr files using those datasets that only include lowercase letters and numbers in the future and then re-run these tests.
Analysis: As expected, the second publicly available Rockyou .chr file available from Minga performed the best. I decided to to add some additional data to the graph since total passwords cracked tends to leave out the fact that as a password cracking session goes on, it takes many, more guesses to crack each additional password. Aka, cracking 42% of the test set represents a much larger accomplishment than cracking 40% of the set. To that end, I added vertical lines representing when each of the other modes cracked as many passwords as JtR's default mode. As you can see, the PhpBB trained attack cracked the same number of passwords with less than half of the guesses the Default mode required.
Test 2.2: Testing various .chr files using John the Ripper's Incremental mode against the PhpBB.com list
Same attack setup as last time, but now we are attacking the phpbb.com list that was released about a year ago. Once again, since the phpbb .chr file was trained on this list, it's not a fair test, but at least we can see how the RockYou trained .chr file fares against JtR's Default .chr file. Also, I'm only using the more recent RockYou list since it performs better than the first version.
Analysis: As expected, the phpbb training did the best. What's notable though was that the RockYou trained attack also performed significantly better than John the Ripper's default probabilities. Combined with the previous test, this implies that the passwords from the two different sets share a lot in common, and that they also differ from Unix passwords. This leads us to the next test...
Test 2.3: Testing various .chr files using John the Ripper's Incremental mode against the Faithwriter's list
Faithwriter's was a Christian website that stored its passwords in plain text, and was broken into and publicly posted online by the 4chan Anonymous crowd. Let's see how it does using the same three training sets from the previous test.
Analysis: Wow, it looks like web-based passwords do share a lot in common. This was the first fair test where the test set was completely separate from all three training sets. Once again, the Phpbb, and RockYou trained attacks continued to outperform the default JtR trained attack. It's interesting to note that the RockYou set did the best, so there are differences between web-using groups. I would imagine that the Phpbb users tended to skew older and more computer literate, (it was a development website for the Phpbb forum software), than the RockYou users. This isn't a knock against the FaithWriter's users, but I'm just pointing out that bloggers, (from the RockYou set), and writers, (from the FaithWriters set), might share similar characteristics.
Test 2.4: Testing various .chr files using John the Ripper's Incremental mode against the MySpace list
To end this post up, I figured I might as well run the attacks against the grandfather of all disclosed password lists, the MySpace list that showed up a few years ago. This list was the result of a rather clever phishing attack where the attacker failed to password protect their drop-box, (ironic, I know). What makes this list special is that it was one of the first publicly disclosed password lists and you still see it mentioned a lot in password analysis papers. More importantly, since the RockYou list included MySpace users, this seems like the very definition of an overlapping user-base. Aka, if you are attacking a MySpace user, how much of an improvement should you expect to get using the RockYou probabilities, vs. the Default or Phpbb probabilities?
Analysis: At first glance it may appear that all of the training sets did poorly, but it's important to note that MySpace has a password creation policy, (at the time of the attack, passwords had to be at least six characters long and have one non-lowercase alpha symbol), and there is a lot of non-passwords in the list due to people telling the phisher what they really thought about him/her. Therefor you wouldn't expect a brute-force style attack to perform as well. As you can see though, the RockYou trained attack performed significantly better than both of the other attacks.
Conclusion/Notes: What I really need to do is run some additional tests since I feel that the current slate of tests above unfairly shows the default John the Ripper probabilities in a bad light. What would be nice would be to run the same attacks against passwords from computer log-ins, (both Windows and Unix/Linux). I would expect the default John the Ripper .chr files to perform much better in that case. The problem is I don't have very many passwords like that due to the nature of my research, (I collect passwords that are disclosed publicly online). Basically I have more MD5's, Sha1, Phpbb3, and Vbulletin hashes than I know what to do with, a couple of LanMan, (old Windows Login hashes), and a very small scattering of other password hashes, (NTLM, Cisco, Crypt3, etc). I've been archiving posts from hashcracking forums so the next step is to sift through them for examples of other log-ins. They won't be part of some nice data-set, but it will give us a better idea of how people create passwords for a computer log-in vs. a website log-in.
In conclusion, if you are auditing website passwords, you should probably use the .chr file Minga provided when you run an incremental attack using John the Ripper.