Optimizing John the Ripper's "Single" Mode for Dictionary Attacks
While I've been doing a lot of analysis, I figure it's been a while since I actually released anything. That obviously needs to change. As one small step in the right direction, I decided to optimize John the Ripper's "Single" mode word mangling rules for use in normal dictionary based attacks. If you don't want to read through the rest of this post on my methodology, you can grab the new rule-set right here. To make use of it in a cracking session, simply enter the flag:
-rules=Modified_Single
For a more detailed explanation on what I did, please read on.
The Problem:
First of all, did you know that starting with John the Ripper version 1.7.4 you can have multiple rulesets in the same john.conf config file? Also SolarDesigner added a several new mangling rules, (such as the ability to insert/append whole strings), and increased the speed at which the mangling rules generate guesses. I know the current 1.7.5 branch is still not considered the stable version, but if you are using John the Ripper, I highly recommend upgrading to it.
Perhaps the biggest change from a research perspective was that the single mode ruleset can now be used with normal dictionary based attacks. Let me back up and quickly summarize John the Ripper's three default modes:
- Single Mode: Originally designed to only work with the GECOS field information, (mostly just the username + full name of the individuals who the password hashes belong to); Think of it as a dictionary attack with a large number of word mangling rules but a very small dictionary. This is the default first attack that John the Ripper runs.
- WordList Mode: Your typical dictionary based attack. The default John the Ripper mangling rules were designed to finish very quickly, so they are highly optimized, but only produce a limited number of guesses, (on average around 40 guesses a word -not a scientific number, but more of a general guesstimate). This is the default second attack that John the Ripper runs.
- Incremental Mode: An "intelligent" brute force attack. What John the Ripper falls back on if nothing else works.
One problem I've always had in my research was finding an "honest" comparison when evaluating dictionary based attacks. In the professional password cracking industry, people are very reluctant to disclose their custom mangling rules. That pretty much left me with just John the Ripper's default wordlist mode. The problem was, even with a larger input dictionary, the default rules don't generate very many guesses, (For example, there is no rule that will try all two digit numbers at the end of a password guess). This made comparisons against longer cracking sessions quite difficult since I didn't have anything to compare my methods against. While I could always design a longer set of mangling rules myself, that becomes a little "iffy" from a research perspective. Aka how do people know that I put forth an honest effort to make a fair set of mangling rules vs intentionally creating mangling rules that perform horribly in order to make my own cracking techniques look better. Single mode helps solve this problem since now I can model longer password cracking sessions against an "industry standard attack".
This also helps a majority of the users of John the Ripper who don't want to create their own word mangling rules. Since for simple password hashes such as MD5, the time it takes for the default wordlist mode to finish generally is measured in minutes, (if not seconds), depending on the input dictionary, the ability to use single mode rules allows people to run more extensive dictionary based attacks before they are forced to fall back to incremental mode.
The problem is that the single mode ruleset was designed to work with usernames, not dictionary words. This means the order in which it tries guesses is not tailored to most input dictionaries. As an example of that check out Fig 1.1 of a password cracking session using the single mode rule-set and the dic-0294 input dictionary, (around 860k words), run against a list of one million passwords selected from the RockYou dataset divided by minimum password length. Note: The cracking session was limited to 1 billion guesses.
Figure 1.1:
BTW, you'll probably be seeing this graph in a later post as well...
The thing that really stands out is that the cracking session looks like a series of steps. What this means is that some effective mangling rules are not being tried until later in the password cracking session. While this still isn't a huge deal if fast hashes like MD5 are being attacked, it does become an issue if more computationally expensive hashes are targeted. In short, you want to front-load your cracking session to crack as many passwords as quickly as possible.
The Solution:
Basically, I set out a modest goal for myself. Reorder the Single mode rules to make for a more optimized dictionary based cracking attack.
The first problem was identifying how effective each mangling rule was. To do this I inserted a new mangling rule after each mangling rule in JtR's single ruleset. That rule is as follows:
A0"THISISABENCHMARK!!"'I
All it does is delete whatever word that was supposed to be printed from the input dictionary and outputs "THISISABENCHMARK!!" That way I can easily tell when a new rule is started. Admittedly, depending on the size of the dictionary, several million of these debug lines may be printed but that's easy to deal with.
After that is was simple a matter of modifying my verification program to take input from John the Ripper, and keep track of 1) How many passwords each rule cracked, and 2) How many guesses each rule generated. This lead to a passwords cracked per guess metric:
passwords_cracked_per_guess = (#_of_password_cracked) / (#_of_guesses)
In the future I'll probably normalize the equation so the metric is independent of the number of passwords in the training set being attacked. For now though it doesn't really matter. One other minor point: I reset the password list with each new rule. That way a single rule won't mark the same password being cracked twice, but if two different rules would crack a password, they both get credit for it.
After that, the next question was which input dictionaries and password sets should I tailor it to. As you can imagine, both of those choices can make a fairly large impact on which rules are considered the most effective. For the target set, the RockYou list seemed like an obvious choice. I actually used a subset of the RockYou list of one million passwords I designated for training purposes, (that way I'm not testing/training against the same passwords). For the input dictionary I was very tempted to use one of my custom ones, but I actually settled on using dic-0294 instead since that's readily available and most likely represents the type of dictionary a majority of people are going to use.
After that it was pretty straightforward. Run my cracking session against the training set, evaluate how effective each rule was, and then re-order the rules. The results of using the reordered rules when cracking one million test passwords, (different from the training passwords), from the RockYou list can be seen in Fig 2.1.
Figure 2.1:
So as the results show, while the modified version of the single rules starts out slightly better, the original version cracks more passwords in the first one billion guesses .... Wait, What!! That wasn't supposed to happen.
While eventually both rule-sets will crack the exact same number of passwords, I certainly expected the modified version to perform much better over a majority of the cracking session. At first I thought it might be some differences in the training/test password sets, but quickly discarded that idea as one million passwords should be enough for the law of large numbers to come into effect. Looking into it more I quickly discovered that the problem was with my replacing cracked passwords between each different rule when calculating the passwords cracked per guess. The way that the rules and the input dictionary interacted creates a huge number of duplicate guesses. For example, both 'password', and 'Password' are in dic-0294. Therefore the guesses 'Password', 'Password1', 'Password2', etc are created twice, once when trying the words with no capitalization, and once when the first letter is capitalized.
Duplicate guesses are a fact of life when performing most dictionary attacks. While there are ways to minimize them, the benefits of having a pre-mangled input dictionary can sometimes be quite high since it can allow you to crack passwords with context sensitive mangling rules, such as 'passWord', much faster. The reason I'm writing about the above test, (vs. just sweeping it under the rug), is I feel even a failed test can provide a lot, if not more, information than a successful test. That's why I love science.
So the next step was to re-run my calculations, but this time without performing replacement of cracked passwords between different rules. Ideally I should have done this several times, (re-ordering the rules, and recalculating their effectiveness), until a local maximum was reached, but I was lazy and just did it once. The results of running these re-ordered rules andt the original single rule-set against the RockYou test list can be seen in Fig 2.2:
Figure 2.2:
OK, this time the results turned out how I expected it to, with the modified rule-set performing as well as or better than the original single rules. That being said, I think this highlights how much credit should go to SolarDesigner for his work creating the original Single rule-set, since the improvement, while noticeable in the first 500 million guesses, was not overly dramatic. To improve the cracking effectiveness even more, changes would need to be made to the individual rules, vs. just re-ordering them.
As a preview of the probabilistic password cracker I've been working on, currently called UnLock (though I'm desperately looking for a new name since that sounds a bit presumptuous), Fig 3.1 shows it attacking the RockYou test list using the same input dictionary dic-0294. Also, since UnLock allows you create rules by training it on disclosed password lists, I trained this attack on the MySpace password set, which actually wasn't ideal since MySpace required users to include one non-lowercase letter in their password.
Figure 3.1:
Finally, I'll talk about this more later when I write a post on using modified edit distance to evaluate input dictionaries, but the estimated maximum number of passwords possible for Dic-0294 to crack when compared against the RockYou training list was 60.59%. An attacker would achieve that if they tried every combination of case-mangling rules, common letter replacements, combining two words, and appending/pre-pending numbers + special characters. Aka '7123PaS$w0rd$@12' would be included, even though no real life cracking session would likely every create that guess.
Comments