Saturday, April 24, 2010

Optimizing JtR's Single Mode Follow Up

Over at the John the Ripper mailing list, (I'm sure you already belong to it, right?!), Solar Designer, the creator of JtR, raised the following question about the re-ordered Single Mode rule-set I released last night.

It is not clear whether you have full (or any) separation between your training and test sets when you re-order the rules. (You do say that you have such separation for your "UnLock" test, but that's another one.) In other words, the improvement from "Original Single Rules" to "Edited Single Version 2" that you've demonstrated might be partially attributable to you training (re-ordering) the rules on the same set that you later test them on.

It's a valid question and it's something I've worried about myself. Referring back to my original post:

For the target set, the RockYou list seemed like an obvious choice. I actually used a subset of the RockYou list of one million passwords I designated for training purposes, (that way I'm not testing/training against the same passwords).

I should have written more about my methodology. Basically, the RockYou set represents a huge number of passwords, (over 32 million of them). One of my concerns, though, has always been over-training my password cracking techniques. As Solar pointed out, you can easily create a highly optimized set of rules that cracks one set of passwords extremely well but performs poorly against everything else. To avoid that, and to get the most use out of the RockYou list, I decided to follow typical machine learning practices and split the list up into chunks of one million passwords each. To keep any ordering in the original list from biasing the chunks, I first randomized the full list using the GNU shuf tool, and then divided it into 32 sub-lists containing one million passwords each. I refer to these as the RockYou1-32 lists. So far I have designated five of the sub-lists as test lists, RockYou_test1-5, and five of the sub-lists as training lists, RockYou_training28-32. I've been assigning them from the ends of the numbering and working my way toward the middle so I don't get confused about which ones I used for training vs. testing. The remaining lists I'm saving for future tests, so that I don't "taint" them with my experiments.
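For what it's worth, here is a quick sketch of that shuffle-and-split step. I actually did it with GNU shuf from the command line, so treat this Python version as an illustration of the process, (the file names are made up), rather than the exact commands I ran:

import random

CHUNK = 1000000  # one million passwords per sub-list

# Read the full RockYou list, one password per line. The latin-1
# encoding avoids choking on the odd non-UTF8 bytes in the list.
# Note this reads everything into memory, which is fine for a one-off.
with open("rockyou_full.txt", encoding="latin-1") as infile:
    passwords = infile.read().splitlines()

# Randomize the ordering of the full list, equivalent to running
# it through GNU shuf.
random.shuffle(passwords)

# Divide the shuffled list into 32 sub-lists of one million each.
for i in range(32):
    sublist = passwords[i * CHUNK:(i + 1) * CHUNK]
    with open("RockYou%d.txt" % (i + 1), "w", encoding="latin-1") as out:
        out.write("\n".join(sublist) + "\n")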

Still, since some users had multiple accounts on RockYou due to the way it was set up, it's highly possible, (if not almost certain in some cases), that different passwords from the same user appear in both a training set and a test set. That's also why I would love to get my hands on the original list that included usernames, so I could make sure this doesn't happen. Since there is bound to be some "cross-pollination" though, it's a very valid concern that an attack trained on one of the RockYou training sets and tested against one of the RockYou test sets benefits from unfair knowledge and doesn't accurately represent real password cracking attacks.

All the experiments I've run so far have indicated that the above isn't a major problem, but rather than take my word for it, Fig 4.1 shows another round of tests comparing the original Single rule-set vs. the reordered rule-set I released the other night. The input dictionary remains the same, (dic-0294), but this time both attacks target the phpbb.com list.

Figure 4.1:

In the above test, the rearranged version of the Single rule-set performed better, but this time to a much lesser degree. I then ran both of the attacks against 33k passwords from the MySpace password test list, (considering the MySpace list was the first set of real passwords I collected, it also was split into a training list and a test list, each containing around 33k passwords). The results of those cracking sessions can be seen in Fig 4.2.

Figure 4.2:

This time the results are much more dramatic over the first 500 million guesses, with the modified ruleset performing significantly better. So we've now seen, against three different sets of English website passwords, (the RockYou_test1 list, the phpbb.com list, and the MySpace test list), that the re-ordered single mode mangling rules cracked as many, if not more, passwords over the first 500 million guesses than the original rule-set.

If you are interested in downloading the modified rule-set, you can still grab it here. To use it, just enter the command line option "-rules=Modified_Single".

Thursday, April 22, 2010

Optimizing John the Ripper's "Single" Mode for Dictionary Attacks

While I've been doing a lot of analysis, I figure it's been a while since I actually released anything. That obviously needs to change. As one small step in the right direction, I decided to optimize John the Ripper's "Single" mode word mangling rules for use in normal dictionary based attacks. If you don't want to read through the rest of this post on my methodology, you can grab the new rule-set right here. To make use of it in a cracking session, simply enter the flag:

-rules=Modified_Single
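For example, a full dictionary based cracking session with the new rules might look something like this, (the hash file name is just a placeholder):

john --wordlist=dic-0294 --rules=Modified_Single passwords.hash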

For a more detailed explanation on what I did, please read on.

The Problem:

First of all, did you know that starting with John the Ripper version 1.7.4 you can have multiple rulesets in the same john.conf config file? Also, Solar Designer added several new mangling rules, (such as the ability to insert/append whole strings), and increased the speed at which the mangling rules generate guesses. I know the current 1.7.5 branch is still not considered the stable version, but if you are using John the Ripper, I highly recommend upgrading to it.
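As an illustration of the multiple ruleset support, john.conf can now hold several named rule sections side by side, and you pick one at runtime with the -rules flag, (the rules below are just examples, not the actual contents of either section):

[List.Rules:Wordlist]
# capitalize the first letter of each word
c

[List.Rules:Modified_Single]
# try each dictionary word as-is
:
# lowercase the word and append the digit 1
l$1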

Perhaps the biggest change from a research perspective was that the single mode ruleset can now be used with normal dictionary based attacks. Let me back up and quickly summarize John the Ripper's three default modes:
  1. Single Mode: Originally designed to only work with the GECOS field information, (mostly just the username + full name of the individual a password hash belongs to); think of it as a dictionary attack with a large number of word mangling rules but a very small dictionary. This is the first attack John the Ripper runs by default.
  2. Wordlist Mode: Your typical dictionary based attack. The default John the Ripper mangling rules were designed to finish very quickly, so they are highly optimized but only produce a limited number of guesses, (on average around 40 guesses per word; not a scientific number, but more of a general guesstimate). This is the second attack John the Ripper runs by default.
  3. Incremental Mode: An "intelligent" brute force attack. What John the Ripper falls back on if nothing else works.
One problem I've always had in my research is finding an "honest" comparison when evaluating dictionary based attacks. In the professional password cracking industry, people are very reluctant to disclose their custom mangling rules. That pretty much left me with just John the Ripper's default wordlist mode. The problem is, even with a larger input dictionary, the default rules don't generate very many guesses, (for example, there is no rule that tries all two digit numbers at the end of a password guess). This made comparisons against longer cracking sessions quite difficult since I didn't have anything to compare my methods against. While I could always design a longer set of mangling rules myself, that becomes a little "iffy" from a research perspective. Aka, how would people know whether I put forth an honest effort to make a fair set of mangling rules vs. intentionally creating mangling rules that perform horribly in order to make my own cracking techniques look better? Single mode helps solve this problem since now I can model longer password cracking sessions against an "industry standard" attack.

This also helps the majority of John the Ripper users who don't want to create their own word mangling rules. For simple password hashes such as MD5, the time it takes the default wordlist mode to finish is generally measured in minutes, (if not seconds), depending on the input dictionary, so the ability to use single mode rules lets people run more extensive dictionary based attacks before they are forced to fall back on incremental mode.

The problem is that the single mode ruleset was designed to work with usernames, not dictionary words. This means the order in which it tries guesses is not tailored to most input dictionaries. As an example of that, check out Fig 1.1, a password cracking session using the single mode rule-set and the dic-0294 input dictionary, (around 860k words), run against a list of one million passwords selected from the RockYou dataset, divided up by minimum password length. Note: the cracking session was limited to 1 billion guesses.

Figure 1.1:


BTW, you'll probably be seeing this graph in a later post as well...

The thing that really stands out is that the cracking session looks like a series of steps. What this means is that some effective mangling rules are not being tried until later in the password cracking session. While this still isn't a huge deal when fast hashes like MD5 are being attacked, it does become an issue when more computationally expensive hashes are targeted. In short, you want to front-load your cracking session to crack as many passwords as possible, as quickly as possible.

The Solution:

Basically, I set a modest goal for myself: reorder the Single mode rules to make for a more optimized dictionary based cracking attack.

The first problem was identifying how effective each mangling rule was. To do this, I inserted a new benchmark rule after each existing rule in JtR's single ruleset. That rule is as follows:

A0"THISISABENCHMARK!!"'I

All it does is delete whatever word was supposed to be generated from the input dictionary and output "THISISABENCHMARK!!" instead, (the A0"THISISABENCHMARK!!" part prepends the string to the current word, and the 'I command then truncates the guess to 18 characters, the length of the marker, wiping out the original word). That way I can easily tell when a new rule has started. Admittedly, depending on the size of the dictionary, several million of these debug lines may be printed, but that's easy to deal with.
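So a hypothetical excerpt of the instrumented ruleset looks something like this, with the benchmark rule interleaved after every real rule, (the section name is made up, and the ':' and 'c' lines are stand-ins for the actual Single mode rules):

[List.Rules:Single_Benchmark]
:
A0"THISISABENCHMARK!!"'I
c
A0"THISISABENCHMARK!!"'I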

After that it was simply a matter of modifying my verification program to take input from John the Ripper and keep track of 1) how many passwords each rule cracked, and 2) how many guesses each rule generated. This led to a passwords cracked per guess metric:
passwords_cracked_per_guess = (#_of_passwords_cracked) / (#_of_guesses)
In the future I'll probably normalize the equation so the metric is independent of the number of passwords in the training set being attacked. For now, though, it doesn't really matter. One other minor point: I reset the target password list with each new rule. That way a single rule won't get credit for cracking the same password twice, but if two different rules would each crack a password, they both get credit for it.
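Here is a rough sketch of that scoring pass. This is not my actual verification program; the file names and bookkeeping are simplified assumptions, and it assumes the guesses are piped in from John the Ripper's --stdout option:

import sys

MARKER = "THISISABENCHMARK!!"

# Load the target (training) password list. A set makes the lookups
# fast, though it does collapse duplicate passwords into one entry.
with open("RockYou_training32.txt", encoding="latin-1") as f:
    targets = set(f.read().splitlines())

rule_stats = []            # one (guesses, cracked) pair per rule
guesses, cracked = 0, set()

# Guesses arrive on stdin, e.g. piped from:
#   john --wordlist=dic-0294 --rules=Single_Benchmark --stdout
for line in sys.stdin:
    guess = line.rstrip("\n")
    if guess == MARKER:
        # The benchmark rule fires once per dictionary word, so only
        # the first marker in a run closes out the current rule.
        if guesses:
            rule_stats.append((guesses, len(cracked)))
            # Resetting the cracked set here is the "with replacement"
            # scoring described above; removing cracked passwords from
            # targets instead gives the no-replacement version.
            guesses, cracked = 0, set()
    else:
        guesses += 1
        if guess in targets:
            cracked.add(guess)

if guesses:                # close out the final rule, just in case
    rule_stats.append((guesses, len(cracked)))

# Rank the rules by passwords_cracked_per_guess, best rule first.
ranked = sorted(enumerate(rule_stats),
                key=lambda item: item[1][1] / item[1][0], reverse=True)
for rule_number, (num_guesses, num_cracked) in ranked:
    print(rule_number, num_cracked / num_guesses)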

After that, the next question was which input dictionary and password set I should tailor the rules to. As you can imagine, both of those choices can have a fairly large impact on which rules are considered the most effective. For the target set, the RockYou list seemed like an obvious choice. I actually used a subset of the RockYou list, one million passwords I designated for training purposes, (that way I'm not testing/training against the same passwords). For the input dictionary I was very tempted to use one of my custom ones, but I settled on dic-0294 instead since it's readily available and most likely represents the type of dictionary a majority of people are going to use.

After that it was pretty straightforward: run my cracking session against the training set, evaluate how effective each rule was, and then re-order the rules. The results of using the reordered rules to crack one million test passwords from the RockYou list, (different from the training passwords), can be seen in Fig 2.1.

Figure 2.1:

So as the results show, while the modified version of the single rules starts out slightly better, the original version cracks more passwords in the first one billion guesses .... Wait, What!! That wasn't supposed to happen.

While eventually both rule-sets will crack the exact same number of passwords, I certainly expected the modified version to perform much better over the majority of the cracking session. At first I thought it might be due to differences between the training and test password sets, but I quickly discarded that idea since one million passwords should be enough for the law of large numbers to come into effect. Looking into it more, I quickly discovered that the problem was with my replacing cracked passwords between each rule when calculating the passwords cracked per guess. The way the rules and the input dictionary interact creates a huge number of duplicate guesses. For example, both 'password' and 'Password' are in dic-0294. Therefore the guesses 'Password', 'Password1', 'Password2', etc. are created twice: once when trying the words with no capitalization, and once when the first letter is capitalized.

Duplicate guesses are a fact of life when performing most dictionary attacks. While there are ways to minimize them, the benefits of having a pre-mangled input dictionary can sometimes be quite high, since it can allow you to crack passwords with context sensitive manglings, such as 'passWord', much faster. The reason I'm writing about the above test, (vs. just sweeping it under the rug), is that I feel even a failed test can provide as much, if not more, information than a successful one. That's why I love science.
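If you're curious how bad the duplication actually is for a given dictionary and ruleset, a quick, (if memory hungry), check is to count repeats in the guess stream; here's a sketch along the same lines as the scoring script above:

import sys
from collections import Counter

# Count how often each guess shows up, e.g. piped from:
#   john --wordlist=dic-0294 --rules=Single --stdout
counts = Counter(line.rstrip("\n") for line in sys.stdin)

total = sum(counts.values())
unique = len(counts)
print("total guesses:    ", total)
print("unique guesses:   ", unique)
print("duplicate guesses:", total - unique)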

So the next step was to re-run my calculations, but this time without performing replacement of cracked passwords between different rules. Ideally I should have repeated this process several times, (re-ordering the rules and recalculating their effectiveness), until a local maximum was reached, but I was lazy and just did it once. The results of running these re-ordered rules and the original single rule-set against the RockYou test list can be seen in Fig 2.2:

Figure 2.2:

OK, this time the results turned out how I expected them to, with the modified rule-set performing as well as or better than the original single rules. That being said, I think this highlights how much credit should go to Solar Designer for his work creating the original Single rule-set, since the improvement, while noticeable in the first 500 million guesses, was not overly dramatic. To improve the cracking effectiveness even more, changes would need to be made to the individual rules, vs. just re-ordering them.

As a preview of the probabilistic password cracker I've been working on, currently called UnLock, (though I'm desperately looking for a new name since that sounds a bit presumptuous), Fig 3.1 shows it attacking the RockYou test list using the same input dictionary, dic-0294. Also, since UnLock allows you to create rules by training it on disclosed password lists, I trained this attack on the MySpace password set, which actually wasn't ideal since MySpace required users to include at least one non-lowercase letter in their passwords.

Figure 3.1:

Finally, I'll talk about this more later when I write a post on using modified edit distance to evaluate input dictionaries, but the estimated maximum percentage of passwords that dic-0294 could possibly crack against the RockYou training list was 60.59%. An attacker would only achieve that if they tried every combination of case-mangling rules, common letter replacements, combining two words, and appending/pre-pending numbers + special characters. Aka, '7123PaS$w0rd$@12' would be included, even though no real life cracking session would likely ever create that guess.

Monday, April 5, 2010

State of the Blog: April Edition

*Comic courtesy of http://www.phdcomics.com/

Well, it looks like after three years of work, I did it. I'm still putting some last minute touches on my dissertation but once that's finalized I'll post a copy. The crazy thing is that after going through all of that, I'm actually more motivated to do research.

So that leads us to this blog. Don't worry, it's not going away. In fact, one of my prerequisites for any new job I take is that I need to be allowed to keep updating here. As far as posts go, I'm going to be shifting away from brute force attacks and start talking about dictionary attacks instead. I know, I've maintained this blog for over a year and I'm only getting to that now...

Let me explain:

1) I'm lazy

2) My main area of study has been designing new ways to represent how people create passwords using probabilistic context free grammars. At its heart, this approach is an improvement to standard dictionary based attacks, though I'm currently expanding it to incorporate brute force attacks as well. What it really is, though, is a fundamentally different way of looking at a password cracking attack compared to traditional rule based methods. This has led to a bit of a problem when talking about traditional dictionary attacks though. Creating a new 10,000 rule John the Ripper config file just doesn't hold that much appeal to me -ed note: yes it does. Ok, what I should say is that it's hard for me to put my full heart into that when all my experiences using probabilistic grammars have been so superior. The issue, though, is that I've been struggling with a way to convey why they're useful beyond the fact that they can crack more passwords with fewer guesses than traditional methods. Hmm, actually that might work...

Probabilistic Context Free Grammars: More passwords cracked - Fewer Guesses

You can see an older version of it here, though currently that's about a year out of date. Refer to point #1 above. What I really need to do is clean up a newer version and post it online. Right now it's full of debugging code, the user interface is almost non-existent, and the case-mangling algorithm looks like it was written by the lead developer of Windows ME. But it does crack passwords.

I expect the next couple posts will probably follow along the lines of:
  1. A survey of existing dictionary based attacks available in current password cracking tools
  2. Optimizing JtR's single mode rules for longer password cracking sessions
  3. Using Edit Distance to reverse mangle passwords
  4. The limits of rule based attacks - Alt title: 2 million rule config files are a real pain
  5. Releasing a new version of my JtR config generator - Only one year after I said I would. Refer to point #1 above
  6. Probabilistic password cracking
That doesn't include all of the other topics I might post about depending on my mood. For example, I'm really tempted to do a detailed write-up of all the cloud based password cracking options available. Interesting times.