Sunday, January 24, 2010

Analysis of 10k Hotmail Passwords Part 6: Markov Model Showdown 2 - The Rematch

I know, soon my titles will get so long I won't be able to fit them into a Twitter post. It was all I could do to leave off a tagline such as "Revenge of the Incremental Mode."

I just received an e-mail from SolarDesigner, the creator of John the Ripper, who promptly set me straight on a few points about how the incremental attack works. I'm going to break down the e-mail into two parts. The first part is as follows:

Matt,
In your very interesting blog post at:
http://reusablesec.blogspot.com/2009/11/analysis-of-10k-hotmail-passwords-part.html

you made some incorrect statements/guesses about the incremental mode:

"Unfortunatly it doesn't take into account the previous trigraphs that appeared before it,"

Not true.

"(except when calculating the overall probability)."

No idea what you mean here.

"For example, if the first trigraph is "auq", the next trigraph's probability isn't increased if it starts with a 'u',"

Actually, if the first trigraph is "auq", then as far as inc.c is concerned the next trigraph starts with "uq", and there's one letter to be determined.

"(even though 'u' almost always follows 'q')."

If "u" commonly follows "uq" at that character position in passwords of that length in the training set, then it will be tried by JtR early on.

"Therefore, the incremental attack doesn't measure the conditional probability between the third and fourth letter, and the sixth and seventh letter of a password guess."

That's not true. It treats trigraphs at all positions in the same way. It also has special handling for the first and the second character position, before the first trigraph can be formed, but then it just proceeds with overlapping trigraphs till the full length is reached.

"The biggest advantage of using the incremental attack is that it creates highly optimized guesses while still retaining the ability to eventually cover the entire key-space if you give it enough time."

Right, and that's probably the cause of the difference you're seeing as well. This ability is achieved by having the same number of character indices possible for each character position. This works great early on, but it might be suboptimal after a while - where it might make sense to try, say, 50 different character indices for a certain position, but only 10 for another one. This is cured to a limited extent with the "cracking order" table (which includes not only the length and the character count, but also the "fixed character index" position number), but perhaps further improvement in this area is desired.

I think you could hack inc.c such that you'd achieve results similar to those of the Markov mode. Specifically, drop expand() and adjust inc_key_loop() to remove the assumption that characters exist for all indices. It'd be tricky to skip those incomplete-info indices quickly, though - but in terms of the first 1G guesses produced (if you don't care how much time it'd take to produce those), I think you'd achieve results very similar to Markov's.

I'm very happy I was corrected about this, and I have modified my previous post to be less wrong. The original copy of my post can still be found in the comments since I do occasionally try to practice some journalistic ethics. Also, if I ever say anything else that doesn't match up with reality, please let me know. I'd rather be shown to be wrong than to remain ignorant. That's why I have a blog with comments and an e-mail address on the side ;)

Now onto the second portion of SolarDesigner's e-mail:
Another related thought I had:
When you compare Incremental vs. Markov (or the like), you could want to run through --single and at least a small wordlist such as password.lst with rules first. That would reflect the real-world case those modes need to be optimized for. It's certainly what I was optimizing for. I had slightly different revisions of the incremental mode early on that would happen to crack more passwords sooner when run with empty john.pot, but would work worse after single and wordlist w/rules; I rejected those revisions (never released).

I don't expect this change in your testing to affect your results in a major way (I pointed out another likely cause for the difference you've observed above), yet this is something for you to try next time (for both modes, indeed).

Similarly, when you compare different wordlists and different wordlist rulesets, it makes sense to do so after having run --single first (keep and reuse the same john.pot with after-single results). Again, this is what I had in mind when experimenting with the default wordlist ruleset (over 10 years ago).
That's a good point, which of course leads us to more tests:

Test 9: Testing JtR's Markov and Incremental Modes After a Simple Dictionary Attack

Unfortunately I'm not going to run --single mode first for this test. As a quick explanation, single mode uses the username/e-mail addresses/related information as a very small targeted input dictionary and then runs a large number of advanced word mangling rules against the password set. By default, this is the first attack that John the Ripper runs during a password cracking session. The reason I'm not going to run it is that I don't store the user information with the password sets I analyze. This isn't because of some secret l33t password cracking technique, but instead due to the fact that I'm trying to remain ethical and respect the privacy of the people in these lists. While I will look at username/password info initially when I get a new list so I can get a better handle on how the list is structured, I separate the info as soon as possible. That's because I don't want to know that bob@yahoo.com's password is 'Ihatemyboss'. In short, I'd rather penalize my cracking techniques than deal with those issues. I might grab the original list and run a -single attack in the future since as SolarDesigner points out, that is very useful information, but right now it's just too much of a hassle, which is the way I meant it to be.

So that disclaimer aside, the next question is how I should run my dictionary attack. To keep things simple, for this first test I used the default John the Ripper ruleset and the default passwords.lst wordlist. With that dictionary/rule combo, I ended up cracking 512 passwords from the Hotmail dataset using just 139k guesses. What can I say? It's a small input dictionary and JtR's default rules are highly targeted, (aka they are designed to finish quickly). Also, most of the Hotmail passwords are Spanish/Portuguese, and passwords.lst is and English dictionary. I'll get more into that once I finally write that post I've been promising you about dictionary attacks on this list. Just for comparison, below is a graph representing the passwords cracked using a dictionary attack, vs an incremental attack and a Markov attack as detailed in Test 8 from my previous post. Aka, both of the brute force attacks were trained on the Phpbb password set, and the Markov attack used a limit of 215 with a maximum of 8 characters.

Results Part 1:


Analysis Part 1: It shouldn't come as a surprise that the dictionary attack performed better. As I stated previously there were a lot of factors that kept it from performing as well as it would against an English based password list, (the main one being it helps to use a dictionary targeted against the language of the set. It's a complicated concept I know). That being said, notice the sharp drop off in passwords cracked. That shows the dictionary attack was able to crack a lot of passwords using no mangling rules, or just adding '1' to the end of the guess. That's partially an artifact of the dictionary, as it has many common password pre-mangled in it, such as "abc123", and the fact that many people didn't use any mangling rules to create their passwords. That's actually one of the nice things about the dictionary passwords.lst since having pre-mangled passwords in the dictionary helps the attacker crack the easiest passwords quicker. Oh, and for the Markov mode, no I didn't forget to add it to the graph. It only cracked 4 passwords in the first 140k guesses since I didn't re-target it for the smaller cracking session. As I said before, depending on how you set the limit size it takes a while to catch up to other cracking methods.

Results Part 2: Ok, so after running the dictionary attack, I then re-ran the Markov and Incremental attacks as described previously on the uncracked passwords. Here are the results - Note, I attributed the 512 passwords previously cracked to both of them:


Analysis Part 2: Well, the results are pretty much the same. This is to be expected since the dictionary attack only cracked 512 passwords, (aka the largest possible improvement would occur if all of those passwords were guesses that Markov mode tried, but Incremental mode didn't. Even that would only help Incremental mode catch up by 512 passwords). In fact, most of the passwords that were cracked in the dictionary attack would also have been cracked in the Markov/Incremental attacks given 1 billion guesses. By this I mean the Dictionary attack only cracked 28 passwords the Markov attack didn't, and 4 passwords the Incremental mode didn't. Please note, this isn't a knock against dictionary based attacks. What a dictionary attack lets you do is crack those passwords much, much sooner, which is the reason why I included the previous graph in part 1. That leads us to the next test...

Test 10: Testing JtR's Markov and Incremental Modes After a Longer Dictionary Attack

Everything pretty much stays the same as the previous test, except that I ran the hotmail list through multiple dictionaries first using John the Ripper's default rule sets. The dictionaries include the previously mentioned passwords.lst, along with dic-0294, the Spanish wordlist off of John the Ripper's ftp site, and a Spanish and Portuguese wordlist gathered off of wiktionary, (this isn't Sebastian Raveou's one though I'll get to a specialized dictionary he provided me in a later post).

Results: I'm not going to include a graph of the dictionary attack since I was lazy and didn't bother to merge the dictionaries and remove duplicate words, (I just ran each one through independently). All together they cracked 2,104 passwords from the Hotmail set. Once that was done I re-ran the Markov and Incremental mode attacks against the list. Here are the results:

Analysis: Once again, the results look pretty much the same, though this time the dictionary attack cracked many more passwords that were not cracked during the brute force attacks. In fact, this run raised the success rate of the dictionary attack + Markov mode to slightly more than 50% of the total set cracked.

Coming Up: I know interest is mostly focused on the Rockyou list right now, but I do want to continue with the Hotmail list simply because combined with all the previous analysis, I think we're getting a better view of the effectiveness of different password cracking techniques. Aka, when I finally get around to doing a write-up on dictionary based attacks you will be able to compare it to all the previous brute force methods I've looked at. Don't worry, I'll still cover the Rockyou list. It's just that I might be switching between the two lists for the next couple of posts.

No comments: