Wednesday, August 19, 2015

Challenges with Evaluating Password Cracking Algorithms

"In theory, theory and practice are the same. In practice they are not" -Quote from somebody on the internet. Also attributed to Albert Einstein but I've never been able to find the original source to back that up.


Currently I'm writing a post looking into Hashcat's Markov mode, but I found myself starting off by including several paragraphs worth of disclosures and caveats. Or to put it another way:

It was a valid point Jeremi brought up and it's something I'm trying to avoid. After thinking about it for a bit I figured this topic was worth its own post.

Precision vs Recall: 

Part of the challenge I'm dealing with is that I'm performing experiments vs writing tutorials. That's not to say I won't write tutorials in the future, but designing and running tests to evaluate algorithms is fun and it's what I'm interested in right now. The reason this can be a problem is that I can get so deep into how an algorithm works that it's easy to lose sight of how it performs in a real life cracking session. I try to be aware of this, but an additional challenge is presenting these investigations to everyone else in a way that isn't misleading.

This gets into the larger issue of balancing precision and recall. In a password cracking context, precision measures how effective each individual guess is at cracking a password. The higher your precision, the fewer guesses on average you need to make to crack a password. As a rule of thumb, if you see a graph with number of guesses on the X axis and percentage of passwords cracked on the Y axis, it's probably measuring precision.

An example of measuring the precision of different cracking techniques
Recall on the other hand is the total percentage of passwords cracked during a cracking session, regardless of how many guesses are made. Usually this isn't represented in a graph format, and when it is, the X axis will represent time rather than number of guesses.

Courtesy of KoreLogic's Crack Me If You Can contest. This represents a recall-based graph
It's tempting to say that "Precision" is a theoretical measurement and "Recall" is the practical result. It's not quite so clear cut though, since the "time" factor in password cracking generally boils down to "number of guesses". In an online guessing scenario an attacker may only be able to make 10-20 guesses. With a fast hash, an offline attack, and a moderate GPU setup, billions of guesses a second are possible and an attack might run for several weeks. Therefore recall results tend to be highly dependent on the particular situation being modeled.
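As a quick back-of-the-envelope example of how far apart those scenarios are, (the guessing rate below is an illustrative assumption, not a benchmark):

gpu_rate = 10 ** 9             # assumed: ~1 billion guesses/sec against a fast hash
two_weeks = 2 * 7 * 24 * 3600  # seconds in a two week cracking session
print(gpu_rate * two_weeks)    # ~1.2e15 guesses, vs 10-20 total for an online attack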

Now it would be much easier to switch between "Precision" and "Recall" if there was a direct mapping between number of guesses and time. The problem is, not all guesses take the same amount of time. A good example of that is CPU vs GPU based guessing algorithms. Going back to John the Ripper's Incremental mode, I'm not aware of any GPU implementation of it, so guesses have to be generated by the CPU and then sent to the GPU for hashing. Meanwhile Hashcat's Markov mode can run on the GPU itself, and in Atom's words "it has to create 16 billions candidates per 10 milliseconds on a single GPU. Yes, billions". This can lead to situations, such as in the case of a very fast hash, where certain attacks might have a higher precision but worse recall.

Amdahl's law and why I find precision interesting:

When trying to increase recall an attacker generally has two different avenues to follow. They can increase the number of guesses they make or they can increase the precision of the guesses they make. These improvements aren't always exclusive; many times you can do both. Often though there is a balancing act as more advanced logic can take time and may be CPU bound. What this means is that you might increase precision only to find your recall has fallen since you are now making fewer guesses. That being said, if the increase in precision is high enough, then even an expensive guessing algorithm might do well enough to overcome the decrease in the total number of guesses it can make.

Often in these optimization situations Amdahl's law pops into my head, though Gustafson's law might be more appropriate for password cracking due to the rate of increase in the number of guesses. Amdahl's law in a nutshell says the maximum speedup you can achieve is always limited by the part of the program you can't optimize. To put it another way, if you reduce the cost of an action by 99%, but that action only accounts for 1% of the total run-time, then your maximum total speedup, no matter how cool your optimization is, would be no more than about 1%.
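Here's Amdahl's law as a quick Python sanity check, (the numbers are just for illustration):

def amdahl_speedup(optimized_fraction, speedup_factor):
    # Overall speedup when only part of the run-time gets optimized
    remaining = 1.0 - optimized_fraction
    return 1.0 / (remaining + optimized_fraction / speedup_factor)

# Speed up an action that accounts for 1% of total run-time by 100x:
print(amdahl_speedup(0.01, 100.0))  # ~1.01, aka about a 1% overall gain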

Where this applies to password cracking is that the cost of a guess in an offline cracking attack can be roughly modeled as:
Cost of making the plain-text guess + cost of hashing + general overhead of the cracking tool
Right now the situation in many cases is that the cost of hashing is low thanks to fast unsalted hashing algorithms and GPU based crackers. Therefore it makes sense to focus on reducing the cost of making the plain-text guesses as much as possible, since that will have a huge impact on the overall cost of making a guess. Aka, trading precision for speed in your guessing algorithm can have a significant impact on the total number of guesses you can make. If on the other hand a strong hash is used, (or you at least are trying to crack a large number of uniquely salted hashes), the dominant factor in the above equation becomes the hashing itself. Therefore a speedup in plaintext generation will not have as much impact on the overall cost, and precision becomes more important.
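Here's a toy model of that equation in Python. All of the costs are made-up numbers purely to show the trend:

def guesses_per_second(guess_gen_cost, hash_cost, overhead):
    # Invert the per-guess cost from the equation above
    return 1.0 / (guess_gen_cost + hash_cost + overhead)

fast_hash = 1e-10  # assumed cost of a fast unsalted hash on a GPU
slow_hash = 1e-3   # assumed cost of a computationally expensive hash

# Effect of halving an assumed guess generation cost of 2e-8 seconds:
for hash_cost in (fast_hash, slow_hash):
    before = guesses_per_second(2e-8, hash_cost, 1e-9)
    after = guesses_per_second(1e-8, hash_cost, 1e-9)
    print(after / before)  # ~1.9x for the fast hash, ~1.00001x for the slow one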

As a researcher, precision is very interesting for me. From a defensive standpoint a good starting place is "use a computationally expensive salted hash". If you aren't at least doing that then the chances are you aren't interested in doing anything more exotic. Also when it comes to contributing to the larger research community, well my coding skills are such that I'm not going to be making many improvements to the actual password cracking tools. Evaluating and improving the precision of different attacks is much more doable.

Carnegie Mellon's Password Guessability Service:

One cool resource for password security researchers is the new Password Guessability service being offered by the CUPS team over at Carnegie Mellon. I'm going to paraphrase their talk, but basically their team got tired of everyone comparing their password attacks to the same default rulesets of John the Ripper, so they created a service for researchers to model more realistic password cracking sessions. If you are interested, their USENIX paper describing their lab setup can be found here. Likewise, if you want to see a video of their Passwords15LV talk you can view it here. More importantly, if you want to go to their actual site you can find it here:

The service itself is free to ethical security researchers, and it is run by students, so don't be a jerk. The actual attacks they run are bound to change with time, but as of right now they are offering to model several different default password cracking attacks consisting of around 10 trillion guesses each. These cracking attacks use Trustwave's public JtR KoreLogic rulelist, several different Hashcat rulesets, an updated Probabilistic Context Free Grammar attack, and another custom attack designed by KoreLogic specifically for this service. All in all, if you need to represent an "industry standard" cracking session it's hard to do better. In fact, it probably represents a much more advanced attacker than many of the adversaries out there, if you assume the target passwords were protected by a hashing algorithm of moderate strength.

I could keep on talking about this service but you really should just read their paper first. I think it's a wonderful resource for the research community and I have a lot of respect for them offering this. So the next question of course is what does that mean for this blog? I plan on using this service as it makes sense without hogging Carnegie Mellon's resources. I need to talk to them more about it, but I expect that I'll have them run it against a subset of the RockYou list and then use, and reuse, those results to evaluate other cracking techniques as I investigate them. If I attack some other dataset though, I may just run a subset of the attacks myself, unless that dataset and the related tests are interesting enough to make using CMU's resources worth it.

Fun with Designing Experiments:

When designing experiments there are usually a couple of common threads I'm always struggling with:
  1. Poor datasets. I know there are a ton of password dumps floating around, but due to the nature of their disclosure there are massive problems or shortcomings with most of them. For example, most of the dumps on password cracking forums or pastebin contain only unique hashes, so '123456' only shows up once, and there is no attribution. Gawker was a site most people didn't care about, plus the hashing algorithm cut off the plaintext after 8 characters and replaced non-ASCII characters with '?'. A majority of the passwords in the Stratfor dataset were machine generated. Myspace, well, that was the result of a phishing attack, so it has many instances of 'F*** You You F***ing Hacker'. Even RockYou is complicated, as the dataset contained many passwords from the same users for different sites, but since there are no usernames connected with the public version of it, that can be hard to sort out. Then there is the fact that most of these datasets were for fairly unimportant sites. I'm not aware of any confirmed public Active Directory dump, (though there are a large number of NT hashes floating about, and this whole Ashley Madison hack may change things with the Avid Life Media NT hashes there). Likewise, while there are some banking password lists, the amount of drama surrounding them makes me hesitant to use them.
  2. Short running time. Personally I like keeping the time it takes to run a test to around an hour or so. While I can certainly run longer tests, realistically anything over a couple of days isn't going to happen, since I like using my computers for other things, and truth be told, it always seems like I end up finding out I need to run additional tests, or that I messed something up in my original setup and need to re-run it. Shorter tests are very much preferred. Add to that the fact that I'm usually modeling precision and running my tests on a CPU system, and it means most of my tests will not be modeling GPU cracking against fast hashes.
  3. What hypothesis do I want to test, and can I design an experiment to test it? I'll admit, sometimes I'll have no clue what the results of a test will be, so I'll pull a YOLO, throw some stuff together, and just run it to see what pops out. That's not ideal though, as I usually like to try to predict the results. I'm often wrong, but that at least forces me to look deeper into which of my assumptions were wrong, and hey, that's why I run tests in the first place.
Furthermore, for at least the next couple of tools I'm investigating, I plan on using both Hashcat and John the Ripper as much as possible. While it might not always make sense to use both of them, since often there isn't an apples-to-apples comparison, I do have some ulterior motives. Basically, it helps me to use both of these tools in a public setting, and I've already gotten a lot of positive feedback from my PRINCE post. It's pretty amazing when the creator of a tool can tell me how to optimize my cracking techniques. My secondary reason for this is to make people more aware of both of these tools. When it comes to the different attack modes, I've found there are a lot of misunderstandings about what each tool is capable of.

That being said, I explicitly don't want to get into "Tool A is better than Tool B" type debates. Which tool you use really depends on your situation. Heck, occasionally I'm glad I still have Cain and Abel installed. I'll admit, this is going to get tricky when I'm doing tests such as comparing Hashcat's Markov mode to JtR's Incremental mode, but please keep in mind that I want to make all the tools better.

Enough talk; Give us some code or graphs or GTFO:

Thanks for putting up with all of that text. In the spirit of showing all my research I'm sharing the tool that I wrote to evaluate password cracking sessions which I'll be using in this blog. The code is available here:

The specific tool I'm talking about, (in the hope that I release multiple tools in the future so it isn't obvious ;p), is called checkpass2. It's a significantly faster version of the old checkpass program I had used and released in the past. The options on how it works are detailed in the -h switch, but basically you can pipe whatever password guess generation tool you are using into it, and it'll compare your guesses against a plaintext target list and tell you how effective your cracking session would have been. For example, if you were using John the Ripper you could use the -stdout option to model a cracking session as follows:
./john -wordlist=passwords.lst -rules=single -stdout | python checkpass2.py -t target.pws -o results.txt
It also has some options, like limiting the maximum number of guesses, or starting the count at a specific number if you want to chain multiple cracking sessions together. There are certainly still a lot of improvements that need to be made to it, but if you like graphs I hope it might be useful to you. Please keep in mind that this isn't a password cracker. Aka, it does not do any hashing of password guesses. So if you want to model a password cracking session against a hashed list you'll need to run two sessions: one to crack the list using the tool of your choice, and a second to run this tool against the cracked passwords to model your cracking session. Since both John the Ripper and Hashcat have logging options, you might want to consider using those instead to save time. Where checkpass2 is nice, for me anyway, is the fact that I can quickly edit the code depending on what I need, so it's easier to do things like format the output for what I'm doing. Long story short, I hope it is helpful, but I still strongly recommend looking into the logging options that both John the Ripper and Hashcat offer.
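To give you an idea of what it does under the hood, here is a minimal sketch of that core loop in Python, (this is illustrative code, not the actual checkpass2 source):

import sys

def run_session(target_file):
    # Load the plaintext targets; duplicates matter, since '123456'
    # showing up 500 times counts as 500 cracks
    targets = {}
    with open(target_file) as f:
        for line in f:
            pw = line.rstrip('\n')
            targets[pw] = targets.get(pw, 0) + 1
    total = sum(targets.values())

    # Read guesses from stdin and log each successful crack
    cracked = 0
    for guess_num, line in enumerate(sys.stdin, 1):
        guess = line.rstrip('\n')
        if guess in targets:
            cracked += targets.pop(guess)
            print("%d\t%d\t%.4f" % (guess_num, cracked, cracked / float(total)))

if __name__ == '__main__':
    run_session(sys.argv[1])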

Monday, December 22, 2014

Tool Deep Dive: PRINCE

Tool Name: PRINCE (PRobability INfinite Chained Elements)
Version Reviewed: 0.12
Author: Jens Steube, (Atom from Hashcat)
OS Supported: Linux, Mac, and Windows
Password Crackers Supported: It is a command line tool so it will work with any cracker that accepts input from stdin

Blog Change History:

1/4/2015: Fixed some terminology after talking to Atom
1/4/2015: Removed a part in the Algorithm Design section that talked about a bug that has since been fixed in version 0.13
1/4/2015: Added an additional test with PRINCE and JtR Incremental after a dictionary attack
1/4/2015: Added a section for using PRINCE with oclHashcat

Brief Description: 

PRINCE is a password guess generator and can be thought of as an advanced Combinator attack. Rather than taking two different dictionaries as input and then outputting all the possible two word combinations though, PRINCE takes a single input dictionary and builds "chains" of combined words. These chains can have 1 to N words from the input dictionary concatenated together. So for example, if it is outputting guesses of length four, it could generate them using combinations from the input dictionary such as the following, (I've also included a toy sketch of this idea right after the list):
4 letter word
2 letter word + 2 letter word
1 letter word + 3 letter word
1 letter word + 1 letter word + 2 letter word
1 letter word + 2 letter word + 1 letter word
1 letter word + 1 letter word + 1 letter word + 1 letter word
..... (You get the idea)
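Here's that idea as a toy Python sketch, (this is my interpretation based on observing the output, not Atom's actual implementation): group the input words into tables by length, then recursively enumerate every way element lengths can sum to the target guess length.

from collections import defaultdict

def build_tables(words):
    # Group input words, (elements), into tables by length
    tables = defaultdict(list)
    for word in words:
        tables[len(word)].append(word)
    return tables

def chains(tables, target_len, max_elems=8):
    # Yield every concatenation whose element lengths sum to target_len
    if target_len == 0:
        yield ''
        return
    if max_elems == 0:
        return
    for length, elements in tables.items():
        if length <= target_len:
            for element in elements:
                for rest in chains(tables, target_len - length, max_elems - 1):
                    yield element + rest

tables = build_tables(['a', 'at', 'cat'])
print(sorted(set(chains(tables, 4))))
# ['aaaa', 'aaat', 'aata', 'acat', 'ataa', 'atat', 'cata']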

Algorithm Design:

As of this time the source-code of PRINCE has not been released. Therefore this description is based solely on Atom's Passwords14 presentation, talking to Atom himself on IRC, and running experiments with various small dictionaries using the tool itself and manually looking at the output.

As stated in the description, PRINCE combines words from the input dictionary to produce password guesses. The first step is processing the input dictionary. Feeding it an input dictionary of:
resulted in it generating the following guesses:
...(output cut to save space)
Therefore, it's pretty obvious that the tool does not perform duplicate detection when loading a file.

Finding #1: Make sure you remove duplicate words from your input dictionary *before* you run PRINCE

After PRINCE reads in the input dictionary it stores each word, (element), in a table consisting of all the words of the same length. PRINCE then constructs chains consisting of 1 to N different elements. Right now it appears that N is equal to eight, (confirmed when using the --elem-cnt-min option). It does this by setting up structures of the different tables and then filling them out. For example with the input dictionary:
It will generate the guesses:
This isn't to say that it won't generate longer guesses, since elements can be longer than length 1. For example, with the following input dictionary:
It generates the following guesses
...(output cut to save space)
Next up, according to slide 35 of the Passwords14 talk, it appears that PRINCE should be sorting these chains according to keyspace. This way it can output guesses from the chains with the smallest keyspace first. This can be useful so it will do things like append values to the end of dictionary words before it tries a full exhaustive brute force of all eight character passwords. While this appears to happen to a certain extent, something else is going on as well. For example, with the input dictionary:
It would output the following results:
.....(Lots of results omitted).....
This is a bit of a mixed bag. While it certainly saved the highest keyspace chains for the end, it didn't output everything in true increasing keyspace order, since elements of length 1, (E1), had two items, while elements of length 2, (E2), only had one item, but it outputted E1 first. I have some suspicions that the order in which it outputs its chains is independent of how many items actually are in each element for that particular run, (aka as long as there is at least one item in each element, it is independent of your input dictionary). I don't have anything hard to back up that suspicion though beyond a couple of sample runs like the one above. Is this a problem? Quite honestly, I'm not really sure, but it is something to keep in mind. When I talked to Atom about this he said that password length compared to the average length of items in the training set also influences the order in which chains are selected, so that may have something to do with it.

Finding #2: PRINCE is not guaranteed to output all chains in increasing keyspace order, though it appears to at least make an attempt to do so
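If my reading of the slides is right, the keyspace of a chain is just the product of the sizes of the element tables it draws from. Here's a quick sketch, (with made-up table sizes), of what sorting chains by keyspace would look like:

from math import prod

# Hypothetical table sizes: sizes[L] = number of input words of length L
sizes = {1: 2, 2: 1, 3: 3, 4: 5}

def keyspace(chain):
    # A chain is a tuple of element lengths, e.g. (1, 1, 2)
    return prod(sizes[length] for length in chain)

chains = [(4,), (2, 2), (1, 3), (3, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (1, 1, 1, 1)]
for chain in sorted(chains, key=keyspace):
    print(chain, keyspace(chain))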

Additional Options:

--elem-cnt-max=NUM:  This limits the number of elements that can be combined to NUM. Aka if you set NUM to 4, then it can combine up to 4 different elements. So if you had the input word 'a' it could generate 'aaaa' but not 'aaaaa'. This may be useful to limit some of the brute forcing it does.

The rest of the options are pretty self explanatory. One request I would have is for PRINCE to save its position automatically, or at least print out the current guess number when it is halted, to make it easier to restart a session by using the "--skip=NUM" option.


Speed:

PRINCE was written by Atom so of course it is fast. If you are using a CPU cracker it shouldn't have a significant impact on your cracking session even if you are attacking a fast hash. For comparison's sake, I ran it along with JtR's Incremental mode on my MacBook Pro.

PRINCE:

run laki$ ../../../Tools/princeprocessor-0.12/  < ../../../dictionaries/passwords_top10k.txt | ./john --format=raw-sha1-linkedin -stdin one_hash.txt
Loaded 1 password hash (Raw SHA-1 LinkedIn [128/128 SSE2 intrinsics 8x])
guesses: 0  time: 0:00:02:00  c/s: 1895K  trying: asdperkins6666 - bobperkins

JtR Incremental Mode:
run laki$ ./john -incremental=All -stdout | ./john --format=raw-sha1-linkedin -stdin one_hash.txt 
Loaded 1 password hash (Raw SHA-1 LinkedIn [128/128 SSE2 intrinsics 8x])
guesses: 0  time: 0:00:00:14  c/s: 2647K  trying: rbigmmi - rbigm65

Using PRINCE with oclHashcat:

Below is a sample screen shot of me using PRINCE as input for oclHashcat on my cracking box, (it has a single HD7970 GPU). Ignore the --force option, as I had just installed an updated video card driver and was too lazy to revert back to my old one that oclHashcat supports. I was also too lazy to boot into Linux, since I was using Excel for this post and my cracking box is also my main computer...

What I wanted to point out was that for a fast hash, (such as unsalted SHA1 in this case), since PRINCE is not integrated into oclHashcat, it can't push guesses fast enough to the GPU to take full advantage of the GPU's cracking potential. In this case, the GPU is only at around 50% utilization. That is a longer way of saying that while you can still totally make use of oclHashcat when using PRINCE, it may be advantageous to also run dictionary based rules on the guesses PRINCE generates. Since those dictionary rules are applied on the GPU itself, you can make a lot more guesses per second and take full advantage of your cracking hardware. This is also something Atom recommends, and he helpfully included two different rulesets with the PRINCE tool itself.

Side note: PRINCE plows through the LinkedIn list pretty effectively. To get the screenshot above I had to run the cracking session twice, since otherwise the screen would have been filled with cracked passwords.

Big Picture Analysis:

The main question of course is how does this tool fit into a cracking session? Atom talked about how he saw PRINCE as a way to automate password cracking. The closest analogy would be John the Ripper's default behavior, where it will start with Single Crack mode, (lots of rules applied to a very targeted wordlist), move on to Wordlist mode, (a basic dictionary attack), and then try Incremental mode, (smart bruteforce). Likewise, depending on how you structure its input dictionary, PRINCE can act as a standard dictionary attack, (appending/prepending digits to input words for example), a combinator attack, (duh), and a pure brute force attack, (trying all eight character combos). It can even do a limited passphrase attack, though it gets into "Correct Horse Battery Staple" keyspace issues then. For example, with the input dictionary of:
It will generate all four word combinations such as:


When talking about passphrase attacks then, keep in mind it doesn't have any advanced logic, so you are really doing a full keyspace attack of all the possible combinations of words.
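To see how fast that keyspace blows up, consider chaining four elements from an N-word dictionary, which gives N^4 possible guesses:

# Four-word chains from an N-word dictionary = N**4 combinations
for n in (100, 1000, 10000):
    print(n, n ** 4)
# 100 -> 100 million, 1000 -> 1 trillion, 10000 -> 10 quadrillion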

The big question then is how does it compare against other attack modes when cracking passwords? You know what this means? Experiments and graphs!

I decided I would base my first couple of comparisons on the demos Atom had listed in his slides as a starting point. I figure no one knows how to use PRINCE better than he does. Note: these are super short runs. While I could explain that away by saying this simulates targeting a slow hash like bcrypt, the reality is Atom made some noticeable changes to PRINCE while I was writing this post, (yay slow update schedule). I figured it would be good to make some quick runs with the newer version to get a general idea of how PRINCE performs, and then post a more realistic length run at a later time. Also, this way I can get feedback on my experiment design so I don't waste time running a longer cracking session on a flawed approach.

Experiment 1) PRINCE, Hashcat Markov mode, and JtR Incremental mode targeting the MySpace list

Experiment Setup:
The input dictionary for PRINCE was the top 100k most popular passwords from the RockYou list, as this is what Atom used. For Hashcat I generated a stats file on the full RockYou list and used a limit of 16. For JtR I ran the default Incremental mode using the "All" character set. The target list was the old MySpace list. The reason why I picked that vs the Stratfor dataset which Atom used was simply because there are a ton of computer generated passwords, (aka default passwords assigned to users), in the Stratfor dataset, so it can be a bit misleading when used to test against.

Cracking Length: 1 billion guesses

Commands used:
laki$ ../../../Tools/princeprocessor-0.12/ < ../../../dictionaries/Rockyou_top_100k.txt | python -t ../../../Passwords/myspace.txt -m 1000000000

laki$ ../../../John/john-1.7.9-jumbo-7/run/john -incremental=All -stdout | python -t ../../../Passwords/myspace.txt -m 1000000000

laki$ ../../../hashcat/statsprocessor-0.10/ --threshold=16 ../../../hashcat/statsprocessor-0.10/hashcat.hcstat | python -t ../../../Passwords/myspace.txt -m 1000000000

Experiment Results:

Click on the graph for a zoomed in picture. As you can see, PRINCE did really well starting out but then quickly became less effective. This is because it used most, (if not all), of the most common words in the RockYou list first, so it acted like a normal dictionary attack. At the same time, Incremental mode was starting to catch up by the end of the run. While I could continue to run this test over a longer cracking session, this actually brings up the next two experiments....

Experiment 2) PRINCE and Dictionary Attacks targeting the MySpace list

Experiment Setup:
This is the same as the previous test targeting the MySpace dataset, but this time using dictionary attacks. For JtR, I stuck with the default ruleset and the more advanced "Single" ruleset. I also ran a test using Hashcat and the ruleset Atom included along with PRINCE, (prince_generated.rule). For all the dictionary attacks, I used the RockYou top 100k dictionary to keep them comparable to the PRINCE attack.

Cracking Length: I gave each session up to 1 billion guesses, but the two JtR attacks were so short that I only displayed the first 100 million guesses on the graph so they wouldn't blend in with the Y-axis. The Hashcat attack used a little over 700 million guesses, so I annotated its final results on the graph. Side note, (and this merits another blog post), but Hashcat performs its cracking sessions in word order, vs JtR's rule order. I suspect this is to make Hashcat faster when cracking passwords using GPUs. You can read about the difference between those two modes in one of my very first blog posts back in the day. What this means is that Hashcat's cracking sessions tend to be much less front loaded unless you take the time to run multiple cracking sessions using smaller mangling rulesets.
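To illustrate the difference between the two loop orders, (with toy words and rules standing in for real wordlists and rulesets):

words = ['password', 'monkey']
rules = [str.lower, str.capitalize, lambda word: word + '1']

# Rule order, (JtR style): run one mangling rule across every word before
# moving to the next rule, which front-loads the most probable guesses
for rule in rules:
    for word in words:
        print(rule(word))

# Word order, (Hashcat style): apply every rule to one word before moving
# on to the next word, which is friendlier to GPU batching
for word in words:
    for rule in rules:
        print(rule(word))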

Commands used:
laki$ ../../../Tools/princeprocessor-0.12/ < ../../../dictionaries/Rockyou_top_100k.txt | python -t ../../../Passwords/myspace.txt -m 1000000000

laki$ ../../../John/john-1.7.9-jumbo-7/run/john -wordlist=../../../dictionaries/Rockyou_top_100k.txt -rules=wordlist -stdout | python -t ../../../Passwords/myspace.txt -m 1000000000

laki$ ../../../John/john-1.7.9-jumbo-7/run/john -wordlist=../../../dictionaries/Rockyou_top_100k.txt -rules=single -stdout | python -t ../../../Passwords/myspace.txt -m 1000000000

laki$ ../../../hashcat/hashcat-0.48/ --stdout -a 0 -r ../../../Tools/princeprocessor-0.12/prince_generated.rule ../../../dictionaries/Rockyou_top_100k.txt | python -t ../../../Passwords/myspace.txt -m 1000000000

Experiment Results:

As you can see, all of the dictionary attacks performed drastically better than PRINCE over the length of their cracking sessions. That's to be expected, since their rulesets were crafted by hand while PRINCE generates its rules automatically on the fly. I'd also like to point out that once the normal dictionary attacks are done, PRINCE keeps on running. That's another way of saying that PRINCE still has a role to play in a password cracking session even if standard dictionary attacks initially outperform it. All this test points out is that if you are going to run a shorter cracking session, you would be much better off running a normal dictionary based attack instead of PRINCE. This does lead to my next question and test though. After you run a normal dictionary attack, how does PRINCE do in comparison to a Markov brute force based attack?

Experiment 3) PRINCE and JtR Wordlist + Incremental mode targeting the MySpace list

Experiment Setup:
Based on feedback from Atom I decided to restructure this next test. First of all, Atom recommended using the full RockYou list as an input dictionary for PRINCE. Since that is a larger input dictionary than just the 100k most frequent passwords, I re-ran JtR's Single mode ruleset against the MySpace list using the full RockYou dictionary as well. I also used the most recent version of JtR, 1.8.0-jumbo-1, based on the recommendation of Solar Designer. This cracked a total of 23,865 passwords from the MySpace list, (slightly more than 64%). I then ran PRINCE, (the newer version 0.13), with the full RockYou dictionary, (ordered), and JtR Incremental=UTF8, (equivalent to "All" in the older version of JtR), against the remaining uncracked passwords. I also increased the cracking time to 10 billion guesses.

Side note: I ran a third test PRINCE using the RockYou top 100k input dictionary as well since the newer results were very surprising. I'll talk about that in a bit...

Cracking Length: 10 billion guesses

Commands used:
laki$ ../../../John/john-1.8.0-jumbo-1/run/john -wordlist= ../../../dictionaries/Rockyou_full_ordered.txt -rules=single -stdout | python -t ../../../Passwords/myspace.txt -u uncracked_myspace.txt

laki$ ../../../Tools/princeprocessor-0.13/ < ../../../dictionaries/Rockyou_full_ordered.txt | python -t ../../../Passwords/uncracked_myspace.txt -m 10000000000 -c 23865

laki$ ../../../Tools/princeprocessor-0.13/ < ../../../dictionaries/Rockyou_top_100k.txt | python -t ../../../Passwords/uncracked_myspace.txt -m 10000000000 -c 23865

laki$ ../../../John/john-1.8.0-jumbo-1/run/john -incremental=UTF8 -stdout | python -t ../../../Passwords/uncracked_myspace.txt -m 10000000000 -c 23865

Experiment Results:

I'll guiltily admit that before running this test I hadn't been that impressed with PRINCE. That's because I had been running it with the top 100k RockYou dictionary. As you can see, with the smaller dictionary it performed horribly. When I ran the new test with the full RockYou dictionary though, PRINCE did significantly better than an Incremental brute force attack. Yes, cracking 1.5% more of the total set might not seem like much, but it will take Incremental mode a *long* time to catch up to that. Long story short, PRINCE's effectiveness is extremely dependent on the input dictionary you use for it.

Like most surprising test results, this opens up more questions than it answers. For example, what exactly is going on with PRINCE to make it so much more effective with the new dictionary? My current hypothesis is that it is emulating a longer dictionary attack, but I need to run some more tests to figure out if that's the case or not. Regardless, these results show that PRINCE appears to be a very useful tool to have in your toolbox, if you use the right input dictionary for it.

Current Open Questions:

  1. What is the optimal input dictionary to use for PRINCE? Yes the full RockYou input dictionary does well but my gut feeling is we can do better. That leads me to the next open question...
  2. Can we make PRINCE smarter? Right now it transitions between dictionary attacks and brute force automatically, but beyond sorting the chains by keyspace it doesn't have much advanced logic in it. Perhaps if we can better understand what makes it effective we can make a better algorithm that is even more effective than PRINCE.


Tuesday, December 9, 2014

Don't call it a comeback ... Ok, maybe it is

Has it really been four years since my last post!? Well I guess it has! So a better question is: "What has changed for me to bring this blog back?"

Well last February I got rid of my apartment, took leave of my job, and hiked all 2185.3 miles of the Appalachian Trail from Georgia to Maine. During that time I realized how much blogging meant to me. Long story short, I've been working with my employer to figure out how I could restart this blog now that I'm back.

I don't want to spend too much time on this news post but I might as well end it on one more question: "What should you expect?" That's up in the air right now and I plan on remaining flexible, but I feel one thing I have to contribute to the research community is the fact that I really do enjoy constructing and running experiments. I may not be the best coder, l33t3st password cracker, or have a million dollar cracking setup, but I do have mad Excel skills and I love digging into algorithms. Right now I'm investigating the new PRINCE tool from Atom, creator of Hashcat, (you can get it here), so hopefully I should have a post up about it in a couple of days.

Balance, I have it

Saturday, October 30, 2010

CCS Paper Part #2: Password Entropy

Round Peg, Square Hole
This is part #2 in a (mumble, cough, mumble) part series of posts discussing the results published in the paper I co-authored on the effectiveness of password security metrics. Part #1 can be found here.

I received a lot of insightful comments on the paper since my last post, (one of the benefits of having a slow update schedule), and one thing that stands out is people really like the idea of password entropy. Here’s a good example:
“As to entropy, I think it would actually be a good measure of password complexity, but unfortunately there's no way to compute it directly. We would need a password database comparable in size (or preferably much larger than) the entire password space in order to be able to do that. Since we can't possibly have that (there are not that many passwords in the world), we can't compute the entropy - we can only try to estimate it in various ways (likely poor)”
First of all I want to thank everyone for their input and support as I really appreciate it. This is one of the few cases though where I’m going to have to disagree with most of you. In fact, as conceited as it sounds, my main takeaway has been that I've done a poor job of making my argument, (or I’m wrong which is always a possibility). So the end result is another post on the exciting topic of password entropy ;)

When I first started writing this post, I began with a long description on the history of Shannon Entropy, how it’s used, and what it measures. I then proceeded to delete what I had written since it was really long, boring, and quite honestly not that helpful. All you need to know is:
  1. Claude Shannon was a smart dude.
  2. No seriously, he was amazing; He literally wrote the first book on modern code-breaking techniques.
  3. Shannon entropy is a very powerful tool used to measure information entropy/ information leakage.
  4. Another way of describing Shannon entropy is that it attempts to quantify how much information is unknown about a random variable.
  5. It’s been effectively used for many different tasks; from proving one time pads secure, to estimating the limits of data compression.
  6. Despite the similar sounding names, information entropy and guessing entropy are not the same thing.
  7. Yes, I’m actually saying that knowing how random a variable is doesn’t tell you how likely it is for someone to guess it in N number of guesses, (with the exception of the boundary cases where the variable is always known – aka the coin is always heads- or when the variable has an even distribution – aka a perfectly fair coin flip).
Ok, I'll add one more completely unnecessary side note about Shannon entropy. Ask a crypto guy, (or gal), if the Shannon entropy of a message encrypted with a truly random and properly applied one time pad is equal to the size of the key. If they say "yes", point and laugh at them. The entropy is equal to that of the original message, silly!

Hey, do you know how hard it is to make an entropy related joke? I’m trying here…

Anyways, to calculate the entropy of a variable you need to have a fairly accurate estimate of the underlying probabilities of each possible outcome. For example a trick coin may land heads 70% of the time, and tails the other 30%. The resulting Shannon entropy is just a summation of the probability of each event multiplied by the log2 of its probability, (and then multiplied by -1 to make it a positive value). Aka:

H(X) = -( p(x1) x log2(p(x1)) + p(x2) x log2(p(x2)) + ... + p(xn) x log2(p(xn)) )

So the Shannon entropy of the above trick coin would be -(.7 x log2(.7) + .3 x log2(.3)), which is equal to 0.8813 bits. A completely fair coin flip's entropy would be equal to 1.0. In addition, the total entropy of different independent variables is additive. This means the entropy of flipping the trick coin and then the fair coin would be .8813 + 1.0 = 1.8813 bits worth of entropy.
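If you want to check the math yourself, here's the same calculation in a few lines of Python:

from math import log2

def shannon_entropy(probs):
    # H = -sum( p * log2(p) ) over all outcomes with p > 0
    return -sum(p * log2(p) for p in probs if p > 0)

print(shannon_entropy([0.7, 0.3]))  # trick coin: ~0.8813 bits
print(shannon_entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(shannon_entropy([0.7, 0.3]) + shannon_entropy([0.5, 0.5]))  # ~1.8813 bits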

I probably should have put a disclaimer above to say that you can live a perfectly happy life without understanding how entropy is calculated.

The problem is that while the Shannon entropy of a system is determined using the probability of the different outcomes, the final entropy measurement does not tell you about the underlying probability distribution. People try to pretend it does though, which is where they get into trouble. Here is a picture, (and a gratuitous South Park reference), that I used in my CCS presentation to describe NIST’s approach to using Shannon entropy in the SP800-63 document:

Basically they take a Shannon entropy value, assume the underlying probability distribution is even, and go from there. Why this is an issue is that when it comes to human generated passwords, the underlying probability distribution is most assuredly not evenly distributed. People really like picking “password1”, but there is always that one joker out there that picks a password like “WN%)vA0pnwe**”. That’s what I’m trying to say when I show this graph:

The problem is not that the Shannon value is wrong. It’s that an even probability distribution is assumed. To put it another way, unless you can figure out a method to model the success of a realistic password cracking session using just a straight line, you’re in trouble.

Let me make this point in another way. A lot of people get hung up on the fact that calculating the underlying probability distribution of a password set is a hard problem. So I want to take a step back and show you this holds true even if that is not the case.

For an experiment, I went ahead and designed a variable that has 100 possible values that occur at various probabilities, (thanks Excel). This means I know exactly what the underlying probability distribution is. This also means I’m able to calculate the exact Shannon entropy as well. The below graph shows the expected guessing success rate against one such variable compared to the expected guessing success generated by assuming the underlying Shannon entropy had an even distribution.

Now tell me again, what useful information does the Shannon entropy value tell the defender about the resistance of this variable to a guessing attack? What's worse is the graph below that shows 3 different probability distributions that have approximately the same entropy, (I didn't feel like playing around with Excel for a couple of extra hours to generate the EXACT same entropy; this is a blog and not a research paper after all).

These three variables have very different resistance to cracking attacks, even though their entropy values are essentially the same. If I want to get really fancy, I can even design the variables in such a way that the variable with a higher Shannon entropy value is actually MORE vulnerable to a shorter cracking session.
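You can reproduce that effect with a few lines of Python. The distributions below are made up purely for illustration, but the pattern is the same one I built in Excel:

from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def cracked_after(probs, num_guesses):
    # An optimal attacker guesses outcomes in descending probability order
    return sum(sorted(probs, reverse=True)[:num_guesses])

skewed = [0.3] + [0.01] * 70   # one very popular value plus a long uniform tail
uniform = [1.0 / 20] * 20      # an evenly distributed variable

print(entropy(skewed), cracked_after(skewed, 1))    # ~5.17 bits, 30% cracked
print(entropy(uniform), cracked_after(uniform, 1))  # ~4.32 bits, 5% cracked
# The HIGHER entropy variable is more vulnerable to a short guessing attack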

This all comes back to my original point that the Shannon entropy doesn't provide "actionable" information to a defender when it comes to selecting a password policy. Even if you were able to perfectly calculate the Shannon entropy of a password set, the resulting value still wouldn't tell you how secure you were against a password cracking session. What you really want to know as a defender is the underlying probability distribution of those passwords instead. That's something I've been working on, but I'll leave my group's attempts to calculate that for another post, (hint: most password cracking rule-sets attempt to model the underlying probability distribution because they want to crack passwords as quickly as possible).

Thursday, October 7, 2010

New Paper on Password Security Metrics

I'm in Chicago at the ACM CCS conference, and the paper I presented there: "Testing Metrics for Password Creation Policies by Attacking Large Sets of Revealed Passwords", is now available online.
Since I had the paper and presentation approved through my company's public release office I was given permission to blog about this subject while the larger issue of my blog is still going through the proper channels. Because of that I'm going to limit my next couple of posts to this subject rather than talking about the CCS conference as a whole, but let me quickly point you to the amazing paper "The Security of Modern Password Expiration: An Algorithmic Framework and Empirical Analysis", written by Yinqian Zhang, Fabian Monrose and Michael Reiter. In short, they managed to obtain a great dataset, their techniques were innovative and sound, and there's some really good analysis on how effective password expiration policies really are, (spoiler: forcing users to change their password every six months isn't very useful).

I'd like to first start by acknowledging the other authors who contributed to the "Testing Password Creation Metrics..." paper.
  • Dr. Sudhir Aggarwal - Florida State University: My major professor, who spent I don't know how many hours helping walk me through the subtle intricacies of information entropy.
  • Michael Collins - Redjack LLC: Another data driven researcher, and much cooler than me since he uses GNUPlot instead of Excel ;)
  • Henry Stern - Cisco IronPort: He was the driving force behind getting this paper written. It was over lunch at the Microsoft Digital Crime Consortium, (it's a conference to combat cybercrime, and not a group of people from Microsoft looking to commit digital crime like the name implies...), that the framework for this paper was laid out.
As for the contents of the paper, I'm planning on breaking the discussion about it down into several different posts, with this post here being more of an overview.

When writing this paper, we really had two main goals:
  1. How does the NIST model of password entropy as defined in SP800-63 hold up when exposed to real password datasets and realistic attacks?
  2. How much security is actually provided by typical password creation policies, (aka minimum length, character requirements, blacklists)?
Based on our results, we then looked at the direction we would like password creation policies to move in the future. This ended up with us suggesting how to turn our probabilistic password cracker around and instead use it as part of a password creation strategy that allows people to create passwords however they like, as long as the probability of the resulting password remains low.

Of all that, I feel our analysis of the NIST password entropy model is actually the most important part of the paper. I know it sounds like an esoteric inside baseball subject, but the use of NIST's password entropy model has a widespread impact on all of us. This is because it provides the theoretical underpinning for most password creation policies out there. Don't take my word for how widespread the use of it is. Check out the Wikipedia article on password strength, (or better yet, read the discussion page) for yourself.

Our findings were that the NIST model of password entropy does not match up with real world password usage or password cracking attacks. If that wasn't controversial enough, we then made the even more substantial claim that the current use of Shannon entropy to model the security provided by human generated passwords at best provides no actionable information to the defender. At worst, it leads to a defender having an overly optimistic view of the security provided by their password creation policies, while at the same time resulting in overly burdensome requirements for the end users.

Getting in front of a room full of crypto experts and telling them that Shannon Entropy wasn't useful to evaluate the security of password creation policies and "We REALLY need to STOP using it", was a bit of a gut clenching moment. That's because the idea of information entropy is fairly central to the evaluation of most cryptographic algorithms. I would have never done it except for the fact that we have a lot of data backing this assertion up. The reason we are making the broader point is because it's tempting to dismiss the flaws in the NIST model by saying that NIST just estimated the entropy of human generated passwords wrong. For example, if you juggle the constants around or perhaps look at word entropy vs character entropy, things will work out. Our point though is not that you can't come up with a fairly accurate Shannon entropy model of human generated passwords. You most assuredly can. It's just that it's not apparent how such a model can provide "actionable information". In addition, the way we currently use Shannon Entropy in evaluating password security policies is fundamentally flawed.

This subject really does require another blog post, but before I head back to Boston I wanted to leave you with one of the graphs from our paper that demonstrates what I'm talking about:

The above graph shows cracking sessions run against passwords that met different minimum length password creation requirements, (aka must be at least seven characters long). The NIST estimated cracking speed is based on the calculated NIST entropy of passwords created under a seven character minimum password creation policy. You may notice that it overestimates the security of the creation policy over shorter cracking sessions, but at the same time doesn't model longer cracking sessions either. This is what I mean when I keep saying it doesn't provide "actionable intelligence", (third time and counting). When we say "password entropy" what we really want to know is the guessing entropy of a policy. Unfortunately, as a community, we keep using Shannon entropy instead. Guessing entropy and Shannon entropy are two very different concepts, but there doesn't exist a very good way of calculating the guessing entropy, while calculating the Shannon entropy of a set of text is well documented. This is part of the reason why people keep trying to use Shannon entropy instead.

So I guess I should end this post by saying, if any of this sounds interesting please read the paper ;)

Friday, September 10, 2010

Quick Status Update

This is just a quick post to let you know that I for once have a valid excuse for not updating this blog in a timely manner. I actually found a job! Thanks to everyone who offered help, recommendations and encouragement. The only catch is that right now it's being decided if I have to run my posts through our public release office or not. Don't worry, this blog is not going away regardless of the decision. It might just gain a few unwilling readers ;)

As to my new company, I'm going to keep that a bit of an open secret. This blog reflects my personal views. I certainly don't speak for them, and I plan on avoiding any topics that have to do with my day job, (Don't worry, I'm not doing any password cracking there).

Once again thanks, and I'll resume posting once I get the OK and can update this blog while complying with company policies. I just want to make sure I handle this situation the right way.

Wednesday, July 28, 2010

Defcon Crack Me if You Can Competition

I'd be remiss if I didn't spend a little time talking about the "Crack Me if you Can" competition at Defcon. It's really been amazing the amount of interest that this contest is drumming up. People are excited; it seems like everyone is refining their mangling rules, putting together new wordlists, and finishing up various password cracking tools. The impact that this is having on the password cracking community as a whole is hard to overstate. Needless to say, I'm a fan of that, and I have a ton of respect for Minga and the folks at KoreLogic for putting this together.

I'll be participating, though I certainly don't plan on winning. What I'm really looking forward to though is the chance to meet with everyone else and learn what other people are doing. I'm hoping this turns into an event like the lockpicking village, with the contest being almost beside the point. Of course I might be saying that because I'm going to get creamed as well...

Since I've had a few people ask me about the competition itself, here's my two cents. My biggest concern is that the passwords we will be cracking aren't real. This isn't a criticism. There's no way you could run this competition with real corporate passwords, (well, legally that is...). It's just something to keep in mind. What will be interesting though is applying the techniques learned from the winner, (part of the rules are that you have to disclose your cracking techniques), to other datasets as they become available. That's why I have this blog. I might not be the best password cracker out there, but I can certainly run other people's attacks and plot the results on Excel ;)

If I had to hazard a guess, here's some predictions of mine about the contest:

1) Most passwords will be based on relatively common dictionary words. Way more so than you would find normally.

2) Most of the cracking will center around applying the correct mangling rules. Yes there will be the 'Dictionary123' words, but I expect most 'high score' passwords will have less common rules such as 'xD1ct1onaryx'.

3) There will probably be some LANMAN passwords, so bring your rainbow tables.

4) I expect there to be so many NTLM passwords that rainbow tables for them won't be cost effective.

5) I'll be interested to see if they have any 'exotic' password hashes. WinRAR, TrueCrypt, etc.

6) It'll be a ton of fun ;)

I'll see you guys there.