Analyzing Tokenizer Part 2: OMEN + Tokenizer

 

“I have not failed. I've just found 10,000 ways that won't work”
Thomas Edison

Introduction:

This is a continuation of a deep dive into John the Ripper's new Tokenizer attack. Instructions on how to configure and run the original version of Tokenizer can be found [Here]. As a warning, those instructions need to be updated since a new version of Tokenizer has been released that makes it easier to configure. The first part of my analysis can be found [Here].

This is going to be a bit of a weird blog entry as this is a post about failure. Spoiler alert: if you are reading this post to learn how to crack passwords, just go ahead and skip it. My tests failed, my tools failed, and my understanding of my tools failed. A disappointing number of passwords were cracked in the creation of this write-up. I'll admit, I was very tempted to shelve this blog post. But I strongly believe that documenting failures is important. When reading blog posts you often don't see the messy process that is research. Stuff just doesn't work, error messages that are obvious in retrospect are missed, and tests don't always turn out the way you expect. So as you read this, understand that it's more a journal of troubleshooting research tests when they go wrong than documentation of what to do.

To put it another way, the main audience for this blog post is:
  • My future self. I'm really annoyed at myself for not better documenting some of my past work. I'll go into that in more detail later.
  • Password cracking tool developers. This post won't help you crack more passwords or develop better password security strategies. Hopefully it will help those who are developing the tools to help you crack those passwords though.
  • People who play Dwarf Fortress and like [Fun].

Question 1: Why are my Tokenizer's "first 25 guesses" different from Solar Designer's?

Background:

In response to my previous blog entry, Solar Designer wrote: "One thing that surprised me is that your top 25 for training on RockYou Full (including dupes, right?) is different from what I had posted in here at all (even if similar)." [Link].

That's a good question, and one that I had been wondering about as well. There are a couple of things that could be causing this, from the way our Linux shells handle character encoding, to the order of our training lists, to differences in our training lists. Or it could be something totally different that I'm not imaginative enough to come up with yet. At a high level, it's probably not that big of a deal since our experiences running Tokenizer attacks seem roughly the same (Solar Designer has posted tests comparing it to Incremental mode, and they roughly match what I've been seeing). But this can be a useful rabbit hole to dive down since it can expose optimizations or environmental issues that could cause problems as more people start to use this tool. There's a big gulf between "it works on my machine" and "it's easy for anyone else to run".

Conclusion Up Front:

Both Tokenizer and base Incremental mode seem resilient to the order of the passwords they are trained on, and setting 'export LC_CTYPE=C' did not seem to impact guess generation.

Bonus Finding:

When manually analyzing password guesses, DO NOT pipe the output of a password cracking session into "less". At least in my WSL Ubuntu shell, this seemed to add artifacts into the guesses I was creating, which gave me bad data. Note: this doesn't impact running actual password cracking sessions.

Instead, when using John the Ripper, make use of the "--max-candidates" option. For example:

  • ./john --incremental=tokenize1 --external=untokenize1 --stdout --max-candidates=25
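
If you want to compare the guess streams from two different configurations (for example, a sorted vs. an unsorted training set) without eyeballing them in a terminal, a small script works too. The following is just a minimal sketch, assuming Python 3 and placeholder mode names; it is not part of any of the tools mentioned here:

    import subprocess

    def first_guesses(inc_mode, ext_mode, n=25):
        # Capture the first n candidates directly, avoiding any pager artifacts
        cmd = ["./john", f"--incremental={inc_mode}", f"--external={ext_mode}",
               "--stdout", f"--max-candidates={n}"]
        return subprocess.run(cmd, capture_output=True, text=True).stdout.splitlines()

    # "tokenize1"/"tokenize2" and their external modes are placeholders for two training runs
    a = first_guesses("tokenize1", "untokenize1")
    b = first_guesses("tokenize2", "untokenize2")
    print("identical" if a == b else "different")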

Discussion:

This was an area where my analysis setup really let me down, so I chased a lot of unproductive leads before I was able to find the ground truth. For my first test I sorted RockYou1 and compared a Tokenizer attack trained on it to a Tokenizer attack trained on an unsorted RockYou1 training set. Initially they appeared to generate different guesses. For example:

This led down an unproductive rabbit hole where I ended up generating a lot of different character sets for Incremental mode to try to track down what was causing the differences in guess generation. It wasn't until I got really frustrated and ran a "diff" on the different .chr files that Incremental mode uses, and found they were EXACTLY THE SAME, that I realized the problem might be in how I was displaying the guesses.
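
For future reference, that .chr sanity check can also be scripted so I don't have to rely on eyeballing a diff. A minimal sketch (the file names are placeholders for whatever your two training runs produced):

    import hashlib

    def file_digest(path):
        # Hash the file in chunks so large .chr files don't need to fit in memory
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    # Placeholder file names for the two training runs being compared
    print(file_digest("sorted_rockyou1.chr") == file_digest("unsorted_rockyou1.chr"))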

Still, I learned a few new things, and improved my testing process. So it wasn't a complete waste.

Question 2: How does the PCFG OMEN mode attack differ from the original Rub-SysSec version?

Background:

This question was inspired by a comment by @justpretending on the Hashcat Discord channel.

OMEN stands for Ordered Markov ENumerator and the original paper/implementation can be found [Here]. I became interested in it after it was presented at PasswordsCon, where it was shown to be more effective than my PCFG attack and could generate guesses at speeds that made it practical. That's certainly one way to get my attention! To better understand the OMEN attack I took the Rub-SysSec OMEN code and re-implemented it in Python. The standalone version of the Python code (py-omen) is still available [Here]. Liking what I saw, I then replaced the existing Markov attack (based on JtR's --Markov mode) in the PCFG toolset with OMEN for the PCFG version 4 rewrite.

That's a lot of words to say that while the different implementations weren't 1-to-1, I expected my version of OMEN to be "mostly" similar to the original Rub-SysSec version. But it appears there are differences, so let's look into them!

Challenges with Unsupported Tools:

The first challenge I ran into was getting the stand-alone version of py-omen to run. For example, I get the following error when I try to generate guesses with py-omen:

I vaguely remember having to update my ConfigParser calls in the PCFG toolset, so that error tracks. My guess is that if you ran py-omen with Python 3.6 it would work, but it looks like it isn't compatible with Python 3.12. While it is tempting to fix this bug now, since having py-omen working would be nice for doing more experimentation with OMEN, it's really outside the scope of this investigation and the error message brings me joy. Long story short, I'm going to defer that work until a later point.
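
For my future self: the usual culprit when older Python config-parsing code breaks like this is the pre-3.12 ConfigParser API (SafeConfigParser and readfp() were both removed in Python 3.12). I haven't confirmed that this is exactly what py-omen is hitting, but the typical fix looks something like the following sketch (the config file name is a placeholder):

    import configparser

    # Older style, removed in Python 3.12:
    #   parser = configparser.SafeConfigParser()
    #   parser.readfp(open("omen.cfg"))

    # Modern equivalent:
    parser = configparser.ConfigParser()   # SafeConfigParser -> ConfigParser
    with open("omen.cfg") as config_file:
        parser.read_file(config_file)      # readfp() -> read_file()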

The important tool to run though is the C OMEN version developed by Rub-SysSec. The "make" build process worked without a hitch, but when I tried to run it the following error was displayed:

Looking into the open issues for the code, I found one [Link] that highlighted that this problem occurs when you run it on an Ubuntu system. I verified this happens both with a WSL install of Ubuntu and with a copy of Ubuntu running on bare metal. When I installed a Debian WSL environment though, I was able to get the original OMEN code to work.

Another challenge I ran into with the Rub-SysSec OMEN code was that by default it only generates 1 billion guesses and then stops. I "believe" you can override this using the "-e" endless flag, but I didn't figure this out until I had run my tests, so the following tests only display a 1 billion guess session vs. the 5 billion guess sessions I used in my previous blog posts.

Test 2) How does the original OMEN code perform compared to the PCFG OMEN code?

Test 2 Design:

After training the Rub-SysSec OMEN code on the RockYou1 training set (1 million random RockYou passwords), an attack will be run using a variation of the following command. Disclaimer: I didn't actually use the "-e" flag, but I'm including it here to make it easier the next time I need to copy/paste a command from this blog into a terminal.
  • ./enumNG -p -e | python3 ../Password_Research_Tools/checkpass.py -m 1000000000 -t ../../research/password_lists/rockyou/rockyou_32.txt -o ../../research/blog/tokenize/OG_omen_rock1_rock32_stats.txt

You'll notice I don't specify a training ruleset for enumNG, since the Rub-SysSec code only supports one training set at a time (aka you need to retrain it every time you want to use a different ruleset).

As to the target sets, I'm going to run enumNG against both the RockYou32 test set (a different set of 1 million passwords from RockYou), and the LinkedIn password dump that I used in the previous blog posts.

Test 2 Results:

Test 2 Discussion:

While I expected to see differences between the PCFG OMEN and the Rub-SysSec OMEN, I was still surprised by how much they differed. I obviously made some improvements in the PCFG version of OMEN while totally forgetting what they were. As you can see from these tests, the original Rub-SysSec OMEN performs comparably to the new JtR Tokenize attack (the original OMEN did better on RockYou, but roughly the same or worse against LinkedIn).

The PCFG OMEN did much better though. These two attacks should be "mostly" the same! This difference in performance is like a grain of sand in my boot, and I'd really like to better understand what makes them different. You'll notice both attacks have the "sawtooth" pattern though, so there's certainly room for optimizing the underlying OMEN attack regardless of the implementation.

My first thought was that these two tools used different Markov orders (or NGrams). Both tools let you specify the NGram length via command-line options during training, so having different default settings was a likely source of differences. Unfortunately, when looking at their settings, both the Rub-SysSec and the PCFG OMEN use a default of NGram=4 (the same as a 3rd order Markov chain). So that's ruled out.
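
To spell out what that setting means, NGRAM=4 (a 3rd order Markov chain) conditions each character on the three characters before it. The following toy sketch is just an illustration of the concept, not either tool's actual training code:

    from collections import Counter, defaultdict

    def ngram4_counts(passwords):
        # Count how often each character follows each 3-character prefix
        counts = defaultdict(Counter)
        for pw in passwords:
            for i in range(len(pw) - 3):
                counts[pw[i:i+3]][pw[i+3]] += 1
        return counts

    counts = ngram4_counts(["password", "passw0rd", "pass1234"])
    # Estimated P('w' | "ass") from this toy training set
    print(counts["ass"]["w"] / sum(counts["ass"].values()))   # -> 2/3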

Another source of differences could be the alphabets each OMEN attack uses. The alphabet is the set of characters OMEN selects from when generating guesses. One change I made to the PCFG OMEN code was to allow for a variable number of characters in the alphabet based on the training data (as well as support for different character encodings such as UTF-8). You can see the differences between the two tools, both trained on the RockYou1 training data, below:

While the different alphabets are probably the cause of some of the differences, given that the "extra" characters in the PCFG OMEN are unlikely to be in many passwords, this doesn't explain the entire difference. My current theory is that the probability smoothing and how the PCFG toolset uses the "Initial Probability" (IP) may be the source of many of the other differences. Side note: neither Rub-SysSec nor PCFG OMEN uses "Ending Probability" (EP) by default.
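
As a second side note, the "variable alphabet" idea above boils down to counting which characters actually show up in the training data and keeping the most common ones. This is a hypothetical sketch of the concept, not the PCFG trainer's actual code:

    from collections import Counter

    def derive_alphabet(passwords, max_size=100):
        # Keep the most frequently seen characters, up to max_size of them
        char_counts = Counter()
        for pw in passwords:
            char_counts.update(pw)
        return "".join(ch for ch, _ in char_counts.most_common(max_size))

    print(derive_alphabet(["password1!", "letmein", "abc123"]))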

What is very annoying, though, is that I don't have any notes on what I did differently for smoothing, so to better understand this I need to dig back into the original Rub-SysSec version of OMEN as well as my own code. This is something I need to research, but I'm going to defer most of that investigation to a later blog post.

TLDR: The Rub-SysSec and PCFG toolsets both use the OMEN attack, but there are implementation differences which cause them to behave very differently.

Question 3) Can the Tokenize approach be applied to the PCFG OMEN attack?

This test was brought up by Solar Designer on the john-users mailing list, and we discussed what this attack might look like [Here]. The proposed approach to test this can be summed up as follows:

  1. Use the tokenize.pl script to "modify" the training data with variable order Markov chains and create the "untokenize" JtR external mode, just like when running a normal tokenize attack with JtR's Incremental mode.
  2. Train a PCFG OMEN attack on this "modified" training data.
  3. Use the PCFG OMEN attack to generate guesses, but before hashing/checking the password guesses, pipe the guesses into JtR --stdin mode to apply the --external=untokenize logic to convert the tokenize placeholder characters back to the multi-character strings.

This should be a good enough "smoke test" to see if there might be value in adding tokenize support to the main PCFG toolset.

Test 3 Training:

For this test, I'm finally updating my Tokenize code to the latest version in the John the Ripper Bleeding Jumbo GitHub repository. To make things easier to compare against previous runs, I'll be training the new Tok-OMEN version on the RockYou1 1 million training subset. Here are the commands I used:

  1. Run tokenize.pl to generate the sed script and the external mode code:
    • ./tokenize.pl rockyou_1.txt
  2. Convert the training file to make use of tokens by piping it through the sed command. Note: The new tokenizer sed script also converts the output to JtR pot format.
    • IMPORTANT: Remove the s/^/:/ from the end of the sed command so the results will be saved as a password-only file. This is because the pcfg trainer no longer has the ability to train from a potfile (long story: it was a buggy feature. I need to revisit this feature as it is useful).
    • cat rockyou_1.txt | sed 'REALLY_LONG_SED_COMMAND' > tok_omen_rockyou1.training
  3. Also copy the External mode script generated by tokenize.pl into your john-local.conf file as:
    • [List.External:Untokenize_omen]
  4. Run the PCFG trainer on the tokenized training file.
    • -t = Training file. -r = Rule name. -c 0.0 = Only generate OMEN guesses
    • python3 trainer.py -t tok_omen_rockyou1.training -r tok_omen -c 0.0

This appeared to work correctly, as can be seen when I view the CP.level file (basically the 4-letter substrings OMEN uses for NGRAM=4) in Visual Studio Code:

Test 3 Design:

For a target set of passwords, I'm going to use the same RockYou32 1 million subset of passwords (different from the training passwords), and the LinkedIn password dump. This will allow me to directly compare this attack to the previous attacks I ran.

To test the attack and generate the first 25 guesses I used the following command. Please note: it is very important to pipe the output of the pcfg_guesser into JtR to make use of the associated untokenize_omen external mode.

  •  python3 pcfg_guesser.py -r tok_omen | JohnTheRipper/run/john --stdin --external=untokenize_omen --stdout --max-candidates=25
The first 25 generated guesses were:
  1. ((Blank))
  2. k05icty
  3. butt
  4. 13babyg
  5. john
  6. 6543
  7. 5555
  8. b05ic1800ol1
  9. bigb
  10. 18chsadfg
  11. fu13cky
  12. p18chsasw
  13. blue
  14. h0708ok
  15. 13mommy
  16. s13coob
  17. budd
  18. augu
  19. 7777
  20. 6666
  21. sn13oop
  22. 07or765
  23. 13compu
  24. lu13cky
  25. mysg

Note: Since OMEN works using "levels", it doesn't generate the most probable guesses first. Instead it generates all guesses at a specific level first. So for level 1 there are 104 guesses that can be created with this training set. You can see the keyspace per level in the PCFG rules file under Omen/omen_keyspace.txt. Still, this looks weird, and [[Spoiler Alert]] it indicated a deeper problem with this attack run.
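
To expand on the level idea: each transition OMEN can make gets an integer level (roughly a discretized measure of how improbable it is), a guess's level is the sum of the levels of the transitions used to build it, and every level-0 guess is produced before any level-1 guess, and so on. The discretization below is a made-up example of the concept, not the formula either implementation actually uses:

    import math

    def to_level(probability, max_level=10):
        # Hypothetical discretization: more probable transitions get lower levels
        return min(max_level, round(-math.log2(max(probability, 1e-9))))

    print(to_level(0.5))    # a very common transition -> level 1
    print(to_level(0.01))   # a much rarer transition  -> level 7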

To actually run (and record) the attack I used the following command for the RockYou32 test set:

  • python3 pcfg_guesser.py -r tok_omen | ./john --stdin --external=untokenize_omen --stdout | python3 checkpass.py -t rockyou_32.txt --max_guesses 5000000000 -o tokomen_rock1_rock32_stats.txt

Test 3 Results:

Test 3a Discussion:

I ended up not running the test against the LinkedIn passwords, because ... Yikes. The TokOMEN attack only cracked 20,418 passwords. I was actually expecting it to struggle based on the first 25 guesses it created, but this was way worse than I expected.

As a quick check, I ran two short attacks (10 million guesses). One was a "normal" TokOMEN attack as above, and the other was an intentionally "broken" TokOMEN attack that didn't use the JtR External mode. I expected the second "broken" attack to totally fail since the tokenized guesses would be "junk", but it would at least tell me if the JtR External mode was working. The results were the "normal" attack cracking 15,920 passwords and the "broken" attack cracking 14,934 passwords. So the JtR External mode appears to be working, to a degree.

Still, not great. Thinking it might be an issue with the current NGRAM setting, I reran the PCFG trainer using -n 2 (NGRAM=2). That's when I looked at the trainer output and noticed something going horribly wrong...

TLDR: The file encoding autodetect was going south due to the new tokens in the training data. This caused the pcfg trainer to horribly misread many of the training passwords. The real results are even worse than the errors suggested; since there are a million total training passwords, many of the different training passwords were also incorrectly "merged". No wonder things went so wrong! That'll teach me to ignore error messages that past-me put into the code.
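
I believe the trainer uses a chardet-style detector for this, and the tokens (which reuse otherwise rare byte values) are exactly the kind of input that throws such detectors off. A hypothetical demonstration of the effect (the byte values below are made up, not the actual tokens tokenize.pl picked):

    import chardet

    plain = b"password123\nletmein\nabc123\n"
    tokenized = b"pass\x91123\nlet\x92\nabc\x93\n"   # made-up high-byte "tokens"

    print(chardet.detect(plain))       # pure ASCII input is detected as ascii
    print(chardet.detect(tokenized))   # the stray high bytes push the guess toward a
                                       # legacy 8-bit encoding, often with low confidence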

Setting the encoding to UTF-8 or ASCII made ... negative progress. The PCFG trainer was now rejecting most of the tokenized training data. Looking at my code, I quickly realized the cause of most of these errors was my "dirty training dataset reject function". Basically, the tokenized training data looks like all the "junk" that normally shows up in real password dumps. The PCFG trainer includes logic to reject these "junk" passwords to generate more effective rulesets for cracking real passwords. Below is an example of some of the logic that the PCFG trainer uses to clean up training datasets.

Removing the sanity checks on the training data helped a bit, but that ended up causing a problem when the PCFG trainer tried to save the ruleset and write the OMEN data to disk:

Messing around with various options, I was able to get "different" errors, but no successful training runs. So there currently isn't an easy way to apply the Tokenizer attack to the current PCFG OMEN code.

Based on this, I started looking at the Rub-SysSec OMEN code, but it had similar issues. These issues were also compounded by the fact that the Rub-SysSec OMEN alphabet is hardcoded. So there isn't an easy option with that toolset either.

Test 3 Conclusion:

There currently doesn't exist an easy way to apply the Tokenizer attack to current OMEN implementations. I think there is a lot of potential in incorporating "variable Markov order" aka "variable NGRAM" functionality into OMEN. I'm not convinced that a Tokenizer trainer such as tokenize.pl is the best way to go about that though, considering password length is a component in the OMEN level calculations. But I think the lessons learned by looking at tokenize.pl and seeing it applied to JtR's Incremental mode can be applied to however this approach is incorporated into OMEN.

Question 4) When performing research (or running real cracking sessions), what is a "good" ruleset and dictionary to use?

This question was inspired by the following comment/question I received on the Hashcat Discord channel.

The follow-up discussion led to a really good conversation where @Br0ken shared the cracking techniques, rules, and wordlists they use. This is a good example of where I personally really benefit from writing these blog posts, since I learn a lot from the comments/discussions they generate.

Because of that I wanted to share/document some of the advice and links that came out of that conversation.

Good Wordlists:

Discussion of Rulesets and Wordlists:

Future Research Ideas:

The list of topics keeps growing and growing. Here are new items to throw on my backlog:

  • Figure out how to integrate variable length NGrams into OMEN attacks
    • This is a big take-away from this blog post. I don't know if these attacks will be better than current OMEN implementations, but it'll certainly be interesting to see their results.
  • Fix up py-omen so it actually runs
    • I'm kind of tempted to spend time on this so I can experiment with OMEN without having to deal with the full PCFG toolset
  • Perform tests to identify good rulesets and dictionaries for password cracking attacks
    • It may be important to separate this into good rulesets for research vs. real cracking attacks. The reason for this is that many of the newer rules and dictionaries are based on well documented password lists like RockYou and LinkedIn. That's perfectly ok for real cracking sessions, but can cause problems for research when you want to use those datasets to test against.
