Sunday, May 31, 2009

Character Frequency Analysis Info

Sorry for the long delay between postings. Between the IEEE Security & Privacy conference and moving back up to D.C. for the summer, my computer time has been limited.

Well, enough whining from me. On to the data!

With character frequency analysis, I normally divide it up into three sections: first letter analysis, last letter analysis, and overall analysis. (I probably should do middle as well, but I've found it closely mirrors the overall analysis.) There are a couple of reasons for this distinction:

1) They do vary quite a bit. People tend to capitalize the first letter much more often than any other letter. Also, people tend to put numbers at the end of passwords. You get the idea.

2) While I like using Markov models (they track the conditional probability of letters appearing together; for example, if you have a 'q', the next letter will almost always be a 'u'), they can sometimes be a pain to set up. In that case, letter frequency analysis greatly helps when performing targeted brute force attacks. I can't stress this enough. If you are using A-Z0-9 in your John the Ripper config files (or, god forbid, the default Cain & Abel character sets), you are really hurting yourself. So when I'm doing brute force, I'll use the first letter analysis to order my character set for the first character, the last letter analysis to order the character set for the last character, etc.

2-b) If anyone is interested, I can throw up a table showing the Markov probabilities. Normally I just cheat, though, and use John the Ripper's built-in Markov models. You can train it yourself by passing JtR a list of passwords, which is really nice. Eventually I should post my JtR character set file (the Markov probabilities for its "incremental", read: brute force, attack), but if someone wants it sooner, let me know and I can e-mail it to you.

3) The first character analysis can also be very useful when attacking pass-phrases where people only used the first letter of each word.
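To make the Markov idea from point 2 concrete, here is a minimal sketch of estimating the conditional probability of one character following another from a list of passwords. This is only an illustration, not how John the Ripper implements its models; the function name `bigram_probabilities` and the toy password list are made up for the example.

```python
from collections import Counter, defaultdict


def bigram_probabilities(passwords):
    """Estimate P(next char | current char) from a list of passwords."""
    pair_counts = defaultdict(Counter)
    for pw in passwords:
        # Count every adjacent character pair inside a single password.
        for cur, nxt in zip(pw, pw[1:]):
            pair_counts[cur][nxt] += 1

    probs = {}
    for cur, followers in pair_counts.items():
        total = sum(followers.values())
        # Normalize the follower counts so they sum to 1 for each character.
        probs[cur] = {nxt: n / total for nxt, n in followers.items()}
    return probs


# Toy training list, invented for illustration (not from the phpbb.com data).
sample = ["queen", "quick1", "qwerty", "aqua"]
probs = bigram_probabilities(sample)

# Show what tends to follow 'q' in the toy list, most likely first.
for nxt, p in sorted(probs["q"].items(), key=lambda kv: -kv[1]):
    print(f"q -> {nxt}: {p:.2f}")
```

On a real password list the same counting gives a full table of follower probabilities, which is the kind of information a Markov-based attack leans on.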

So here is the data. All data is for the phpbb.com list. (Edit: I'm having trouble pasting some of the non-ASCII characters into Blogger. The 'æ' is actually an 'i' with two dots.)
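For anyone who wants to build the same kind of tables from their own list, the following is a rough sketch of the counting involved. It assumes the passwords are already loaded as a list of plaintext strings; the helper names (`position_frequencies`, `to_percentages`, `charset`) and the four sample passwords are hypothetical and are not the tooling used to produce the numbers in this post.

```python
from collections import Counter


def position_frequencies(passwords):
    """Count characters by first position, last position, and overall."""
    first, last, overall = Counter(), Counter(), Counter()
    for pw in passwords:
        if not pw:
            continue  # skip empty entries
        first[pw[0]] += 1
        last[pw[-1]] += 1
        overall.update(pw)  # every character in the password
    return first, last, overall


def to_percentages(counter):
    """Return (char, percent) pairs, most frequent first."""
    total = sum(counter.values())
    return [(ch, 100.0 * n / total) for ch, n in counter.most_common()]


def charset(counter):
    """Return the characters as one string, most frequent first."""
    return "".join(ch for ch, _ in counter.most_common())


# Tiny made-up sample, for illustration only.
sample = ["Password1", "monkey1", "summer09", "sunshine"]
first, last, overall = position_frequencies(sample)

print("First charset:", charset(first))
print("Last charset: ", charset(last))
for ch, pct in to_percentages(overall)[:5]:
    print(f"{ch}       {pct:.5f}")
```

Characters with identical counts come out in the order they were first seen, so the tail of a charset built this way is somewhat arbitrary.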

Quick cheat sheet:

Overall Character Frequency Charset:
aeorisn1tl2md0cp3hbuk45g9687yfwjvzxqASERBTMLNPOIDCHGKFJUW.!Y*@V-ZQX_$#,/+?;^ %~=&`\)][:<(æ>"ü|{'öä}

First Character Frequency Charset:
s1mpabctdrlfhgkjnw2ei0ov3q45796z8yuxSMPBACDTJRLFGHKNEWVIOZUQY!X*@.$#_-`[,~=/^<+\?;%{]:(&

Last Character Frequency Charset:
e1nsra326yt0d954o78lkgmihbpcxuwfzjvq!ESANRDYBT.O*LHMGKCX@PI$#U-ZWFJ_Q?+^V/,;)%~=`]&æ\>:"}[
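As a rough illustration of how these three charsets can drive a targeted brute force, the sketch below generates fixed-length guesses using the first-character ordering for position one, the last-character ordering for the final position, and the overall ordering for everything in between. The function `ordered_candidates` is a hypothetical helper, not part of John the Ripper or any other tool, and the charsets are truncated to their first four characters to keep the output small. Note that `itertools.product` simply varies the rightmost position fastest, so this is a plain ordered brute force rather than a true most-probable-first enumeration.

```python
from itertools import product


def ordered_candidates(length, first_cs, middle_cs, last_cs):
    """Yield guesses of a fixed length using per-position charset orderings."""
    if length < 1:
        return
    if length == 1:
        # A one-character guess only has a "first" position.
        yield from first_cs
        return
    # First position, then (length - 2) middle positions, then last position.
    pools = [first_cs] + [middle_cs] * (length - 2) + [last_cs]
    for combo in product(*pools):
        yield "".join(combo)


# Leading characters of the charsets above, truncated for the example.
FIRST = "s1mp"
MIDDLE = "aeor"
LAST = "e1ns"

for guess in list(ordered_candidates(3, FIRST, MIDDLE, LAST))[:8]:
    print(guess)
```

In practice you would pass the full charsets and loop over the lengths you care about, or feed the orderings into your cracker's own configuration.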

Now for the actual percentages:

Overall Character Frequency Analysis (letter/probability, in percent):

a       7.52766

e       7.0925

o       5.17

r       4.96032

i       4.69732

s       4.61079

n       4.56899

1       4.35053

t       3.87388

l       3.77728

2       3.12312

m       2.99913

d       2.76401

0       2.74381

c       2.57276

p       2.45578

3       2.43339

h       2.41319

b       2.29145

u       2.10191

k       1.96828

4       1.94265

5       1.88577

g       1.85331

9       1.79558

6       1.75647

8       1.66225

7       1.621

y       1.52483

f       1.2476

w       1.24492

j       0.836677

v       0.833626

z       0.632558

x       0.573305

q       0.346119

A       0.130466

S       0.108132

E       0.0970865

R       0.08476

B       0.0806715

T       0.0801223

M       0.0782306

L       0.0775594

N       0.0748134

P       0.073715

O       0.0729217

I       0.070908

D       0.0698096

C       0.0660872

H       0.0544319

G       0.0497332

K       0.0460719

F       0.0417393

J       0.0363083

U       0.0350268

W       0.0320367

.       0.0316706

!       0.0306942

Y       0.0255073

*       0.0241648

@       0.0238597

V       0.0235546

-       0.0197712

Z       0.0170252

Q       0.0147064

X       0.0142182

_       0.0122655

$       0.00970255

#       0.00854313

,       0.00323418

/       0.00311214

+       0.00231885

?       0.00207476

;       0.00207476

^       0.00195272

(space) 0.00189169

%       0.00170863

~       0.00152556

=       0.00140351

&       0.00134249

`       0.00115942

\       0.00115942

)       0.00115942

]       0.0010984

[       0.0010984

:       0.000549201

<       0.000427156

(       0.000427156

æ       0.000183067

>       0.000183067

"       0.000183067

ü       0.000122045

|       0.000122045

{       0.000122045

'       0.000122045

ö       6.10223e-05

ä       6.10223e-05

}       6.10223e-05


----------------------------------------

First Character Frequency Analysis:

s       7.55118

1       6.26416

m       6.16403

p       6.0229

a       5.17827

b       4.96031

c       4.85069

t       4.37507

d       4.15582

r       3.11136

l       3.09842

f       3.06432

h       2.99915

g       2.96764

k       2.9124

j       2.84766

n       2.53389

w       2.26717

2       2.11309

e       1.91844

i       1.77903

0       1.76004

o       1.33104

v       1.22573

3       1.11179

q       1.07467

4       1.02461

5       0.957276

7       0.918433

9       0.906348

6       0.883905

z       0.871821

8       0.85542

y       0.705225

u       0.56021

x       0.518345

S       0.360813

M       0.296074

P       0.282263

B       0.256799

A       0.24644

C       0.237809

D       0.227019

T       0.218387

J       0.183428

R       0.179112

L       0.173501

F       0.166164

G       0.162711

H       0.153648

K       0.143289

N       0.114804

E       0.101856

W       0.100562

V       0.0828661

I       0.0820029

O       0.0599916

Z       0.0474754

U       0.0392751

Q       0.0388435

Y       0.0332328

!       0.0258957

X       0.0224429

*       0.0220113

@       0.0202849

.       0.0151058

$       0.013811

#       0.0120846

_       0.00517913

-       0.00474754

`       0.00302116

[       0.00302116

,       0.00302116

~       0.00258957

=       0.00215797

/       0.00215797

^       0.00172638

<       0.00172638

+       0.00172638

\       0.00129478

?       0.00129478

;       0.00129478

%       0.00129478

{       0.000863189

]       0.000863189

:       0.000863189

(       0.000863189

&       0.000431594


----------------------------------------

Last Character Frequency Analysis:

e       7.34531

1       6.7933

n       5.81012

s       5.6513

r       5.35566

a       5.1869

3       4.59734

2       3.91327

6       3.77602

y       3.59302

t       3.51404

0       3.48167

d       3.3004

9       3.07425

5       2.96031

4       2.9288

o       2.91887

7       2.88262

8       2.62323

l       2.45146

k       1.75918

g       1.66552

m       1.63747

i       1.6038

h       1.54209

b       1.32154

p       1.12258

c       1.06474

x       1.06043

u       0.848515

w       0.726805

f       0.69832

z       0.612864

j       0.317654

v       0.277947

q       0.220976

!       0.130342

E       0.0975403

S       0.08934

A       0.08934

N       0.0815713

R       0.0694867

D       0.0604232

Y       0.0535177

B       0.0487702

T       0.0448858

.       0.0444542

O       0.0431594

*       0.0384119

L       0.0345276

H       0.0332328

M       0.02978

G       0.0293484

K       0.0250325

C       0.0250325

X       0.0241693

@       0.0237377

P       0.0228745

I       0.0220113

$       0.0198533

#       0.0185586

U       0.015969

-       0.0146742

Z       0.0142426

W       0.0142426

F       0.013811

J       0.0107899

_       0.00949508

Q       0.00733711

?       0.00733711

+       0.00733711

^       0.00474754

V       0.00474754

/       0.00474754

,       0.00431594

;       0.00388435

)       0.00388435

%       0.00388435

~       0.00302116

=       0.00302116

`       0.00215797

]       0.00172638

&       0.00172638

æ       0.000863189

\       0.000863189

>       0.000863189

:       0.000863189

"       0.000863189

}       0.000431594

[       0.000431594