Sunday, November 2, 2008

How do you say "Excuse Me" in Canadian?

As a follow-up to my earlier post on secure coding, I just wanted to talk about another thing that has been giving me fits: coding foreign-language support.

Normally this isn't something most people have to deal with, but a lot of the password lists I'm parsing and cracking are non-English. First, some background. The standard default scheme for holding character information (such as an 'a') is ASCII, otherwise known as the "American Standard Code for Information Interchange" (might show up in a game of Trivial Pursuit). As you can tell from its name, ASCII wasn't designed to represent non-English characters. In fact, it can only represent 128 different characters, including control characters (such as carriage return, tab, etc.). To add support for multiple languages while staying backward compatible with ASCII, another standard was developed. It's called UTF-8, short for 8-bit Unicode Transformation Format (you'll never need to know the full name for that). Besides being able to represent everything in ASCII, its original design could theoretically encode a little over 2 billion different characters (the Unicode standard itself limits that to a little over a million). Now we're talking.
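A quick Python 3 sketch makes the byte counts concrete. The sample characters and the helper name utf8_length are picked purely for illustration; nothing here comes from the password lists themselves.

```python
# Sample characters chosen for illustration, written as escapes so the
# source stays pure ASCII: 'a', e-acute, euro sign, musical G clef.
samples = ["a", "\u00e9", "\u20ac", "\U0001D11E"]

def utf8_length(ch):
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    # Print the code point and its encoded size, e.g. U+00E9 needs 2 bytes.
    print("U+%04X" % ord(ch), utf8_length(ch), "byte(s)")

# Plain ASCII text is byte-for-byte identical in both encodings,
# which is what "backward compatible" means here.
ascii_bytes = "password".encode("ascii")
utf8_bytes = "password".encode("utf-8")
```

The four samples land in the 1-, 2-, 3-, and 4-byte ranges respectively, while the all-ASCII word comes out the same under either encoding.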

One problem though is that an ASCII character takes up 1 byte of data (OK, only 7 bits, but everyone represents it as 8 bits, so get over it). UTF-8 is variable length and can range from 1 to 4 bytes per character. This means that updating old programs is a pain, because if they use any pointer arithmetic that assumes one byte per character, you're hosed. Also, I've often found myself having to update the character handling throughout the entire program, as I would find it passing 1-byte chunks of UTF-8 data as separate characters to the "logic" part. Then there are all the built-in functions, such as isalpha() and toupper(), that can completely break when passed UTF-8 data. I have to say though, if those functions always broke, that would actually be nice. The problem is, whether they work or not can depend on your system setup (the locale, for instance), which makes portability a pain: "It must be a user error, it works on MY computer".
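Both failure modes are easy to reproduce. The sketch below uses Python 3 bytes objects as a stand-in for C-style char arrays, which is an assumption made for the sake of a short example; the word "café" is just a sample.

```python
word = "caf\u00e9"             # "café": 4 characters
raw = word.encode("utf-8")     # 5 bytes, because the e-acute takes two

# Failure mode 1: handing each byte to the logic as if it were a character.
# The last two chunks are halves of a single character.
byte_chunks = [raw[i:i + 1] for i in range(len(raw))]

# Failure mode 2: byte-oriented case conversion only knows about ASCII,
# so the two bytes of the e-acute pass through unchanged.
upper_bytes = raw.upper()
# Character-aware conversion handles it.
upper_text = word.upper()

# Cutting the buffer in the middle of a character leaves invalid UTF-8.
try:
    raw[:4].decode("utf-8")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

This version fails the same way everywhere, which is the "nice" case. C's toupper() is trickier because its behavior follows the current locale, so the same binary can act differently on two machines.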

This can really rear its ugly head when you are using someone else's programs. A good example: I was just dealing with a Python script that parses text lists. Of course it didn't work originally, but I added the header


and still no luck. It wasn't until I initialized my string variable with


that it finally realized it was supposed to read UTF-8 encoded variables. It wasn't that big of a deal, but this was a trivial little script.
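For comparison, here is roughly how the same kind of job can look in Python 3, where the encoding is stated at the point of reading instead of being left to defaults. This is only a sketch and not the script I was fighting with; the function name read_wordlist and the sample file contents are made up for the example.

```python
import os
import tempfile

def read_wordlist(path, encoding="utf-8"):
    """Read one entry per line, decoding explicitly rather than relying on
    the platform default. Undecodable bytes are replaced instead of
    aborting the whole run."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\r\n") for line in handle if line.strip()]

# Build a tiny hypothetical list: Spanish, Russian, and English entries.
sample_text = "contrase\u00f1a\n\u043f\u0430\u0440\u043e\u043b\u044c\npassword\n"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text.encode("utf-8"))
    sample_path = tmp.name

words = read_wordlist(sample_path)
os.remove(sample_path)  # clean up the temporary file

for entry in words:
    # Lengths are counted in characters here, not bytes.
    print(len(entry), "characters,", len(entry.encode("utf-8")), "bytes")
```

Passing errors="replace" is a judgment call: it keeps a long parsing run alive when a list contains junk bytes, at the cost of silently altering those entries. Using errors="strict" is the safer choice when exact contents matter.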

I don't have any solutions. This is mostly just a warning. I won't even go into the security implications of the different encoding schemes, especially when it comes to web security: "Oh, you are blocking my SQL injection? Well, I'll just encode it differently and your checker won't flag it". If you are interested in that stuff though I highly recommend you check out