Tuesday, April 27, 2010

Oh, Base64 MIME, why did you do it wrong?

Base64 MIME encoding, used in email, some URLs, and probably other contexts I'm forgetting, uses a mapping that is unintuitive and inconsistent with hexadecimal.

Hexadecimal places the digits (0-9) at the front of the symbol range, so each digit's numeric value is exactly what one would expect: user-friendly and intuitive. For reasons unclear, the Base64 MIME spec places them near the end, giving them unintuitive values of 52 through 61. This means it is fundamentally not an extension of the approach used in hexadecimal, which is easy to build on by simply adding letters. For instance, if we add G to the normal hexadecimal range of 0-F we can represent base 17, and by the same principle we can represent any base for which we have sufficient characters available. And so, base 64 could have been 0-9A-Za-z+/= rather than A-Za-z0-9+/= as it is.
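
To make the difference concrete, here is a minimal Python sketch that builds both alphabets and looks up the value each one assigns to a symbol. The names EXTENDED, MIME, and digit_value are my own, chosen for illustration; the "=" padding character is left out because it is not a digit and carries no value.

```python
import string

# Hex-style alphabet extended in the obvious way:
# digits, then uppercase letters, then lowercase letters, then "+/".
EXTENDED = string.digits + string.ascii_uppercase + string.ascii_lowercase + "+/"

# The alphabet Base64 MIME actually uses: letters first, digits near the end.
MIME = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def digit_value(symbol, alphabet):
    """Return the numeric value a symbol has in the given alphabet."""
    return alphabet.index(symbol)


if __name__ == "__main__":
    # Print each symbol with its value in the extended and MIME alphabets.
    for symbol in "09AFGz":
        print(symbol, digit_value(symbol, EXTENDED), digit_value(symbol, MIME))
```

In the extended alphabet the first 16 symbols are exactly the hexadecimal digits, so "9" is 9 and "F" is 15; in the MIME alphabet "0" is 52 and "9" is 61.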

In case you're wondering why I care about this, some time ago I wrote (in Perl) a numeric base conversion application, and I was able to construct the appropriate base mapping automatically by building up using the same approach embodied by hexadecimal. Base64 MIME, however, effectively scrambles this mapping for apparently no good reason.
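
The original program was in Perl and is not reproduced here. As a rough illustration of the idea only, the following Python sketch converts non-negative integers to and from any base between 2 and 64 using a single alphabet built up hexadecimal-style. ALPHABET, to_base, and from_base are hypothetical names for this sketch, not anything from that program.

```python
import string

# One alphabet serves every base: base N simply uses the first N symbols.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "+/"


def to_base(number, base):
    """Render a non-negative integer as a string in the given base (2..64)."""
    if not 2 <= base <= len(ALPHABET):
        raise ValueError("base must be between 2 and 64")
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, base)
        digits.append(ALPHABET[remainder])
    # Remainders come out least significant first, so reverse them.
    return "".join(reversed(digits))


def from_base(text, base):
    """Parse a string of digits in the given base (2..64) into an integer."""
    if not 2 <= base <= len(ALPHABET):
        raise ValueError("base must be between 2 and 64")
    value = 0
    for symbol in text:
        digit = ALPHABET.index(symbol)  # raises ValueError for unknown symbols
        if digit >= base:
            raise ValueError(f"{symbol!r} is not a valid digit in base {base}")
        value = value * base + digit
    return value
```

Because every base is a prefix of the same alphabet, to_base(255, 16) gives "FF" and to_base(16, 17) gives "G" with no per-base tables. Supporting the Base64 MIME ordering would need a separate, special-cased alphabet.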

I like the encoding approach of Base64 MIME (four 6-bit groups <==> three 8-bit bytes), but the choice of mapping was arrived at with little consideration for its precursors or normal human expectations (yes, I realize it was written largely for the benefit of machines).
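
For anyone who has not seen the bit regrouping spelled out, here is a small Python sketch of it for a single three-byte chunk. It handles only the exact three-byte case (no padding, no line wrapping), and MIME and encode_triplet are names I made up for the example.

```python
import string

# The Base64 MIME alphabet: index = 6-bit value, character = output symbol.
MIME = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_triplet(chunk):
    """Encode exactly three bytes as four Base64 MIME symbols."""
    if len(chunk) != 3:
        raise ValueError("this sketch only handles exactly three bytes")
    # Treat the three bytes as one 24-bit number...
    bits = int.from_bytes(chunk, "big")
    # ...then slice it into four 6-bit groups, most significant first.
    return "".join(MIME[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0))


if __name__ == "__main__":
    print(encode_triplet(b"Man"))
```

The result for b"Man" is "TWFu", which matches what Python's standard base64.b64encode produces for the same input. Swapping in a differently ordered alphabet would change only the lookup table, not the regrouping.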

See Base64 for an explanation of the actual spec.