[Tutor] String encoding

Fri Aug 26 19:04:58 CEST 2011

Prasad, Ramit wrote:
>> Think about it this way... if I gave you a block of data as hex
>> bytes:
>> 
>> 240F91BC03...FF90120078CD45
>> 
>> and then asked you whether that was a bitmap image or a sound file
>> or something else, how could you tell? It's just *bytes*, it could
>> be anything.
> 
> Yes, but if you give me data and then tell me it is a sound file then
> I might be able to reverse engineer or reconstruct it. I know what
> the character does/should look like. I just need the equivalent to
> the ASCII table for the various encodings; once I have the table I
> can compare different characters at \311 and see if they are the
> correct character. I have not been able to find an encoding table
> (other than ASCII).

In practice, you can often guess the encoding by trying the most common 
ones (such as Latin-1 and UTF-8) and seeing if the strings you get make 
sense.
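
To make that concrete, here is a rough sketch of the "try a few and
look" approach. It is written for Python 3 (the sessions in this post
are Python 2), and both the helper name try_encodings and the list of
candidates are my own invention for illustration, not anything
standard:

```python
# Python 3 sketch. try_encodings and its default candidate list are
# hypothetical names chosen for this example.
def try_encodings(data, candidates=("utf-8", "latin-1", "cp1252")):
    """Return (encoding, text) pairs for each candidate that decodes cleanly."""
    results = []
    for name in candidates:
        try:
            results.append((name, data.decode(name)))
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes at all; skip it.
            pass
    return results

# b"M\xc9XICO" is the same byte-string as 'M\311XICO' (octal 311 == hex C9).
for name, text in try_encodings(b"M\xc9XICO"):
    print(name, repr(text))
```

A candidate that raises an error is ruled out, but a candidate that
succeeds is only a possibility: you still have to read the result and
decide whether it makes sense.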

But note that more than one encoding may give sensible results for a 
specific string:

 >>> b = 'M\311XICO'  # byte-string
 >>> print b.decode('latin-1')
MÉXICO
 >>> print b.decode('iso 8859-9')  # Turkish
MÉXICO

So was M\311XICO encoded using Latin-1, the Turkish encoding (ISO 
8859-9), or something else? There is no way to tell from the bytes 
alone, because many encodings overlap: both of these map byte \311 to É.
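
If what you want is a table to compare against, you can get Python to
build one from its own codecs. The sketch below (Python 3 again; the
function name differing_bytes is made up for this example) decodes
every single byte value under two encodings and reports only the
positions where they disagree:

```python
# Python 3 sketch. differing_bytes is a hypothetical helper name.
def differing_bytes(enc_a, enc_b):
    """List (byte value, char under enc_a, char under enc_b) where the two differ."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" keeps going if a codec leaves a byte undefined.
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diffs.append((value, a, b))
    return diffs

for value, a, b in differing_bytes("latin_1", "iso8859_9"):
    print("0x%02X  %r  %r" % (value, a, b))
```

For Latin-1 against ISO 8859-9 only six byte values differ (the
Icelandic letters are swapped for Turkish ones), and \311 is not one
of them, which is why the two decodes above agree.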

If you have arbitrary byte-strings, and no context to tell what makes 
sense, then all bets are off. Just because something *can* be decoded 
doesn't make it meaningful:

 >>> b = '...\xf7...'
 >>> print b.decode('macroman')
...˜...
 >>> print b.decode('latin-1')
...÷...

Which is the right encoding to use and which string is intended?

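So guessing can sometimes work, but guesses can be wrong because 
encodings overlap. In general, you must know the encoding to be sure. 
But if you have to guess, guess using the largest byte-string that you 
can: the more non-ASCII bytes you have to look at, the more likely it 
is that a wrong encoding will show up as obvious nonsense.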

Python 2.7 comes with 108 encodings:

http://docs.python.org/library/codecs.html#standard-encodings
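
You can also ask your own installation what it knows about. This is
only an approximation, sketched for Python 3: it collects the codec
module names that appear in the alias table, so it misses any codec
that has no alias, and it may include codecs that are not text
encodings at all. The number you get will vary between Python
versions.

```python
import encodings.aliases

# Each alias maps to the name of a codec module; collect the distinct
# module names. This is an approximation, not an official list.
codec_names = sorted(set(encodings.aliases.aliases.values()))

print(len(codec_names), "codec modules reachable through an alias")
print(codec_names[:10])
```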

Since anyone can define their own encoding, there is no upper limit to 
the number of encodings, and no promise that Python will include them 
all. There are even two joke encodings, invented for April Fools' Day, 
that use nine-bit nonets instead of eight-bit octets (bytes): UTF-9 and 
UTF-18.

-- 
Steven