[Tutor] String encoding

Steven D'Aprano steve at pearwood.info
Fri Aug 26 03:34:16 CEST 2011


Prasad, Ramit wrote:
>> I don't know what they are from but they are both the same value,
>> one in hex and one in octal.
>> 
>> 0xC9 == 0311
>> 
>> As for the encoding mechanisms I'm afraid I can't help there!
> 
> Nice catch! Yeah, I am stuck on the encoding mechanism as well. I
> know how to encode/decode...but not what encoding to use. Is there a
> reference that I can look up to find what encoding that would
> correspond to? I know what the character looks like if that helps. I
> know that Python does display the correct character sometimes, but
> not sure when or why.

In general, no. The same byte value (0xC9) stands for a different
character in each of many encodings, so you *must* know which encoding
was used in order to decode the bytes correctly.
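
To make that concrete, here is a small sketch (Python 3) that decodes the single byte 0xC9 under a few encodings. The encodings listed and the helper name `describe` are my own picks for illustration, not anything from the original question.

```python
import unicodedata

# 0xC9 (hex) and 0o311 (octal; spelled 0311 in Python 2) are the same integer.
assert 0xC9 == 0o311 == 201
raw = bytes([0xC9])

def describe(data, encoding):
    """Return the Unicode names of the decoded characters, or the error reason."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError as err:
        return "cannot decode: " + err.reason
    return ", ".join(unicodedata.name(ch, "<unnamed>") for ch in text)

# Illustrative selection only; many other encodings assign 0xC9 yet another meaning.
for name in ("latin-1", "cp437", "koi8_r", "utf-8"):
    print(name, "->", describe(raw, name))
```

The same byte comes out as an accented Latin letter, a box-drawing character, or a Cyrillic letter depending on the encoding, and as UTF-8 it is not a complete character at all.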

Think about it this way... if I gave you a block of data as hex bytes:

240F91BC03...FF90120078CD45

and then asked you whether that was a bitmap image or a sound file or 
something else, how could you tell? It's just *bytes*, it could be anything.

All is not quite lost though. You could try decoding the bytes and see 
what you get, and see if it makes sense. Start with ASCII, Latin-1, 
UTF-8, UTF-16 and any other encodings in common use. (This would be like 
pretending the bytes were a bitmap, and looking at it, and trying to 
decide whether it looked like an actual picture or like a bunch of 
random pixels. Hopefully it wasn't meant to look like a bunch of random 
pixels.)
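
The trial-and-error approach above might be sketched like this in Python 3. The function name `try_decodings`, the candidate list, and the sample bytes are all assumptions made up for the example.

```python
def try_decodings(data, encodings=("ascii", "utf-8", "utf-16", "latin-1")):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Made-up sample: the word "cafe" with an acute accent on the e, encoded as UTF-8.
sample = b"caf\xc3\xa9"
for name, text in try_decodings(sample).items():
    # ascii() keeps the output printable on any terminal.
    print(name, "->", "failed" if text is None else ascii(text))
```

Note that a successful decode proves very little by itself: Latin-1 assigns a character to every possible byte, so it never fails. You still have to look at the result and judge whether it makes sense.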

Web browsers such as Internet Explorer and Mozilla will try to guess the 
encoding by doing frequency analysis of the bytes. Mozilla's encoding 
guesser has been ported to Python:

http://chardet.feedparser.org/

But any sort of guessing algorithm is just a nasty hack. You are always 
better off ensuring that you accurately know the encoding.
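
If you do have to fall back on guessing, calling the chardet port looks roughly like the sketch below. It assumes the third-party chardet package is installed; the wrapper name `guess_encoding` and the sample text are made up for the example, and the guess it prints may well be wrong for short inputs.

```python
try:
    import chardet  # third-party package; assumed to be installed separately
except ImportError:
    chardet = None

def guess_encoding(data):
    """Return chardet's guess as a dict, or None if chardet is not available."""
    if chardet is None:
        return None
    # detect() returns a dict that includes 'encoding' and 'confidence' keys.
    return chardet.detect(data)

# Made-up sample: a short Russian sentence encoded as KOI8-R.
sample = "Приветствую вас, дорогие читатели".encode("koi8_r")
guess = guess_encoding(sample)
print(guess)
```

Treat the reported confidence as a hint only. It is still a guess, not knowledge of the encoding.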


-- 
Steven

