[Tutor] String encoding
Steven D'Aprano
steve at pearwood.info
Fri Aug 26 03:34:16 CEST 2011
Prasad, Ramit wrote:
>> I don't know what they are from but they are both the same value,
>> one in hex and one in octal.
>>
>> 0xC9 == 0311
>>
>> As for the encoding mechanisms I'm afraid I can't help there!
>
> Nice catch! Yeah, I am stuck on the encoding mechanism as well. I
> know how to encode/decode...but not what encoding to use. Is there a
> reference that I can look up to find what encoding that would
> correspond to? I know what the character looks like if that helps. I
> know that Python does display the correct character sometimes, but
> not sure when or why.
In general, no. The same byte value (0xC9) stands for a different
character in each of many encodings, so you *must* know which encoding
was used in order to decode the bytes correctly.
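To make that concrete, here is a small sketch (assuming Python 3, where bytes and text are separate types) that decodes the single byte 0xC9 under a few encodings. The list of encodings is only an illustrative choice, not a suggestion about which one applies to your data.

```python
import unicodedata

RAW = b"\xc9"  # the byte discussed in this thread

# Encodings picked purely for illustration; all of them map 0xC9 to something.
decoded = {name: RAW.decode(name)
           for name in ("latin-1", "cp437", "cp1251", "koi8-r")}

for name, ch in decoded.items():
    # Print the code point and its Unicode name rather than the raw character,
    # so the output does not depend on the terminal's own encoding.
    print(name, "->", "U+%04X" % ord(ch), unicodedata.name(ch))
```

Each encoding gives a different character for the same byte, which is why the byte alone cannot tell you the encoding.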
Think about it this way... if I gave you a block of data as hex bytes:
240F91BC03...FF90120078CD45
and then asked you whether that was a bitmap image or a sound file or
something else, how could you tell? It's just *bytes*, it could be anything.
All is not quite lost though. You could try decoding the bytes and see
what you get, and see if it makes sense. Start with ASCII, Latin-1,
UTF-8, UTF-16 and any other encodings in common use. (This would be like
pretending the bytes were a bitmap, and looking at it, and trying to
decide whether it looked like an actual picture or like a bunch of
random pixels. Hopefully it wasn't meant to look like a bunch of random
pixels.)
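The "try it and see" approach might look roughly like this in Python 3. The helper name try_decodings and the sample bytes are made up for the example; substitute your own data and candidate list.

```python
def try_decodings(data, candidates=("ascii", "utf-8", "utf-16", "latin-1")):
    """Return a dict mapping each candidate encoding to the decoded text,
    or to None when the bytes are not valid in that encoding."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Hypothetical sample: the word "cafe" with an accented final e, as UTF-8 bytes.
sample = b"caf\xc3\xa9"

for name, text in try_decodings(sample).items():
    # ascii() keeps the output printable on any terminal.
    print(name, "->", "failed" if text is None else ascii(text))
```

Note that Latin-1 never fails, because every byte value maps to some character. A successful decode only means the bytes were *valid* in that encoding, not that the result is what the author meant, so you still have to look at the text and judge whether it makes sense.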
Web browsers such as Internet Explorer and Mozilla will try to guess the
encoding by doing frequency analysis of the bytes. Mozilla's encoding
guesser has been ported to Python:
http://chardet.feedparser.org/
But any sort of guessing algorithm is just a nasty hack. You are always
better off ensuring that you accurately know the encoding.
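If guessing really is the only option, a sketch using the chardet port might look like the following. It assumes the third-party chardet package is installed (it is not part of the standard library); guess_encoding is a hypothetical wrapper name, and the sample bytes are invented.

```python
try:
    import chardet  # third-party package, assumed to be installed
except ImportError:
    chardet = None

def guess_encoding(data):
    """Return (encoding, confidence) as guessed by chardet,
    or None if chardet is not available."""
    if chardet is None:
        return None
    guess = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    return guess["encoding"], guess["confidence"]

# Invented sample; short inputs give the guesser very little to work with.
print(guess_encoding(b"caf\xc3\xa9 au lait, s'il vous pla\xc3\xaet"))
```

Treat the result as a hint, not an answer: the confidence value is only an estimate, and short or unusual inputs are easy to misidentify.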
--
Steven