[Tutor] String encoding

Jerry Hill malaclypse2 at gmail.com
Fri Aug 26 17:23:14 CEST 2011


On Thu, Aug 25, 2011 at 7:07 PM, Prasad, Ramit
<ramit.prasad at jpmorgan.com> wrote:
> Nice catch! Yeah, I am stuck on the encoding mechanism as well. I know how to encode/decode...but not what encoding to use. Is there a reference that I can look up to find what encoding that would correspond to? I know what the character looks like if that helps. I know that Python does display the correct character sometimes, but not sure when or why.

In this case, the encoding is almost certainly "latin-1".  I know that
from playing around at the interactive interpreter, like this:

>>> s = 'M\xc9XICO'
>>> print s.decode('latin-1')
MÉXICO
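
That kind of experiment can be sketched as a small loop. This is only an illustration, written in Python 3 syntax (the session above is Python 2, where a plain string is already a byte string). The function name try_decodings and the list of candidate encodings are my own assumptions, picked for the example:

```python
def try_decodings(raw, candidates):
    """Decode raw under each candidate encoding.

    Returns a dict mapping encoding name to the decoded text, or to None
    when the bytes are not valid in that encoding.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # The bytes are not legal in this encoding, so rule it out.
            results[name] = None
    return results


# The same bytes as in the interpreter session above.
raw = b'M\xc9XICO'

# Hypothetical candidate list; extend it with whatever your data source
# might plausibly have used.
results = try_decodings(raw, ['utf-8', 'latin-1', 'cp1252', 'cp437'])

if __name__ == '__main__':
    for name, text in results.items():
        print(name, '->', text)
```

Here utf-8 is ruled out because the bytes are not valid UTF-8, while latin-1 and cp1252 both give the expected word. cp437 also decodes without an error but to a different character, which is why a human still has to look at the results and judge which one is right.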

If you want to see charts of various encodings, Wikipedia has a bunch.
For instance, the Latin-1 encoding is here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1 and UTF-8 is here:
http://en.wikipedia.org/wiki/Utf-8

As the other respondents have said, it's really hard to figure this
out just in code.  The chardet module mentioned by Steven D'Aprano is
probably the best bet if you really *have* to guess the encoding of an
arbitrary sequence of bytes, but it is much, much better to actually know
the encoding of your inputs.
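
If you do end up guessing, one simple fallback strategy can be sketched with the standard library alone. To be clear, this is not what chardet does (chardet uses statistical detection); it is a much cruder stand-in, in Python 3 syntax, and the function name decode_best_effort and its defaults are assumptions made up for this example:

```python
def decode_best_effort(raw, preferred=('utf-8',), fallback='latin-1'):
    """Try strict encodings first, then fall back to latin-1.

    Returns a (text, encoding_name) pair. Latin-1 maps every possible
    byte value to a character, so the fallback never raises, but that
    also means it can silently produce the wrong text.
    """
    for name in preferred:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # Not valid in this encoding; try the next one.
            continue
    return raw.decode(fallback), fallback


# Bytes that are not valid UTF-8 fall through to latin-1.
text, used = decode_best_effort(b'M\xc9XICO')

# Bytes that are valid UTF-8 are decoded as UTF-8.
text2, used2 = decode_best_effort('M\u00c9XICO'.encode('utf-8'))
```

Trying UTF-8 first works reasonably well because text in other 8-bit encodings that contains accented characters is rarely valid UTF-8 by accident. The result is still a guess, so record which encoding was used and prefer an explicitly known encoding whenever the source of the data can tell you one.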

Good luck!

-- 
Jerry

