Binary strings, unicode and encodings

Peter Hansen peter at engcorp.com
Fri Jan 16 15:28:31 CET 2004


Laurent Therond wrote:
> 
> I used the interpreter on my system:
> >>> c = StringIO()
> >>> c.write('%d:%s' % (len('stringé'), 'stringé'))
> >>> print c.getvalue()
> 7:stringé
> 
> OK
> 
> Did StringIO just recognize Extended ASCII?
> Did StringIO just recognize ISO 8859-1?
> 
> é belongs to Extended ASCII AND ISO 8859-1.

No, StringIO didn't "recognize" anything but a simple string.  There is
no issue of codecs or encodings here, because you are sending in a
string (as it happens, one that is not pure 7-bit ASCII, which is
irrelevant, though it may be the cause of your confusion) and getting out
a string.  StringIO does not make any attempt to "encode" something that
is already a string.
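
As a rough illustration of that pass-through behaviour, here is a minimal
sketch.  It is written for Python 3 (an assumption; the session above is
Python 2.3), where the old byte-oriented str corresponds to bytes and
StringIO to io.BytesIO.  It also assumes the console delivered the é as
the single cp437 byte 0x82, as the tracebacks further down suggest.

```python
import io

# "string" plus e-acute, encoded the way a cp437 console would send it.
payload = "string\u00e9".encode("cp437")      # b'string\x82'

buf = io.BytesIO()
# Same formatting as in the quoted session, but on bytes.
buf.write(b"%d:%s" % (len(payload), payload))

# The buffer hands back exactly the bytes it was given; no codec is involved.
result = buf.getvalue()
print(result)
```

The length prefix is 7 because the é occupies one byte in this encoding;
nothing in the buffer inspects or converts that byte.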

> >>> print c.getvalue().decode('US-ASCII')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 8: ordinal
> not in range(128)
> 
> >>> print c.getvalue().decode('ISO-8859-1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "C:\Python23\lib\encodings\cp437.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\x82' in position 8
> : character maps to <undefined>
> >>>
> 
> OK
> 
> It must have been Extended ASCII, then.

Hmm... note that when you decode that string, you are attempting to
print a unicode rather than a string.  To print that on your console,
Python must first encode it again using the console's encoding (cp437,
going by your traceback).  I think you know this, but in case you
didn't: it explains why you got a DecodeError in the first case, but an
EncodeError in the second.  The decode in the second example worked,
treating the string as having been encoded using ISO-8859-1, and
returned a unicode; it was the print that failed.  If you had assigned
the result instead of printing it, you should have seen no errors.
-Peter


