Binary strings, unicode and encodings
Peter Hansen
peter at engcorp.com
Fri Jan 16 09:28:31 EST 2004
Laurent Therond wrote:
>
> I used the interpreter on my system:
> >>> c = StringIO()
> >>> c.write('%d:%s' % (len('stringé'), 'stringé'))
> >>> print c.getvalue()
> 7:stringé
>
> OK
>
> Did StringIO just recognize Extended ASCII?
> Did StringIO just recognize ISO 8859-1?
>
> é belongs to Extended ASCII AND ISO 8859-1.
No, StringIO didn't "recognize" anything but a simple string. There is
no issue of codecs and encoding and such going on here, because you are
sending in a string (as it happens, one that contains a non-ASCII byte,
but that's irrelevant, though it may be the cause of your confusion) and
getting out a string. StringIO does not make any attempt to "encode"
something that is already a string.
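
A minimal sketch of the same pass-through behaviour, written for Python 3,
where the byte string has to be spelled out explicitly. This is an
illustration, not the session from the original post: io.BytesIO stands in
for the Python 2 StringIO, and the choice of cp437 for the console encoding
is an assumption based on the cp437.py frame in the traceback quoted in
this thread.

```python
import io

# Assumption: the console delivered 'é' as the single byte 0x82 (code page 437).
payload = 'stringé'.encode('cp437')

buf = io.BytesIO()
# Same "<length>:<data>" format as in the quoted session, built from bytes.
buf.write(b'%d:%s' % (len(payload), payload))
result = buf.getvalue()

# The buffer hands back exactly the bytes it was given; no codec is involved.
print(repr(result))
```

The length is 7 because the byte string holds seven bytes, whatever
character the seventh byte is later taken to mean.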
> >>> print c.getvalue().decode('US-ASCII')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 8: ordinal not in range(128)
>
> >>> print c.getvalue().decode('ISO-8859-1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "C:\Python23\lib\encodings\cp437.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\x82' in position 8: character maps to <undefined>
> >>>
>
> OK
>
> It must have been Extended ASCII, then.
Hmm... note that when you decode that string, you are attempting to
print a unicode rather than a string. To print it on your console,
Python must encode it again using the console's encoding (cp437, judging
by the traceback). I think you know this, but in case you didn't: it
explains why you got a DecodeError in the first case, but an EncodeError
in the second. In the first case the decode itself failed, because byte
0x82 is not valid ASCII. In the second case the decode worked, treating
the string as having been encoded using ISO-8859-1, and returned a
unicode; only the print failed. If you had assigned the result instead
of printing it, you should have seen no errors.
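
The two failures can be separated into their individual steps. The
following is a hedged sketch in Python 3, not a replay of the original
Python 2.3 session; the explicit encode to cp437 stands in for what print
does implicitly on that console, and cp437 as the console encoding is an
assumption taken from the traceback.

```python
raw = b'7:string\x82'   # the bytes StringIO returned in the quoted session

# Case 1: the decode step itself fails, since 0x82 is outside ASCII.
decode_error = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    decode_error = exc

# Case 2: the decode step succeeds; every byte is valid ISO-8859-1.
text = raw.decode('iso-8859-1')

# Printing is what fails: U+0082 has no representation in cp437.
encode_error = None
try:
    text.encode('cp437')
except UnicodeEncodeError as exc:
    encode_error = exc

# Decoding with the encoding the bytes were (presumably) produced in
# recovers the intended character.
intended = raw.decode('cp437')

print(type(decode_error).__name__, type(encode_error).__name__, intended)
```

Assigning the decoded value to a name, as text is here, raises nothing;
the error only appears once the value has to be encoded for output.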
-Peter