[Tutor] Weird Unicode encode/decode errors in Python 2
Steven D'Aprano
steve at pearwood.info
Sat Dec 8 19:02:47 EST 2018
This is not a request for help, but a demonstration of what can go wrong
with text processing in Python 2.
Following up on the "Special characters" thread, one of the design flaws
of Python 2 is that byte strings and text strings offer BOTH decode and
encode methods, even though only one is meaningful in each case.[1]
- text strings are ENCODED to bytes;
- bytes are DECODED to text strings.
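As a minimal sketch of the two directions, here is a round trip written for Python 3, where the two types are kept strictly apart. The variable names are only illustrative, and the character is spelled with an escape so the snippet does not depend on the source file's encoding.

```python
# Text is ENCODED to bytes; bytes are DECODED to text.
text = "\u00e4"                      # the one-character text string 'ä'
data = text.encode("latin1")         # text -> bytes
round_trip = data.decode("latin1")   # bytes -> text

print(repr(data))
print(round_trip == text)
```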
One of the symptoms of getting it wrong is when you take a Unicode text
string and encode/decode it but get an error from the *opposite*
operation:
py> u'ä'.decode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
Look at what happens: I try to DECODE a string, but get an ENCODE error.
And even though I specified Latin 1 as the codec, Python uses ASCII.
What is going on here?
Behind the scenes, the interpreter takes my text u'ä' (a Unicode string)
and attempts to *encode* it to bytes first, using the default ASCII
codec. That fails. Had it succeeded, it would have then attempted to
*decode* those bytes using Latin 1.
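The hidden two-step behaviour can be approximated in Python 3, which never performs the implicit conversion itself. This is only a sketch: py2_style_decode is a hypothetical helper written for illustration, not something Python provides, and it assumes the default codec is ASCII as described above.

```python
def py2_style_decode(text, codec):
    """Hypothetical helper imitating Python 2's unicode.decode()."""
    # Step 1: the implicit encode, using the default ASCII codec.
    raw = text.encode("ascii")
    # Step 2: the decode the caller actually asked for.
    return raw.decode(codec)


error_name = None
try:
    py2_style_decode("\u00e4", "latin1")
except UnicodeEncodeError as err:
    # The failure comes from step 1, so it is an *encode* error
    # and it names the ASCII codec, not Latin 1.
    error_name = type(err).__name__
    print(error_name, "raised by codec", err.encoding)
```

With pure-ASCII input, step 1 succeeds and the helper returns the text unchanged, which is why the mistake can go unnoticed for a long time.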
Similarly:
py> b = u'ä'.encode('latin1')
py> print repr(b)
'\xe4'
py> b.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
The error here is that I tried to encode a bunch of bytes, instead of
decoding them. But the insidious thing about this error is that if you
are working with pure ASCII, it seems to work:
py> 'ascii'.encode('utf-16')
'\xff\xfea\x00s\x00c\x00i\x00i\x00'
That is, it *seems* to work because there's no error, but the result is
pretty much meaningless: I *intended* to get a UTF-16 Unicode string,
but Python silently decoded my bytes as ASCII and re-encoded them, so I
ended up with bytes just like I started with.
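The same kind of approximation works for the opposite mistake. In this Python 3 sketch, py2_style_encode is again a hypothetical helper, and it assumes Python 2 first decodes the bytes with the default ASCII codec before encoding them.

```python
def py2_style_encode(data, codec):
    """Hypothetical helper imitating Python 2's str.encode() on bytes."""
    # Step 1: the implicit decode, using the default ASCII codec.
    text = data.decode("ascii")
    # Step 2: the encode the caller actually asked for.
    return text.encode(codec)


# Pure ASCII input: no error, but bytes go in and bytes come out.
result = py2_style_encode(b"ascii", "utf-16")
print(type(result).__name__)

# Non-ASCII input: step 1 fails, so an *encode* call reports a decode error.
failure = None
try:
    py2_style_encode(b"\xe4", "latin1")
except UnicodeDecodeError as err:
    failure = type(err).__name__
    print(failure)
```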
Python 3 fixes this bug magnet by removing the decode method from
Unicode text strings, and the encode method from byte-strings.
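A quick way to confirm this under Python 3 is to check for the methods directly. This is a small sketch, and the variable names are only illustrative.

```python
# In Python 3, str is the text type and bytes is the byte-string type.
text_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")
print(text_has_decode, bytes_has_encode)

# The earlier mistake now fails immediately, with an error that
# points at the real problem instead of at the wrong codec.
message = None
try:
    "\u00e4".decode("latin1")
except AttributeError as err:
    message = str(err)
    print(message)
```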
[1] Technically this is not so, as there are codecs which can be used to
convert bytes to bytes, or text to text. But in the vast majority of
common cases, codecs are used to convert bytes to text and vice versa.
For the rare exception, we can use the "codecs" module.
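As a sketch of those rare exceptions in Python 3, the functions codecs.encode and codecs.decode accept the bytes-to-bytes and text-to-text codecs. The two codecs chosen here, hex_codec and rot_13, are just examples picked for illustration.

```python
import codecs

# Bytes-to-bytes: hex_codec turns each byte into two hex digits.
hex_bytes = codecs.encode(b"hello", "hex_codec")
original = codecs.decode(hex_bytes, "hex_codec")

# Text-to-text: rot_13 rotates each ASCII letter by 13 places.
rotated = codecs.encode("hello", "rot_13")

print(hex_bytes, original, rotated)
```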
--
Steve