[Tutor] ascii codec cannot encode character
Steven D'Aprano
steve at pearwood.info
Fri Jan 28 03:25:36 CET 2011
Alex Hall wrote:
> Hello again:
> I have never seen this message before. I am pulling xml from a site's
> api and printing it, testing the wrapper I am writing for the api. I
> have never seen this error until just now, in the twelfth result of my
> search:
> UnicodeEncodeError: 'ASCII' codec can't encode character u'\u2019' in
> position 42: ordinal not in range(128)
>
> I tried making the strings Unicode by saying something like
> self.title=unicode(data.find("title").text)
> but the same error appeared. I found the manual chapter on this, but I
> am not sure I want to ignore since I do not know what this character
> (or others) might mean in the string. I am not clear on what 'replace'
> will do. Any suggestions?
Short version
=============
You need to decode the bytes you get from the XML into unicode
characters. You would do this using something like:
unicode(data.find("title").text, encoding='utf-8')
If that doesn't work, change utf-8 to another encoding. If the XML file
tells you what the encoding should be, use that.
Alternatively, you could say:
unicode(data.find("title").text, errors='replace')
to substitute a "missing character" glyph for any undecodable bytes in
the XML stream, or
unicode(data.find("title").text, errors='ignore')
to just ignore them.
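To see what those three choices actually do, here is a minimal sketch. It is written for Python 3, where the byte string has a decode() method that does the same job as unicode(...) above. The sample bytes are my own assumption (the word "It's" with a curly apostrophe, encoded as UTF-8), not the data from the API in question:

```python
# Hypothetical sample: "It's" with a right single quote (U+2019), as UTF-8 bytes.
raw = b"It\xe2\x80\x99s"

# Decoding with the correct encoding recovers the original character.
strict = raw.decode("utf-8")

# Decoding as ASCII with errors='replace' turns each undecodable byte
# into the replacement character U+FFFD.
replaced = raw.decode("ascii", errors="replace")

# Decoding as ASCII with errors='ignore' silently drops those bytes.
ignored = raw.decode("ascii", errors="ignore")

print(repr(strict))
print(repr(replaced))
print(repr(ignored))
```

Only the first call keeps the apostrophe. The other two lose information, which is why naming the right encoding is the better fix whenever you know it.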
Long version
============
You can't just say "turn these bytes into unicode" and expect it to
magically work. Remember, in Python 2, so-called "strings" are actually
strings of *bytes*, not characters. If you're a native English speaker,
you've probably never needed to care about the distinction, but it is real.
When you have a string "spam", what that *really* is is a sequence of
bytes 73 70 61 6D (in hexadecimal). By convention, Python uses the ASCII
encoding to map bytes to characters (e.g. hex 73 <=> "s"). That's not the
only choice, but it has been the conventional choice for so long that
people have forgotten that there are any other choices.
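You can check the "spam" example for yourself. This is just a small sketch (Python 3 syntax; the variable names are mine) that lists the bytes in hexadecimal and then maps them back to characters using ASCII:

```python
data = b"spam"

# Iterating over a bytearray yields the integer value of each byte.
hex_values = ["%02X" % b for b in bytearray(data)]
print(hex_values)

# Mapping the bytes back to characters requires naming an encoding.
text = data.decode("ascii")
print(text)
```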
The problem with ASCII is that it only knows how to deal with 128
different bytes, and 33 of those are invisible control characters.
The other 128 bytes don't mean anything in ASCII, and you can run into
problems trying to deal with them as text.
There are hundreds of thousands of useful characters in the world, and
only 128 ASCII ones. Prior to Unicode, people would choose their own
preferred set of 256 useful characters, and semi-arbitrarily assign them
to each of the 256 different bytes. Consequently there was a plethora of
ad hoc encodings where a byte like (say) 0xC4 might represent (say)
'Ä' on Windows computers used in northern and western Europe
'─' on computers in Greece
'ƒ' on Macintosh computers in Western Europe
'ń' on Macintoshes in Eastern Europe
and so forth. As you can imagine, exchanging files from one machine to
another was a nightmare. This is where Unicode comes in -- in theory,
there is a Unicode character for every useful character in any language
anywhere, including mathematical symbols, dingbats, ancient dead
languages, pictograms, and more.
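Here is a rough sketch of that one-byte, four-characters problem in Python 3. The codec names are my assumption about which standard Python codecs correspond to the four platforms listed above (cp1252 for Western European Windows, cp737 for Greek DOS, mac_roman and mac_latin2 for the two Macintosh sets):

```python
byte = b"\xC4"

# Assumed codec names for the four legacy character sets described above.
codec_names = ("cp1252", "cp737", "mac_roman", "mac_latin2")

# The same single byte decodes to a different character under each codec.
decoded = {name: byte.decode(name) for name in codec_names}

for name in codec_names:
    print(name, repr(decoded[name]))
```

Nothing about the byte itself tells you which of these was intended. That information has to come from somewhere else, such as the XML declaration or an HTTP header.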
BUT files on disk, and in memory, are in bytes, not characters. You need
some way to convert a character string into bytes, and back again. There
are many different ways of doing so, depending on whether you care about
making it as fast as possible, or as efficient as possible, or
compatible with some pre-Unicode character set. And this is where the
idea of encodings comes in. You can see a list of supported encodings here:
http://docs.python.org/library/codecs.html#standard-encodings
So the idea is, when you have a stream of bytes (say, from reading from
a disk), you have to *decode* those bytes into Unicode text, and to
write that text back again, you have to *encode* it to bytes.
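A small round trip makes the two directions concrete. This is a sketch in Python 3 with a sample word of my own choosing. Note that the same text produces different bytes depending on which encoding you pick:

```python
text = "na\u00efve"  # "naive" spelled with i-diaeresis (U+00EF)

# Encoding: text -> bytes. Different encodings give different byte strings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(repr(utf8_bytes), repr(latin1_bytes))

# Decoding: bytes -> text. You get the original back only if you decode
# with the same encoding that was used to encode.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding either fails or gives the wrong text.
try:
    latin1_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
```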
Now, Python tries to be very conservative: if you don't specify an
encoding, it assumes you want ASCII, the lowest common denominator
encoding that keeps English speakers happy. Lucky us. Until we have to
deal with one or more bytes which can't be decoded as ASCII:
>>> "\xC4".decode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
ordinal not in range(128)
Python isn't going to guess what character you want byte C4 to
represent. We've already seen there are at least four different choices.
You have to tell it which one you mean:
>>> print unicode("\xC4", encoding='macroman')
ƒ
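The same rule applies in the other direction, which is where the UnicodeEncodeError quoted at the top comes from: turning the character u'\u2019' into bytes with the ASCII codec cannot work. A sketch in Python 3, using a sample string of my own rather than the actual API data:

```python
text = "It\u2019s"  # hypothetical sample containing a right single quote

# ASCII has no byte for U+2019, so a strict encode raises an error.
error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
print(error)

# An encoding that covers the character works without loss.
utf8_bytes = text.encode("utf-8")

# errors='replace' keeps going but substitutes "?" for the character.
replaced = text.encode("ascii", errors="replace")
print(repr(utf8_bytes), repr(replaced))
```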
Must-read article:
http://www.joelonsoftware.com/articles/Unicode.html
--
Steven