[Tutor] ascii codec cannot encode character

Fri Jan 28 03:25:36 CET 2011

Alex Hall wrote:
> Hello again:
> I have never seen this message before. I am pulling xml from a site's
> api and printing it, testing the wrapper I am writing for the api. I
> have never seen this error until just now, in the twelfth result of my
> search:
> UnicodeEncodeError: 'ASCII' codec can't encode character u'\u2019' in
> position 42: ordinal not in range(128)
> 
> I tried making the strings Unicode by saying something like
> self.title=unicode(data.find("title").text)
> but the same error appeared. I found the manual chapter on this, but I
> am not sure I want to ignore since I do not know what this character
> (or others) might mean in the string. I am not clear on what 'replace'
> will do. Any suggestions?

Short version
=============

You need to decode the bytes you get from the XML into Unicode
characters. You would do this using something like:

unicode(data.find("title").text, encoding='utf-8')

If that doesn't work, change utf-8 to another encoding. If the XML file 
tells you what the encoding should be, use that.

Alternatively, you could say:

unicode(data.find("title").text, errors='replace')

to substitute a "missing character" glyph for any undecodable bytes in 
the XML stream, or

unicode(data.find("title").text, errors='ignore')

to just ignore them.
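
To make the three options concrete, here is a small sketch. The `raw`
value is a made-up stand-in for the XML text: it is the UTF-8 encoding of
"It’s", using the same u'\u2019' character named in the error message. It
uses the .decode() method, which does the same job as the unicode() calls
above, and should run on Python 2.6+ as well as Python 3:

```python
# Hypothetical sample data: the UTF-8 bytes for "It" + U+2019 + "s".
raw = b"It\xe2\x80\x99s"

# Correct encoding: the three bytes E2 80 99 become a single character.
decoded = raw.decode("utf-8")

# Wrong encoding (ASCII), combined with the two error handlers.
replaced = raw.decode("ascii", "replace")  # each bad byte becomes U+FFFD
ignored = raw.decode("ascii", "ignore")    # bad bytes are silently dropped

print(repr(decoded))
print(repr(replaced))
print(repr(ignored))
```

Note that 'replace' and 'ignore' both lose information: afterwards there
is no way to recover which character was originally there.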

Long version
============

You can't just say "turn these bytes into unicode" and expect it to 
magically work. Remember, in Python 2, so-called "strings" are actually 
strings of *bytes*, not characters. If you're a native English speaker, 
you've probably never needed to care about the distinction, but it is real.

When you have a string "spam", what that *really* is is a sequence of
bytes 73 70 61 6D (in hexadecimal). By convention, Python uses the ASCII
encoding to map bytes to characters (e.g. hex 73 <=> "s"). That's not the
only choice, but it has been the conventional choice for so long that
people have forgotten that there are any other choices.
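
You can check the byte values for yourself. This is only a sketch, and it
uses bytearray so that it behaves the same on Python 2.6+ and Python 3:

```python
# A bytearray yields plain integers when iterated, on Python 2 and 3 alike.
data = bytearray(b"spam")

# Format each byte as two hexadecimal digits.
hex_values = ["%02X" % byte for byte in data]
print(" ".join(hex_values))

# Decoding with the ASCII codec maps each byte back to a character.
text = bytes(data).decode("ascii")
print(text)
```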

The problem with ASCII is that it only knows how to deal with 128
different byte values (0 to 127), and 33 of those are invisible control
characters. The other 128 byte values don't mean anything in ASCII, and
you can run into problems trying to deal with them as text.

There are hundreds of thousands of useful characters in the world, and 
only 128 ASCII ones. Prior to Unicode, people would choose their own 
preferred set of 256 useful characters, and semi-arbitrarily assign them 
to each of the 256 different bytes. Consequently there was a plethora of
ad hoc encodings where a byte like (say) 0xC4 might represent (say)

'Ä' on Windows computers used in northern and western Europe
'─' on computers in Greece
'ƒ' on Macintosh computers in Western Europe
'ń' on Macintoshes in Eastern Europe

and so forth. As you can imagine, exchanging files from one machine to 
another was a nightmare. This is where Unicode comes in -- in theory, 
there is a Unicode character for every useful character in any language 
anywhere, including mathematical symbols, dingbats, ancient dead 
languages, pictograms, and more.

BUT files on disk, and in memory, are in bytes, not characters. You need 
some way to convert a character string into bytes, and back again. There 
are many different ways of doing so, depending on whether you care about 
making it as fast as possible, or as efficient as possible, or 
compatible with some pre-Unicode character set. And this is where the
idea of encodings comes in. You can see a list of supported encodings here:

http://docs.python.org/library/codecs.html#standard-encodings

So the idea is, when you have a stream of bytes (say, from reading from 
a disk), you have to *decode* those bytes into Unicode text, and to 
write that text back again, you have to *encode* it to bytes.
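
As a rough illustration of that round trip, here is a sketch using a
made-up sample string (u"caf\xe9", i.e. "café"). It shows that the same
text becomes different bytes under different encodings, and that decoding
with the wrong one gives garbage rather than an error:

```python
# Hypothetical sample text: "caf" followed by U+00E9 (e with acute accent).
text = u"caf\xe9"

# Encoding: one piece of text, three different byte sequences.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
as_utf16 = text.encode("utf-16-le")

# Decoding with the encoding that was used to encode gets the text back.
round_trip = as_utf8.decode("utf-8")

# Decoding with a different encoding "works", but yields the wrong text.
wrong = as_utf8.decode("latin-1")

for label, value in [("utf-8", as_utf8), ("latin-1", as_latin1),
                     ("utf-16-le", as_utf16)]:
    print("%-10s %r" % (label, value))
print(repr(round_trip))
print(repr(wrong))
```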

Now, Python tries to be very conservative: if you don't specify an 
encoding, it assumes you want ASCII, the lowest common denominator 
encoding that keeps English speakers happy. Lucky us. Until we have to
deal with one or more bytes which can't be decoded as ASCII:

>>> "\xC4".decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

Python isn't going to guess what character you want byte C4 to 
represent. We've already seen there are at least four different choices. 
You have to tell it which one you mean:

>>> print unicode("\xC4", encoding='macroman')
ƒ
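
The error in the original question is the mirror image of this: a
UnicodeEncodeError is raised when Unicode text is *encoded* to ASCII, which
can happen implicitly, for example when printing. The sketch below uses a
made-up title containing the same u'\u2019' character, and shows what the
'replace' and 'ignore' handlers do in that direction. It should run on
Python 2.6+ and Python 3:

```python
# Hypothetical title containing U+2019 (right single quotation mark).
title = u"It\u2019s"

# Encoding to ASCII fails, because U+2019 has no ASCII equivalent.
try:
    title.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# The error handlers trade correctness for not raising an exception.
replaced = title.encode("ascii", "replace")  # unencodable char becomes "?"
ignored = title.encode("ascii", "ignore")    # unencodable char is dropped

# An encoding that can represent the character loses nothing.
as_utf8 = title.encode("utf-8")

print(failed)
print(repr(replaced))
print(repr(ignored))
print(repr(as_utf8))
```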

Must-read article:
http://www.joelonsoftware.com/articles/Unicode.html

-- 
Steven