[Tutor] ascii codec cannot encode character

Fri Jan 28 04:02:25 CET 2011

On 1/27/11, Steven D'Aprano <steve at pearwood.info> wrote:
> Alex Hall wrote:
>> Hello again:
>> I have never seen this message before. I am pulling xml from a site's
>> api and printing it, testing the wrapper I am writing for the api. I
>> have never seen this error until just now, in the twelfth result of my
>> search:
>> UnicodeEncodeError: 'ASCII' codec can't encode character u'\u2019' in
>> position 42: ordinal not in range(128)
>>
>> I tried making the strings Unicode by saying something like
>> self.title=unicode(data.find("title").text)
>> but the same error appeared. I found the manual chapter on this, but I
>> am not sure I want to ignore since I do not know what this character
>> (or others) might mean in the string. I am not clear on what 'replace'
>> will do. Any suggestions?
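
For reference, the error can be reproduced without any XML, which also shows what 'replace' and 'ignore' do when encoding. This is only a sketch: it assumes Python 3 (the thread uses Python 2, where the string would be a u'...' literal), and the sample title is made up.

```python
# Hypothetical title containing U+2019 (right single quotation mark),
# the character named in the traceback.
title = "It\u2019s here"

try:
    title.encode("ascii")
except UnicodeEncodeError as exc:
    failure = str(exc)

# errors='replace' substitutes '?' for each character ASCII cannot hold.
replaced = title.encode("ascii", errors="replace")

# errors='ignore' silently drops each such character.
ignored = title.encode("ascii", errors="ignore")

print(failure)
print(replaced)  # b'It?s here'
print(ignored)   # b'Its here'
```

Either handler avoids the traceback, but both lose the original character.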
>
> Short version
> =============
>
> You need to decode the bytes you get from the XML into unicode
> characters. You would do this using something like:
>
> unicode(data.find("title").text, encoding='utf-8')
>
> If that doesn't work, change utf-8 to another encoding. If the XML file
> tells you what the encoding should be, use that.
>
> Alternatively, you could say:
>
> unicode(data.find("title").text, errors='replace')
>
> to substitute a "missing character" glyph for any undecodable bytes in
> the XML stream, or
>
> unicode(data.find("title").text, errors='ignore')
>
> to just ignore them.
I tried both of those and got a different error. I have since fixed it
so I no longer have the exact text, but it was something about not
supporting conversion from unicode. I finally ended up doing this:
self.title=data.find("title").text.encode("utf-8")
and it seems happy enough, though I get odd characters above 128. I
suppose it is better than a traceback, and I suspect I just have the
wrong character set. Still, I found it very odd that unicode(string,
errors='replace') threw an exception.
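
One plausible explanation, offered as an assumption because the exact traceback is not in the thread: the .text value was already a Unicode string, and asking Python to decode something that is already decoded raises a TypeError whatever the errors argument says. The sketch below shows the same situation in Python 3, where str plays the role of Python 2's unicode; the sample value is made up.

```python
# Hypothetical value standing in for data.find("title").text,
# assumed here to be text already, not bytes.
text = "It\u2019s here"

try:
    str(text, errors="replace")  # asks Python to decode a str
except TypeError as exc:
    message = str(exc)

# Encoding goes the other way (text -> bytes) and succeeds.
encoded = text.encode("utf-8")

# Reading those UTF-8 bytes with the wrong encoding gives odd characters.
misread = encoded.decode("cp1252")

print(message)
print(encoded)
print(ascii(misread))  # ascii() keeps the output safe on any terminal
```

If this is what happened, the odd characters come from UTF-8 bytes being displayed by something that expects a different encoding, not from the data itself being wrong.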
>
>
> Long version
> ============
>
> You can't just say "turn these bytes into unicode" and expect it to
> magically work. Remember, in Python 2, so-called "strings" are actually
> strings of *bytes*, not characters. If you're a native English speaker,
> you've probably never needed to care about the distinction, but it is real.
>
> When you have a string "spam", what that *really* is is a sequence of
> bytes 73 70 61 6D (in hexadecimal). By convention, Python uses the ASCII
> encoding to map bytes to characters (e.g. hex 73 <=> "s"). That's not the
> only choice, but it has been the conventional choice for so long that
> people have forgotten that there are any other choices.
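
The byte values quoted above are easy to check. A minimal sketch, assuming Python 3, where text (str) and bytes are separate types:

```python
# Encode the text "spam" to bytes using ASCII, then inspect each byte.
data = "spam".encode("ascii")

# Iterating over bytes yields integers; format each as two hex digits.
hex_bytes = [format(b, "02X") for b in data]
print(hex_bytes)  # ['73', '70', '61', '6D']

# The mapping works in both directions: hex 73 <=> "s".
assert chr(0x73) == "s"
assert ord("s") == 0x73
```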
>
> The problem with ASCII is that it only knows how to deal with 128
> different byte values, and 33 of those are invisible control characters.
> The other 128 byte values don't mean anything in ASCII, and you can run
> into problems trying to deal with them as text.
>
> There are hundreds of thousands of useful characters in the world, and
> only 128 ASCII ones. Prior to Unicode, people would choose their own
> preferred set of 256 useful characters, and semi-arbitrarily assign them
> to each of the 256 different bytes. Consequently there was a plethora of
> ad hoc encodings where a byte like (say) \xC4 might represent (say)
>
> 'Ä' on Windows computers used in northern and western Europe
> '─' on computers in Greece
> 'ƒ' on Macintosh computers in Western Europe
> 'ń' on Macintoshes in Eastern Europe
>
> and so forth. As you can imagine, exchanging files from one machine to
> another was a nightmare. This is where Unicode comes in -- in theory,
> there is a Unicode character for every useful character in any language
> anywhere, including mathematical symbols, dingbats, ancient dead
> languages, pictograms, and more.
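
The four readings of byte C4 listed above can be checked with Python's bundled codecs. This is a sketch, assuming Python 3. The codec names are Python's own, but matching them to the machines described is an assumption, in particular cp737 (the Greek DOS code page) for "computers in Greece".

```python
raw = b"\xC4"

# Assumed codec for each platform mentioned in the list above.
codecs_to_try = ["cp1252", "cp737", "mac_roman", "mac_latin2"]

# Same byte, four different characters.
decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, char in decoded.items():
    # ascii() shows the code point and prints safely on any terminal.
    print(name, ascii(char))
```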
>
> BUT files on disk, and in memory, are in bytes, not characters. You need
> some way to convert a character string into bytes, and back again. There
> are many different ways of doing so, depending on whether you care about
> making it as fast as possible, or as efficient as possible, or
> compatible with some pre-Unicode character set. And this is where the
> idea of encodings comes in. You can see a list of supported encodings here:
>
> http://docs.python.org/library/codecs.html#standard-encodings
>
> So the idea is, when you have a stream of bytes (say, from reading from
> a disk), you have to *decode* those bytes into Unicode text, and to
> write that text back again, you have to *encode* it to bytes.
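
The encode/decode round trip can be sketched with the character from the original traceback. This assumes Python 3; the choice of UTF-8 and UTF-16 as the two encodings is just for illustration.

```python
text = "\u2019"  # the character from the original traceback

# Encode: text -> bytes. The same character yields different bytes
# under different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decode: bytes -> text, using the same encoding that produced them.
round_trip = utf8_bytes.decode("utf-8")

print(utf8_bytes)  # b'\xe2\x80\x99'
print(len(utf8_bytes), len(utf16_bytes))  # 3 2
```

One character becomes three bytes in UTF-8 and two in UTF-16, which is why the encoding has to be known before the bytes can be read back.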
>
> Now, Python tries to be very conservative: if you don't specify an
> encoding, it assumes you want ASCII, the lowest common denominator
> encoding that keeps English speakers happy. Lucky us. Until we have to
> deal with one or more bytes which can't be decoded as ASCII:
>
>  >>> "\xC4".decode('ascii')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
> ordinal not in range(128)
>
> Python isn't going to guess what character you want byte C4 to
> represent. We've already seen there are at least four different choices.
> You have to tell it which one you mean:
>
>  >>> print unicode("\xC4", encoding='macroman')
> ƒ
>
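
The choices on offer when a byte cannot be decoded can be put side by side. A minimal sketch, assuming Python 3; the sample bytes are made up.

```python
raw = b"caf\xC4"  # made-up bytes; the last one is outside ASCII

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    failure = str(exc)

# 'replace' inserts U+FFFD (the replacement character) for each bad byte.
replaced = raw.decode("ascii", errors="replace")

# 'ignore' drops the bad byte entirely.
ignored = raw.decode("ascii", errors="ignore")

# Naming an encoding that defines the byte is the only option that
# yields a real character (here the Mac Roman reading of C4).
as_macroman = raw.decode("mac_roman")

print(failure)
print(ascii(replaced), ascii(ignored), ascii(as_macroman))
```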
>
> Must-read article:
> http://www.joelonsoftware.com/articles/Unicode.html

A very interesting explanation! Thanks.
>
> --
> Steven


-- 
Have a great day,
Alex (msg sent from GMail website)
mehgcap at gmail.com; http://www.facebook.com/mehgcap