Unicode Question

"Martin v. Löwis" martin at v.loewis.de
Mon Jan 9 20:12:51 EST 2006


David Pratt wrote:
> I want to prepare strings for db storage that come from a normal Windows
> machine (cp1252), so my understanding is to unicode and encode to utf-8
> and to store properly.

That also depends on the database. The database must accept
UTF-8-encoded strings, and must not modify them in any form or way.
Some databases fail here, and work better if you pass Unicode objects
to them directly.
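
As a rough sketch of the second approach, assuming Python 3 and the
standard sqlite3 module (chosen only for illustration; your database and
driver may behave differently), you hand the driver a Unicode string and
let it deal with the storage encoding:

```python
import sqlite3

# In-memory database, used here purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

# Pass the Unicode string itself, not UTF-8 bytes; the driver handles
# the conversion to whatever the database stores internally.
text = "\xbe"  # VULGAR FRACTION THREE QUARTERS
conn.execute("INSERT INTO notes (body) VALUES (?)", (text,))

# Reading the value back yields a Unicode string again.
stored = conn.execute("SELECT body FROM notes").fetchone()[0]
conn.close()
```

The table name "notes" and column name "body" are made up for this
example.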

> Since data will be used on the web I would not
> have to change my encoding when extracting from the database. This first
> example I believe simulates this with the 3/4 symbol. Here I want to
> store '\xc2\xbe' in my database.
> 
> >>> tq = u'\xbe'

You can verify that this is really 3/4:

py> import unicodedata
py> unicodedata.name(u"\xbe")
'VULGAR FRACTION THREE QUARTERS'

> >>> tq_utf = tq.encode('utf8')
> >>> tq, tq_utf
> (u'\xbe', '\xc2\xbe')

So it should be clear now that '\xc2\xbe' is the UTF-8 encoding
of that character.
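
The two bytes follow from the UTF-8 rule for code points U+0080 through
U+07FF, which are written as 110xxxxx 10xxxxxx. A short sketch of that
arithmetic, assuming Python 3 (the variable names are mine):

```python
code_point = 0xBE  # U+00BE, VULGAR FRACTION THREE QUARTERS

# Lead byte: the marker bits 110 followed by the upper bits of the code point.
lead = 0xC0 | (code_point >> 6)
# Trail byte: the marker bits 10 followed by the lower six bits.
trail = 0x80 | (code_point & 0x3F)

manual = bytes([lead, trail])
builtin = "\xbe".encode("utf-8")
```

Both manual and builtin come out as the byte pair C2 BE.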

> To unicode with a variable, my understanding is that I can unicode and
> encode at the same time

Not sure what you mean by "same time" (I'm not even sure what
"I can unicode" means - unicode is not a verb, it's a noun).

> >>> tq = '\xbe'
> >>> tq_utf = unicode(tq, 'utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 0:
> unexpected code byte
> 
> This is not working for me. Can someone explain why. Many thanks.

Of course not. The UTF-8 encoding of the character, as we have seen
earlier, is '\xc2\xbe'. A lone '\xbe' is a continuation byte, which
cannot start a UTF-8 sequence. So you should write

py> unicode('\xc2\xbe', 'utf-8')
u'\xbe'
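
The same contrast in Python 3, where bytes.decode plays the role of
unicode(). This is a sketch only, and the exact wording of the error
message differs between Python versions:

```python
# The single byte 0xBE is not valid UTF-8 by itself, so decoding fails.
failure = None
try:
    b"\xbe".decode("utf-8")
except UnicodeDecodeError as exc:
    failure = exc  # exc.start holds the offset of the offending byte

# The complete two-byte sequence decodes to the intended character.
decoded = b"\xc2\xbe".decode("utf-8")
```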

You mentioned windows-1252 at some point. If you are given windows-1252
bytes, you can do

py> unicode('\xbe', 'windows-1252')
u'\xbe'

If you are looking for "at the same time", perhaps this is also
interesting:

py> unicode('\xbe', 'windows-1252').encode('utf-8')
'\xc2\xbe'
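
If you do this in many places, the two steps can be wrapped in a small
helper. A sketch assuming Python 3; the name cp1252_to_utf8 is
hypothetical, not something from the standard library:

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode windows-1252 bytes as UTF-8 bytes."""
    # Step 1: decode the windows-1252 bytes to a Unicode string.
    text = raw.decode("windows-1252")
    # Step 2: encode that string as UTF-8.
    return text.encode("utf-8")


three_quarters = cp1252_to_utf8(b"\xbe")
```

Keeping the decode and encode steps separate inside the helper makes it
obvious which encoding applies to the input and which to the output.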

Regards,
Martin


