Unicode (UTF8) in dbhash on 2.5

Diez B. Roggisch deets at nospam.web.de
Mon Oct 20 10:04:50 EDT 2008


Yves Dorfsman wrote:

> Can you put UTF-8 characters in a dbhash in python 2.5 ?
> It fails when I try:
> 
>     #!/bin/env python
>     # -*- coding: utf-8 -*-
>     
>     import dbhash
>     
>     db = dbhash.open('dbfile.db', 'w')
>     db[u'smiley'] = u'☺'
>     db.close()
> 
> Do I need to change the bsd db library, or there is no way to make it work
> with python 2.5 ?

Please write the following program and meditate at least 30min in front of
it:

while True:
   print "utf-8 is not unicode"

Once this seemingly minor detail has sunk in, you are ready to work with
the variant below, which will work:

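#!/usr/bin/env python
# -*- coding: utf-8 -*-
import dbhash

# 'c' creates the file if it does not exist yet; 'w' requires an existing one
db = dbhash.open('dbfile.db', 'c')
# dbhash takes bytestrings only, so encode both key and value
db[u'smiley'.encode('utf-8')] = u'☺'.encode('utf-8')
db.close()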


What is the difference? The dbhash module can only work with *bytestrings*.
Bytestrings are just that - a sequence of 8-bit-values.

u""-literals are *unicode objects*. These are an abstract sequence of
characters, smileys or others.
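
To make the distinction concrete, here is a small sketch that is not part of the original example. It assumes an interpreter that accepts the u'' prefix (Python 2, or 3.3 and later) and spells the smiley as the escape u'\u263a' so it does not depend on the encoding of the source file. The variable names are made up for illustration.

```python
# One abstract character: WHITE SMILING FACE, code point U+263A.
smiley = u'\u263a'

# Encoding turns that character into a concrete sequence of bytes.
encoded = smiley.encode('utf-8')

char_count = len(smiley)    # counts characters
byte_count = len(encoded)   # counts 8-bit values

print(char_count)
print(byte_count)
```

The unicode object holds one character, while its UTF-8 bytestring holds three bytes.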

Now the real world of databases, network connections and hard drives doesn't
know about unicode. They only know bytes. So before you can write to them,
you need to "encode" the unicode data to a byte-stream representation.
There are quite a few of these, e.g. latin1, or the aforementioned UTF-8,
which has the property that it can represent *all* unicode characters,
potentially needing more than one byte per character.
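
A rough illustration of how the two encodings differ, again using escapes instead of literal characters. The sample characters and variable names are my own choice, not something from the question.

```python
e_acute = u'\u00e9'   # LATIN SMALL LETTER E WITH ACUTE
smiley = u'\u263a'    # WHITE SMILING FACE

latin1_bytes = e_acute.encode('latin-1')  # one byte in latin1
utf8_bytes = e_acute.encode('utf-8')      # two bytes in UTF-8

# latin1 only covers code points 0-255, so it cannot encode the smiley.
try:
    smiley.encode('latin-1')
    latin1_can_encode_smiley = True
except UnicodeEncodeError:
    latin1_can_encode_smiley = False

smiley_utf8_len = len(smiley.encode('utf-8'))  # three bytes in UTF-8
```

So the same character can come out as different bytes depending on the encoding, and some encodings cannot express a given character at all.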

Which is why the code above has those encode-calls on the unicode-objects.

But beware! Once you have encoded the data, there is no way to *know* its
encoding from the bytes alone. So when reading the data, you will get
*bytestrings*, and you need to "decode" them with the proper encoding. In
this case, again utf-8.
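
What happens if you guess the encoding wrong can be sketched like this (illustrative names, and latin1 picked only as an example of a wrong guess):

```python
smiley = u'\u263a'
stored = smiley.encode('utf-8')    # the bytes that end up in the database

right = stored.decode('utf-8')     # the original character comes back
wrong = stored.decode('latin-1')   # no error, but three unrelated characters

print(right == smiley)
print(len(wrong))
```

Note that the wrong decode does not fail. latin1 maps every byte to some character, so you silently get garbage instead of an exception.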

Which brings us to the second part of the program:

db = dbhash.open('dbfile.db')
smiley = db[u'smiley'.encode('utf-8')].decode('utf-8')

print smiley.encode('utf-8')


The last encode is there to print out the smiley on a terminal - one of
those pesky bytestream-eaters that don't know about unicode.
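
If you get tired of writing the encode and decode calls by hand, one possible approach is a thin wrapper that does them for you. This is only a sketch: Utf8Mapping is a hypothetical helper, not something dbhash provides, and a plain dict stands in for the database here so the snippet runs without a db file.

```python
class Utf8Mapping(object):
    """Hypothetical helper: unicode in and out, UTF-8 bytestrings in storage."""

    def __init__(self, store):
        # store: any mapping that takes bytestring keys and values,
        # for example the object returned by dbhash.open()
        self.store = store

    def __setitem__(self, key, value):
        # encode on the way in
        self.store[key.encode('utf-8')] = value.encode('utf-8')

    def __getitem__(self, key):
        # decode on the way out
        return self.store[key.encode('utf-8')].decode('utf-8')


backing = {}               # stand-in for a dbhash database
db = Utf8Mapping(backing)
db[u'smiley'] = u'\u263a'
roundtrip = db[u'smiley']
```

The backing store only ever sees bytestrings, while the rest of the program deals in unicode objects.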

Diez


