Unicode (UTF8) in dbhash on 2.5

"Martin v. Löwis" martin at v.loewis.de
Tue Oct 21 22:39:52 CEST 2008

Paul Boddie wrote:
> On 20 Oct, 16:04, "Diez B. Roggisch" <de... at nospam.web.de> wrote:
>> What is the difference? The dbhash module can only work with *bytestrings*.
>> Bytestrings are just that - a sequence of 8-bit-values.
> Sounds like a prime candidate for some improvement work. Patches,
> anyone? ;-)

It's not possible to "fix" this - it isn't even broken. The *db modules,
by design, support storing arbitrary bytes, not just character data.
You can put images into them, or sound files, Java bytecode files, etc.
So if Python assumed that they had to be UTF-8 encoded character
strings, it would severely limit the usability of these modules.

For keys, things are slightly different from values - there is a higher
chance that the keys are indeed intended to be character strings.
However, in the bsddb btree format, any byte sequence that has a good
lexical order can be used as a key, and people do use the interface
that way (e.g. by putting an MD5 hash as the key, and the original data
as the value).
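
To illustrate the hash-as-key usage, here is a minimal sketch. It is written for current Python 3 rather than 2.5, and a plain dict stands in for an opened *db object; put_by_digest and is_utf8 are hypothetical helper names, not part of any module.

```python
import hashlib


def put_by_digest(db, data):
    """Store data under its MD5 digest; key and value are both raw bytes."""
    key = hashlib.md5(data).digest()  # 16 raw bytes, not character data
    db[key] = data
    return key


def is_utf8(data):
    """Return True if the byte string happens to be valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: a dict stands in for the database object here.
store = {}
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])  # arbitrary binary data
key = put_by_digest(store, payload)
```

The payload above is not valid UTF-8 (is_utf8(payload) is false), so a module that insisted on UTF-8 strings could not store it at all.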

It would be possible to put a layer on top of them which assumes that
either keys, values, or both are character strings, and that they are
UTF-8 encoded. However, such a package doesn't need to be part of the
standard library.
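
Such a layer could look roughly like the sketch below. It is only one possible design, again written for Python 3: UTF8Layer and demo are hypothetical names, the sketch assumes both keys and values are text, and dbm.dumb is used as the underlying byte store merely because it is always available.

```python
import dbm.dumb
import os
import tempfile
from collections.abc import MutableMapping


class UTF8Layer(MutableMapping):
    """Hypothetical wrapper: text at the interface, UTF-8 bytes underneath."""

    def __init__(self, db, encoding="utf-8"):
        self._db = db              # any mapping of bytes to bytes
        self._encoding = encoding

    def __getitem__(self, key):
        return self._db[key.encode(self._encoding)].decode(self._encoding)

    def __setitem__(self, key, value):
        self._db[key.encode(self._encoding)] = value.encode(self._encoding)

    def __delitem__(self, key):
        del self._db[key.encode(self._encoding)]

    def __iter__(self):
        return (k.decode(self._encoding) for k in self._db.keys())

    def __len__(self):
        return len(self._db)


def demo():
    """Write through the layer, then read back via the layer and raw."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = dbm.dumb.open(os.path.join(tmp, "example"), "c")
        try:
            text_db = UTF8Layer(raw)
            text_db["Löwis"] = "naïve café"
            return text_db["Löwis"], raw["Löwis".encode("utf-8")]
        finally:
            raw.close()
```

The underlying database still sees nothing but bytes, so applications that store binary data can keep using it directly without the wrapper.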

