[Python-3000] Immutable bytes type and dbm modules

Tue Aug 7 04:47:37 CEST 2007

On 6-Aug-07, at 7:06 PM, Guido van Rossum wrote:

> On 8/6/07, Mike Klaas <mike.klaas at gmail.com> wrote:

>> For instance, it is quite common to use integers as keys.  If you are
>> inserting keys in order, it is about a hundred times faster to encode
>> the ints in big-endian byte order than than little-endian:
>
> I'm assuming that this speed difference says something about the
> implementation of the underlying dbm package. Which package did you
> use to measure this?

This is true for bsddb backed by Berkeley DB, but it should be true  
to some extent in any btree-based database.

btrees are much more efficiently-constructed if built in key-order  
(especially when they don't fit in memory), and the difference stems  
purely from the nature of the representation: the little-endian byte  
representation of sorted integers is no longer sorted.  The big- 
endian representation preserves the sort order.

>> class MyIntDB(object):
>>         def __setitem__(self, key, item):
>>                self.db.put(struct.pack('>Q', key), serializer(item))
>>          def __getitem__(self, key):
>>                return unserializer(self.db.get(struct.pack('>Q',  
>> key)))
>>
>> How do you envision these types of tasks being accomplished with
>> unicode keys?  It is conceivable that one could write a custom
>> unicode encoding that accomplishes this, convert the key to unicode,
>> and pass the custom encoding name to the constructor.
>
> Well, the *easiest* (I don't know about simplest) way to use ints as
> keys is of course to use the decimal representation. You'd use
> str(key) instead of struct.pack(). This would of course not maintain
> key order -- is that important? If you need to be compatible with
> struct.pack(), and we were to choose Unicode strings for the keys in
> the API, then you might have to do something like
> struct.pack(...).encode("latin-1") and specify latin-1 as the
> database's key encoding.

The decimal representation would work if it were left-padded  
appropriately, though it would be somewhat space-inefficient.  The  
second option you propose is likely the most feasible.

> Of course this may not be compatible with an external constraint (e.g.
> another application that already has a key format) but in that case
> you may have to use arbitrary tricks anyway (the latin-1 encoding
> might still be helpful).
>
> However, I give you that a pure bytes API would be more convenient  
> at times.
>
> How about we define two APIs, using raw bytes and one using strings +
> a given encoding?
>

> Or perhaps a special value of the encoding argument passed to
> *dbm.open() (maybe None, maybe the default, maybe "raw" or "bytes"?)
> to specify that the key values are to be bytes?

Either option sounds fine, but you're still left with the need to  
implement the raw bytes version in dumbdbm.

ISTM that this issue boils down to the question of if and how byte  
sequences can be hashed in py3k.  Assuming you are trying to  
implement a file parser that dispatches to various methods based on  
binary data, what is the pythonic way to do this in py3k?

One option is to .decode('latin-1') and dispatch on the (meaningless)  
text.  Another is for something like str8 to be kept around.  Yet  
another is to use non-hash-based datastructures (like trees) to  
implement these algorithms.

-Mike