[Python-Dev] I think my set module is ready for prime time; comments?

M.-A. Lemburg mal@lemburg.com
Wed, 24 Jan 2001 15:46:10 +0100

Jim Fulton wrote:
> Christopher Petrilli wrote:
> >
> > Neil Schemenauer [nas@arctrix.com] wrote:
> > > On Tue, Jan 23, 2001 at 02:06:05PM -0500, Christopher Petrilli wrote:
> > > > Unfortunately, for me, a Python implementation of Sets is only
> > > > interesting academicaly.  Any time I've needed to work with them at a
> > > > large scale, I've needed them *much* faster than Python could achieve
> > > > without a C extension.
> > >
> > > I think this argues that if sets are added to the core they
> > > should be implemented as an extension type with the speed of
> > > dictionaries and the memory usage of lists.  Basicly, we would
> > > use the implementation of PyDict but drop the values.
> >
> > This is effectively the implementation that Zope has for Sets.
> Except we use sorted collections with binary search for sets.
> I think that a simple hash-based set would make alot of sense.
> > In
> > addition we have "buckets" that have scores on them (which are
> > implemented as a modified BTree).
> >
> > Unfortunately Jim Fulton (who wrote all the code for that level) is in
> > a meeting, but I hope he'll comment on the implementation that was
> > chosen for our software.
> We have a number of special needs:
>   - Scalability is critical. We make some special opimizations,
>     like sets of integers and mapping objects with integer keys
>     and values. In these cases, data are stored using C int arrays,
>     allowing very efficient data storage and manipulation, especially
>     when using integer keys.
>   - We need to spread data over multiple database records. Our data
>     structures may be hundreds of megabytes in size. We have ZODB-aware
>     structures that use multiple independently stored database objects.
>   - Range searches are very common, and under some circomstances,
>     sorted collections and BTrees can have very little overhead
>     compared to dictionaries. For this reason, out mapping objects
>     and sets have been based on BTrees and sorted collections.
> Unfortunately, our current BTree implementation has a flaw that
> causes excessive number of objects to be updated when items are
> added and removed. (Each BTree internal node keeps track of the number
> of objects contained in it.)  Also, out current sets are limited
> to integers and cannot be spread over multiple database records.
> We are completing a new BTree implementation that overcomes these
> limitations.  IN this implementation, we will provide sets as
> value-less BTrees.

You may want to check out a soon to be released new mx
package: mxBeeBase. This is an on-disk b+tree implementation
which supports data files up to 2GB on 32-bit platforms.

Here's a preview:


(The links on that page are not functional.)

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/