[Python-Dev] I think my set module is ready for prime time;
Wed, 24 Jan 2001 15:46:10 +0100
Jim Fulton wrote:
> Christopher Petrilli wrote:
> > Neil Schemenauer [email@example.com] wrote:
> > > On Tue, Jan 23, 2001 at 02:06:05PM -0500, Christopher Petrilli wrote:
> > > > Unfortunately, for me, a Python implementation of Sets is only
> > > > interesting academicaly. Any time I've needed to work with them at a
> > > > large scale, I've needed them *much* faster than Python could achieve
> > > > without a C extension.
> > >
> > > I think this argues that if sets are added to the core they
> > > should be implemented as an extension type with the speed of
> > > dictionaries and the memory usage of lists. Basicly, we would
> > > use the implementation of PyDict but drop the values.
> > This is effectively the implementation that Zope has for Sets.
> Except we use sorted collections with binary search for sets.
> I think that a simple hash-based set would make alot of sense.
> > In
> > addition we have "buckets" that have scores on them (which are
> > implemented as a modified BTree).
> > Unfortunately Jim Fulton (who wrote all the code for that level) is in
> > a meeting, but I hope he'll comment on the implementation that was
> > chosen for our software.
> We have a number of special needs:
> - Scalability is critical. We make some special opimizations,
> like sets of integers and mapping objects with integer keys
> and values. In these cases, data are stored using C int arrays,
> allowing very efficient data storage and manipulation, especially
> when using integer keys.
> - We need to spread data over multiple database records. Our data
> structures may be hundreds of megabytes in size. We have ZODB-aware
> structures that use multiple independently stored database objects.
> - Range searches are very common, and under some circomstances,
> sorted collections and BTrees can have very little overhead
> compared to dictionaries. For this reason, out mapping objects
> and sets have been based on BTrees and sorted collections.
> Unfortunately, our current BTree implementation has a flaw that
> causes excessive number of objects to be updated when items are
> added and removed. (Each BTree internal node keeps track of the number
> of objects contained in it.) Also, out current sets are limited
> to integers and cannot be spread over multiple database records.
> We are completing a new BTree implementation that overcomes these
> limitations. IN this implementation, we will provide sets as
> value-less BTrees.
You may want to check out a soon to be released new mx
package: mxBeeBase. This is an on-disk b+tree implementation
which supports data files up to 2GB on 32-bit platforms.
Here's a preview:
(The links on that page are not functional.)
Python Pages: http://www.lemburg.com/python/