[Python-Dev] I think my set module is ready for prime time; comments?

Jim Fulton jim@digicool.com
Wed, 24 Jan 2001 04:05:44 -0500


Christopher Petrilli wrote:
> 
> Neil Schemenauer [nas@arctrix.com] wrote:
> > On Tue, Jan 23, 2001 at 02:06:05PM -0500, Christopher Petrilli wrote:
> > > Unfortunately, for me, a Python implementation of Sets is only
> > > interesting academicaly.  Any time I've needed to work with them at a
> > > large scale, I've needed them *much* faster than Python could achieve
> > > without a C extension.
> >
> > I think this argues that if sets are added to the core they
> > should be implemented as an extension type with the speed of
> > dictionaries and the memory usage of lists.  Basicly, we would
> > use the implementation of PyDict but drop the values.
> 
> This is effectively the implementation that Zope has for Sets. 

Except we use sorted collections with binary search for sets.

I think that a simple hash-based set would make alot of sense.

> In
> addition we have "buckets" that have scores on them (which are
> implemented as a modified BTree).
> 
> Unfortunately Jim Fulton (who wrote all the code for that level) is in
> a meeting, but I hope he'll comment on the implementation that was
> chosen for our software.

We have a number of special needs:

  - Scalability is critical. We make some special opimizations, 
    like sets of integers and mapping objects with integer keys
    and values. In these cases, data are stored using C int arrays, 
    allowing very efficient data storage and manipulation, especially
    when using integer keys.

  - We need to spread data over multiple database records. Our data
    structures may be hundreds of megabytes in size. We have ZODB-aware
    structures that use multiple independently stored database objects.

  - Range searches are very common, and under some circomstances, 
    sorted collections and BTrees can have very little overhead
    compared to dictionaries. For this reason, out mapping objects
    and sets have been based on BTrees and sorted collections.

Unfortunately, our current BTree implementation has a flaw that
causes excessive number of objects to be updated when items are 
added and removed. (Each BTree internal node keeps track of the number
of objects contained in it.)  Also, out current sets are limited
to integers and cannot be spread over multiple database records.

We are completing a new BTree implementation that overcomes these 
limitations.  IN this implementation, we will provide sets as
value-less BTrees.

Jim

--
Jim Fulton           mailto:jim@digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org