[Python-Dev] I think my set module is ready for prime time;
comments?
M.-A. Lemburg
mal@lemburg.com
Wed, 24 Jan 2001 15:46:10 +0100
Jim Fulton wrote:
>
> Christopher Petrilli wrote:
> >
> > Neil Schemenauer [nas@arctrix.com] wrote:
> > > On Tue, Jan 23, 2001 at 02:06:05PM -0500, Christopher Petrilli wrote:
> > > > Unfortunately, for me, a Python implementation of Sets is only
> > > > interesting academicaly. Any time I've needed to work with them at a
> > > > large scale, I've needed them *much* faster than Python could achieve
> > > > without a C extension.
> > >
> > > I think this argues that if sets are added to the core they
> > > should be implemented as an extension type with the speed of
> > > dictionaries and the memory usage of lists. Basicly, we would
> > > use the implementation of PyDict but drop the values.
> >
> > This is effectively the implementation that Zope has for Sets.
>
> Except we use sorted collections with binary search for sets.
>
> I think that a simple hash-based set would make alot of sense.
>
> > In
> > addition we have "buckets" that have scores on them (which are
> > implemented as a modified BTree).
> >
> > Unfortunately Jim Fulton (who wrote all the code for that level) is in
> > a meeting, but I hope he'll comment on the implementation that was
> > chosen for our software.
>
> We have a number of special needs:
>
> - Scalability is critical. We make some special opimizations,
> like sets of integers and mapping objects with integer keys
> and values. In these cases, data are stored using C int arrays,
> allowing very efficient data storage and manipulation, especially
> when using integer keys.
>
> - We need to spread data over multiple database records. Our data
> structures may be hundreds of megabytes in size. We have ZODB-aware
> structures that use multiple independently stored database objects.
>
> - Range searches are very common, and under some circomstances,
> sorted collections and BTrees can have very little overhead
> compared to dictionaries. For this reason, out mapping objects
> and sets have been based on BTrees and sorted collections.
>
> Unfortunately, our current BTree implementation has a flaw that
> causes excessive number of objects to be updated when items are
> added and removed. (Each BTree internal node keeps track of the number
> of objects contained in it.) Also, out current sets are limited
> to integers and cannot be spread over multiple database records.
>
> We are completing a new BTree implementation that overcomes these
> limitations. IN this implementation, we will provide sets as
> value-less BTrees.
You may want to check out a soon to be released new mx
package: mxBeeBase. This is an on-disk b+tree implementation
which supports data files up to 2GB on 32-bit platforms.
Here's a preview:
http://www.lemburg.com/python/mxBeeBase.html
(The links on that page are not functional.)
--
Marc-Andre Lemburg
______________________________________________________________________
Company: http://www.egenix.com/
Consulting: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/