[Python-Dev] When do sets shrink?

Josiah Carlson jcarlson at uci.edu
Sun Jan 1 01:42:19 CET 2006


Noam Raphael <noamraph at gmail.com> wrote:
> 
> On 12/31/05, Raymond Hettinger <raymond.hettinger at verizon.net> wrote:
> > [Noam]
> > > For example, iteration over a set which once had
> > > 1,000,000 members and now has 2 can take 1,000,000 operations every
> > > time you traverse all the (2) elements.
> >
> > Do you find that to be a common or plausible use case?
> 
> I don't have a concrete example in this minute, but a set which is
> repeatedly filled with elements and then emptied by pop operations
> doesn't seem to me that far-fetched.

It doesn't seem far-fetched, but I've not seen anything like it.  List
appending and popping, yeah, set differences and intersections and
unions, yeah, but not set insertion then removal for large numbers of
items.

Note that you provide insertion into a set then repeated popping as an
example, which is done faster by other methods.

> (I'm speaking of my point of view, which I believe is common. I don't
> expect programs I write in Python to be super-fast - if that were the
> case, I would write them in C. I do expect them to take a reasonable
> amount of time, and in the case of iteration over a set, that means a
> time proportional to the number of elements in the set.)

That is a reasonable point of view.  But realize that depending on the
shrinking strategy, popping/deletion will take ammortized 2+ times
longer than they do now, and whose benefits include (and are basically
limited to): memory cam be freed to the operating system, repeated
iteration over a resized-smaller dictionary can be faster.

> > Would the case be improved by incurring the time cost of 999,998 tests
> > for possible resizing (one for each pop) and some non-trivial number of
> > resize operations along the way (each requiring a full-iteration over
> > the then current size)?
> >
> I believe it would. It seems to me that those 999,998 tests take not
> much more than a machine clock, which means about 1 milisecond on
> todays computers. Those resize operations will take some more
> miliseconds. It all doesn't really matter, since probably all other
> things will take much more. I now run this code
> 
> >>> s = set()
> >>> for j in xrange(1000000):
> ...     s.add(j)
> ...
> >>> while s:
> ...     tmp = s.pop()
> ...
> 
> And it took 2.4 seconds to finish. And it's okay - I'm just saying
> that a few additional clock ticks per operation will usually not
> matter when the overall complexity is the same, but changes in order
> of complexity can matter much more.

Doing that while loop will take _longer_ with a constantly resizing set. 
The only way that resizing a dict/set as it gets smaller will increase
overall running speed is if iteration over the dict/set occurs anywhere
between 2-100 times (depending on the resizing factor)


> > Even if this unique case could be improved, what is the impact on common
> > cases?  Would a downsizing scheme risk thrashing with the
> > over-allocation scheme in cases with mixed adds and pops?
> >
> I think that there shouldn't be additional damage beyond those clock
> ticks. The simple method I took from "Introduction to Algorithms"
> works no matter what sequence of adds and pops you have.

You may get more memory fragmentation, depending on the underlying
memory manager.


> > Is there any new information/research beyond what has been obvious from
> > the moment the dict resizing scheme was born?
> 
> I wanted to say that there isn't any new information, and yet I don't
> think that I have to assume that everything in current Python is the
> best that can be. All I did was finding another reason why a
> downsizing scheme might be good, and posting it to ask if people have
> thought about it. If you have a document listing all the design
> decisions that went into dict implementation, then please send it to
> me and I won't ask about things that were already thought about.

See the source for dictobject.c and dictnotes.txt:
http://svn.python.org/view/python/trunk/Objects/dictobject.c?rev=39608&view=auto
http://svn.python.org/view/python/trunk/Objects/dictnotes.txt?rev=35428&view=auto


 - Josiah



More information about the Python-Dev mailing list