[Python-Dev] When do sets shrink?

Noam Raphael noamraph at gmail.com
Sun Jan 1 00:08:26 CET 2006


On 12/31/05, Raymond Hettinger <raymond.hettinger at verizon.net> wrote:
> [Noam]
> > For example, iteration over a set which once had
> > 1,000,000 members and now has 2 can take 1,000,000 operations every
> > time you traverse all the (2) elements.
>
> Do you find that to be a common or plausible use case?

I don't have a concrete example at the moment, but a set which is
repeatedly filled with elements and then emptied by pop operations
doesn't seem that far-fetched to me.
>
> Was Guido's suggestion of s=set(s) unworkable for some reason?  dicts
> and sets emphasize fast lookups over fast iteration -- apps requiring
> many iterations over a collection may be better off converting to a list
> (which has no dummy entries or empty gaps between entries).

It's workable, but most Python programmers haven't read that specific
message, and they simply expect operations that should take a short
time to take a short time. Converting to a list won't help the use
case above, and in any case it's another workaround I wouldn't expect
anyone to reach for - there's no reason iteration over a set should
take a long time.

(I'm speaking from my own point of view, which I believe is common. I
don't expect programs I write in Python to be super-fast - if that
were the case, I would write them in C. I do expect them to take a
reasonable amount of time, and in the case of iteration over a set,
that means a time proportional to the number of elements in the set.)
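
To make that concrete, here is a rough sketch of the kind of
experiment I mean (plain CPython, nothing fancy; the exact numbers
will of course vary from machine to machine):

import time

s = set(xrange(1000000))
while len(s) > 2:
    s.pop()              # the internal hash table keeps its large capacity

start = time.time()
for x in s:              # still has to scan the whole big table
    pass
print "iterating over the 2 leftovers: %.3f seconds" % (time.time() - start)

s = set(s)               # the suggested workaround: rebuild at the right size
start = time.time()
for x in s:
    pass
print "after s = set(s): %.3f seconds" % (time.time() - start)

The first loop visits two elements but has to walk the whole
million-slot table; after the rebuild it takes what I would call a
reasonable time.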
>
> Would the case be improved by incurring the time cost of 999,998 tests
> for possible resizing (one for each pop) and some non-trivial number of
> resize operations along the way (each requiring a full-iteration over
> the then current size)?
>
I believe it would. It seems to me that those 999,998 tests take not
much more than a clock cycle each, which adds up to about one
millisecond on today's computers. The resize operations will add a few
more milliseconds. None of this really matters much, since everything
else will probably take far longer. I just ran this code:

>>> s = set()
>>> for j in xrange(1000000):
...     s.add(j)
...
>>> while s:
...     tmp = s.pop()
...

It took 2.4 seconds to finish, and that's fine - my point is just that
a few extra clock ticks per operation usually won't matter when the
overall complexity stays the same, while a change in the order of
complexity can matter much more.
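
In script form, with rough timing added, the same experiment looks
like this (the 2.4 seconds above is on my machine; the exact figure
will of course vary):

import time

start = time.time()

s = set()
for j in xrange(1000000):
    s.add(j)

while s:
    tmp = s.pop()

elapsed = time.time() - start
# Roughly 2,000,000 set operations in total: 1,000,000 adds + 1,000,000 pops.
print "total: %.1f seconds, about %.1f microseconds per operation" % (
    elapsed, elapsed / 2000000 * 1e6)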

> Even if this unique case could be improved, what is the impact on common
> cases?  Would a downsizing scheme risk thrashing with the
> over-allocation scheme in cases with mixed adds and pops?
>
I think that there shouldn't be additional damage beyond those clock
ticks. The simple method I took from "Introduction to Algorithms"
works no matter what sequence of adds and pops you have.
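
To be explicit about the scheme I mean, here is my own toy rendering
of the textbook doubling/halving policy, applied to a plain list
rather than to the real hash table code: grow by a factor of two when
the table is full, shrink by a factor of two only when it falls below
a quarter full.

class ShrinkingTable:
    """Toy dynamic table with the textbook grow/shrink policy (not CPython's)."""

    def __init__(self):
        self.capacity = 8
        self.size = 0
        self.slots = [None] * self.capacity

    def _resize(self, new_capacity):
        new_slots = [None] * new_capacity
        new_slots[:self.size] = self.slots[:self.size]
        self.slots = new_slots
        self.capacity = new_capacity

    def append(self, item):
        if self.size == self.capacity:
            self._resize(2 * self.capacity)      # double when full
        self.slots[self.size] = item
        self.size += 1

    def pop(self):
        if self.size == 0:
            raise IndexError("pop from empty table")
        self.size -= 1
        item = self.slots[self.size]
        self.slots[self.size] = None
        # Shrink only when less than a quarter full, never below the minimum,
        # so a mixed sequence of appends and pops cannot thrash.
        if self.capacity > 8 and self.size < self.capacity // 4:
            self._resize(self.capacity // 2)
        return item

The point of the gap between the two thresholds is that every resize
leaves the table exactly half full, so at least capacity/4 operations
must happen before the next resize can trigger - that is where the
amortized O(1) bound comes from, for any order of adds and pops.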

> Is there any new information/research beyond what has been obvious from
> the moment the dict resizing scheme was born?

I wanted to say that there isn't any new information, and yet I don't
think I have to assume that everything in current Python is the best
it can be. All I did was find another reason why a downsizing scheme
might be good, and post it to ask whether people have thought about
it. If you have a document listing all the design decisions that went
into the dict implementation, then please send it to me and I won't
ask about things that have already been considered.

But actually, the answer is yes. I believe that the current dict
resizing scheme was born before the iterator protocol was introduced,
and that may be a reason why the current scheme doesn't try to
minimize the number of empty hash table entries.

Noam

