[Python-Dev] When do sets shrink?

Josiah Carlson jcarlson at uci.edu
Thu Dec 29 18:52:41 CET 2005


Noam Raphael <noamraph at gmail.com> wrote:
> On 12/29/05, Fredrik Lundh <fredrik at pythonware.com> wrote:
> > Noam Raphael wrote:
> >
> > > I'm not saying that practically it must be used - I'm just saying that
> > > it can't be called a heuristic, and that it doesn't involve any "fancy
> > > overkill size hinting or history tracking". It actually means
> > > something like this:
> > > 1. If you want to insert and the table is full, resize the table to
> > > twice the current size.
> > > 2. If you delete and the number of elements turns out to be less than
> > > a quarter of the size of the table, resize the table to half of the
> > > current size.
> >
> > sure sounds like a heuristic algorithm to me... (as in "not guaranteed to
> > be optimal under all circumstances, even if it's probably quite good in all
> > practical cases")
> 
> I'm not saying it's optimal, but it is really amortized O(1) per
> insert/delete. I looked up in "Introduction to Algorithms" for this,
> and it has a complicated explanation. A simple explanation is that
> after every resize the table is exactly half-full. Let's say it has n
> elements and the table size is 2*n. To get to the next resize, you
> have to do at least n/2 removals of elements, or n insertions of
> elements. After that, you do a resize operation. In either case, you
> do an O(n) resize operation after at least O(n) insertions/removals
> which are O(1) operations. This means that the total cost remains O(n)
> per n simple operations, which you can say is O(1) per simple
> operation.
> 
> I hope that if you read this slowly it makes sense...

This is understood by (I would expect) most people here: hash tables
are amortized O(1) per operation on average, both theoretically and in
practice, as you state, but O(n) in the theoretical worst case, as
Martin states.
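
For illustration, here's a toy model (mine, not CPython's actual set or
dict code) of the double-when-full, halve-below-one-quarter policy; it
ignores hashing and element placement entirely and only counts the work
the resizes do:

    class ResizePolicy:
        MIN_SIZE = 8

        def __init__(self):
            self.size = self.MIN_SIZE   # slots in the table
            self.used = 0               # live elements
            self.moved = 0              # elements copied during resizes

        def _resize(self, new_size):
            self.moved += self.used     # a resize re-inserts every live element
            self.size = max(new_size, self.MIN_SIZE)

        def insert(self):
            if self.used == self.size:        # rule 1: full -> double
                self._resize(self.size * 2)
            self.used += 1

        def delete(self):
            self.used -= 1
            if self.used < self.size // 4:    # rule 2: under 1/4 full -> halve
                self._resize(self.size // 2)

    p = ResizePolicy()
    for _ in range(100000):
        p.insert()
    for _ in range(100000):
        p.delete()
    print(p.moved)   # stays proportional to the number of operations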

For resizing, a quick-and-dirty rule of thumb is that if you are
overallocating by a factor of f(n), the amount of work you will be
performing per insertion is ~c*n/f(n) (1 <= c <= 2).  As per recent
discussions on the lists, Python chooses f(n) to be about n/8 (at least
in the list case).  This means that each insertion costs around ~8
amortized memory copies as the list is resized upward, but practical
experience has shown that it also tends to minimize memory usage as the
list grows.
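
To put rough numbers on that rule of thumb, here is a back-of-the-
envelope simulation (the growth functions are illustrative, not
CPython's exact code) of amortized copies per append under ~n/8
overallocation versus doubling:

    def copies_per_append(n_appends, grow):
        allocated = used = copies = 0
        for _ in range(n_appends):
            if used == allocated:        # buffer is full: reallocate
                allocated = grow(used + 1)
                copies += used           # a realloc may move every element
            used += 1
        return copies / n_appends

    eighth = lambda n: n + (n >> 3) + 6   # ~12.5% overallocation (f(n) ~ n/8)
    double = lambda n: max(2 * n, 8)      # 100% overallocation   (f(n) = n)

    print(copies_per_append(1000000, eighth))  # roughly 8 copies per append
    print(copies_per_append(1000000, double))  # about 1-2 copies per append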

It's all about tradeoffs.  Increasing general memory usage for the sake
of a lower constant or not is a tradeoff.  As is resizing or not
resizing as a list gets smaller.

Would changing the overallocation strategy change Python's performance?
Likely, but possibly not noticeably.  Would it change Python's memory
usage?  Yes: the vast majority of list use would cause larger
allocations than the list allocator currently performs.  Want to test
it?  Create a micro-benchmark that tests repeated-append performance
under the two list allocation strategies, and remember to note Python's
memory usage after each test.
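
A rough stand-in for such a micro-benchmark (CPython's list allocator
can't be swapped out from pure Python, so both strategies are emulated
with plain lists of preallocated None slots, and only the growth rule
differs; timings are indicative at best):

    import sys
    import timeit

    N = 1000000

    def fill(grow):
        buf, used = [None] * 8, 0
        for i in range(N):
            if used == len(buf):
                new = [None] * grow(used + 1)   # "reallocate" and copy over
                new[:used] = buf
                buf = new
            buf[used] = i
            used += 1
        return buf

    eighth = lambda n: n + (n >> 3) + 6   # ~12.5% overallocation
    double = lambda n: 2 * n              # 100% overallocation

    for name, grow in (("eighth", eighth), ("double", double)):
        secs = timeit.timeit(lambda: fill(grow), number=3)
        print(name, "%.2fs" % secs, "bytes:", sys.getsizeof(fill(grow)))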

 - Josiah


