[Python-Dev] [Python-checkins] python/dist/src/Objectssetobject.c, 1.45, 1.46

Fri Aug 12 01:36:20 CEST 2005

[Jim Jewett]
> (1)  Is there a reason that you never shrink sets for
discard/remove/pop?

Yes, to avoid adding an O(n) step to what would otherwise be an O(1)
operation.  These tight, granular methods are so fast that even checking
for potential resizes would impact their performance.

Also, I was keeping the dict philosophy of shrinking only when an item
is added.  That approach prevents thrashing in the face of a series of
alternating add/pop operations.

OTOH, practicality beats purity.  Set differencing demands some
downsizing code (see below).

> (set difference will do a one-time shrink, if there are enough dummy
> entries, but even then, it doesn't look at the %filled, so a
> merge-related overallocation will stick around)
> 
> I note the you do the same with dicts, but I think sets are a more
> natural candidate for "this is the set of things I still have to
> process, in any order".  (I suppose enforcing an order with deque may
> be faster -- unless I'm worried about duplicates.)

It is all about balancing trade-offs.  Dummies have very little impact
on iteration speed, it is the used/(mask+1) sparseness ratio that
matters.  Also, they have very little impact on lookup time unless the
table is nearly full (and it affects not-found searches more than
successful searches).  Resizing is not a cheap operation.  The right
balance is very likely application dependent.  For now, my goal is to
deviate from dict code only for clear improvements (i.e. lookups based
on entry rather than just the key).

> (2)  When adding an element, you check that
> 
>     if (!(so->used > n_used && so->fill*3 >= (so->mask+1)*2))
> 
> Is there any reason to use that +1?  Without it, resizes will happen
> element sooner, but probably not much more often -- and you could
> avoid an add on every insert.
> (I suppose dictionaries have the same question.)

Without the +1, small dicts and sets could only hold four entries
instead of five (which has shown itself to be a better cutoff point).

Even if this didn't apply to sets, I still aspire to keep true to
dictobject.c.  That code has been thoroughly tested and tuned.  By
starting with mature code, I've saved years of evolution.  Also, there
is a maintenance benefit -- developers familiar with dictobject.c will
find setobject.c to be instantly recognizable.  There is only one new
trick, set_swap_bodies(), and that is thoroughly commented.

> (3)  In set_merge, when finding the new size, you use (so->fill +
other-
> >used)
> 
> Why so->fill?  If it is different from so->used, then the extras are
> dummy entries that it would be good to replace.
> (I note that dictobject does use ->used.)

The cop-out answer is that this is what is done in PyDict_Merge().

I believe the reasoning behind that design was to provide the best guess
as to the maximum amount of space that could be consumed by the
impending insertions.  If they will all fit, then resizing is skipped.
The approach reflects a design that values avoiding resizes more than it
values eliminating dummy entries.  AFAICT, dummy elimination is a
by-product of resizing rather than its goal.

With sets, I followed that design except for set differencing.  In
dictionaries, there is no parallel operation of mass deletion.  I had to
put in some control so that s-=t wouldn't leave a giant set with only a
handful of non-dummy entries.  This reflects the space saving goal for
the Py2.5 updates.  There was also a goal to eliminate redundant calls
to PyObject_Hash().  The nice performance improvement was an unexpected
bonus.

Your questions are good.  Thanks for reading the code and thinking about
it.  Hope you enjoy the new implementation which for the first time can
outperform dicts in terms of both space and speed.

Raymond