[Python-ideas] Implement __add__ for set and frozenset

Terry Jones terry at jon.es
Tue Jun 10 01:33:07 CEST 2008


>>>>> "Arnaud" == Arnaud Delobelle <arnodel at googlemail.com> writes:
Arnaud> I've written a patch [1] that does that.  Following the suggestion
Arnaud> of Raymond Hettinger, I've implemented set.intersection by sorting
Arnaud> all its sets/frozensets/dicts in increasing order of size first,
Arnaud> then iterating over the smallest.

Hi Arnaud

I don't know if you'll do any benchmarking on this, but I'd suggest:

 - First find the set with the smallest size (this is O(n) work).

 - If that size is sufficiently small and the number of sets is
   sufficiently large (numbers to be determined by testing), don't sort the
   sets by size - just go for it.

 - Else do the sorting.

The point being that if the smallest set is already quite small, the size
of the intersection is already tightly bounded and you're possibly going to
do an expensive sort that's really not needed. The O(n) work to find the
smallest is tiny compared to just blindly doing O(n lg n) immediately. Most
of the juice you get from moving from small to big sets comes from starting
with the smallest.

A few benchmarks should give an idea of when to sort.

BTW, having a quick look at your diff (not the patched source) it looks
like you're testing each of the elements of the smallest set against all
other hashtables. I haven't thought about it much, but that seems to partly
defeat the purpose of sorting. Speed will depend on the inputs, but I'd
have guessed that in general you should be testing each member of the
smallest for presence in the next set, short-circuiting if empty, then
testing each of the survivors against the next set, etc. That's more of a
"vertical" approach than the horizontal one you take across all the
hashtables (possibly also with a speed benefit due to locality of
reference).

Also, why not test first against the iterables that are not hashtables?
Wouldn't that be faster in the (common?) case of many sets being passed for
intersection?

Sorry if this is all clueless - it's just my thinking as I looked at your
diff.

Regards,
Terry



More information about the Python-ideas mailing list