[Python-ideas] Implement __add__ for set and frozenset

Arnaud Delobelle arnodel at googlemail.com
Tue Jun 10 02:31:06 CEST 2008


On 10 Jun 2008, at 00:33, Terry Jones wrote:

>>>>>> "Arnaud" == Arnaud Delobelle <arnodel at googlemail.com> writes:
> Arnaud> I've written a patch [1] that does that.  Following the  
> suggestion
> Arnaud> of Raymond Hettinger, I've implemented set.intersection by  
> sorting
> Arnaud> all its sets/frozensets/dicts in increasing order of size  
> first,
> Arnaud> then iterating over the smallest.
>
> Hi Arnaud
>
> I don't know if you'll do any benchmarking on this, but I'd suggest:
>
> - First find the set with the smallest size (this is O(n) work).
>
> - If that size is sufficiently small and the number of sets is
>   sufficiently large (numbers to be determined by testing), don't  
> sort the
>   sets by size - just go for it.
>
> - Else do the sorting.
>
> The point being that if the smallest set is already quite small, the  
> size
> of the intersection is already tightly bounded and you're possibly  
> going to
> do an expensive sort that's really not needed. The O(n) work to find  
> the
> smallest is tiny compared to just blindly doing O(n lg n)  
> immediately. Most
> of the juice you get from moving from small to big sets comes from  
> starting
> with the smallest.
>

My first reaction is to agree with this.  Just finding the smallest  
hashtable might be enough, and I first set out to do just that.   
Raymond Hettinger suggested going from smallest to largest and I  
decided against having too many code paths, without any real rationale  
(or rather, I probably had in my mind that it would be used for a  
small number of big sets, rather than a big number of small sets).


> A few benchmarks should give an idea of when to sort.
>

One problem is that it is difficult to know what is a typical use of  
this.  I imagined that the number of sets would be small compared with  
their sizes.  It would be completely different if one had many small  
sets.

> BTW, having a quick look at your diff (not the patched source) it  
> looks
> like you're testing each of the elements of the smallest set against  
> all
> other hashtables. I haven't thought about it much, but that seems to  
> partly
> defeat the purpose of sorting. Speed will depend on the inputs, but  
> I'd
> have guessed that in general you should be testing each member of the
> smallest for presence in the next set, short-circuiting if empty, then
> testing each of the survivors against the next set, etc. That's more  
> of a
> "vertical" approach than the horizontal one you take across all the
> hashtables (possibly also with a speed benefit due to locality of
> reference).
>

You're right about short-circuiting (with iterables), it was planned  
but I forgot to put it in.  I was working on the patch last week but  
my son was taken to hospital in an emergency and everything has been a  
bit of a blur since then.  Tonight for the first time I had a bit of  
time so I decided it was the time to wrap it up and send it, but I  
think that it was a bit rushed.

I still believe short-circuiting has to be done "horizontally" when  
possible, to use your terminology.  If x belongs to the first two  
sets, then you know that their intersection is not empty, so you might  
as well test for membership of the third set straight away.  Although  
locality of reference may well be an important factor here, I don't  
know (last time I looked into these things, memory was fast and  
processors were slow - a long time ago!).

> Also, why not test first against the iterables that are not  
> hashtables?
> Wouldn't that be faster in the (common?) case of many sets being  
> passed for
> intersection?
>

I first wrote the code for hashtables, then added support for any  
iterable.  What you say is probably correct for many small iterables,  
but not for few big ones where you would have to go through the tedium  
of iterating through every element of each iterable (which is done in  
my implementation anyway because I forgot the short-circuiting!).

> Sorry if this is all clueless - it's just my thinking as I looked at  
> your
> diff.

No, these are considerations that I should have given more thought  
to.  Because it is the first time I modified a bit of Python code, I  
think I got bogged down in my problems with technicalities and forgot  
the bigger picture.

-- 
Arnaud




More information about the Python-ideas mailing list