Re: [Python-ideas] Implement __add__ for set and frozenset

Hi Arnaud I don't know if you'll do any benchmarking on this, but I'd suggest: - First find the set with the smallest size (this is O(n) work). - If that size is sufficiently small and the number of sets is sufficiently large (numbers to be determined by testing), don't sort the sets by size - just go for it. - Else do the sorting. The point being that if the smallest set is already quite small, the size of the intersection is already tightly bounded and you're possibly going to do an expensive sort that's really not needed. The O(n) work to find the smallest is tiny compared to just blindly doing O(n lg n) immediately. Most of the juice you get from moving from small to big sets comes from starting with the smallest. A few benchmarks should give an idea of when to sort. BTW, having a quick look at your diff (not the patched source) it looks like you're testing each of the elements of the smallest set against all other hashtables. I haven't thought about it much, but that seems to partly defeat the purpose of sorting. Speed will depend on the inputs, but I'd have guessed that in general you should be testing each member of the smallest for presence in the next set, short-circuiting if empty, then testing each of the survivors against the next set, etc. That's more of a "vertical" approach than the horizontal one you take across all the hashtables (possibly also with a speed benefit due to locality of reference). Also, why not test first against the iterables that are not hashtables? Wouldn't that be faster in the (common?) case of many sets being passed for intersection? Sorry if this is all clueless - it's just my thinking as I looked at your diff. Regards, Terry

On 10 Jun 2008, at 00:33, Terry Jones wrote:
My first reaction is to agree with this. Just finding the smallest hashtable might be enough, and I first set out to do just that. Raymond Hettinger suggested going from smallest to largest and I decided against having too many code paths, without any real rationale (or rather, I probably had in my mind that it would be used for a small number of big sets, rather than a big number of small sets).
A few benchmarks should give an idea of when to sort.
One problem is that it is difficult to know what is a typical use of this. I imagined that the number of sets would be small compared with their sizes. It would be completely different if one had many small sets.
You're right about short-circuiting (with iterables), it was planned but I forgot to put it in. I was working on the patch last week but my son was taken to hospital in an emergency and everything has been a bit of a blur since then. Tonight for the first time I had a bit of time so I decided it was the time to wrap it up and send it, but I think that it was a bit rushed. I still believe short-circuiting has to be done "horizontally" when possible, to use your terminology. If x belongs to the first two sets, then you know that their intersection is not empty, so you might as well test for membership of the third set straight away. Although locality of reference may well be an important factor here, I don't know (last time I looked into these things, memory was fast and processors were slow - a long time ago!).
I first wrote the code for hashtables, then added support for any iterable. What you say is probably correct for many small iterables, but not for few big ones where you would have to go through the tedium of iterating through every element of each iterable (which is done in my implementation anyway because I forgot the short-circuiting!).
No, these are considerations that I should have given more thought to. Because it is the first time I modified a bit of Python code, I think I got bogged down in my problems with technicalities and forgot the bigger picture. -- Arnaud

On 10 Jun 2008, at 00:33, Terry Jones wrote:
My first reaction is to agree with this. Just finding the smallest hashtable might be enough, and I first set out to do just that. Raymond Hettinger suggested going from smallest to largest and I decided against having too many code paths, without any real rationale (or rather, I probably had in my mind that it would be used for a small number of big sets, rather than a big number of small sets).
A few benchmarks should give an idea of when to sort.
One problem is that it is difficult to know what is a typical use of this. I imagined that the number of sets would be small compared with their sizes. It would be completely different if one had many small sets.
You're right about short-circuiting (with iterables), it was planned but I forgot to put it in. I was working on the patch last week but my son was taken to hospital in an emergency and everything has been a bit of a blur since then. Tonight for the first time I had a bit of time so I decided it was the time to wrap it up and send it, but I think that it was a bit rushed. I still believe short-circuiting has to be done "horizontally" when possible, to use your terminology. If x belongs to the first two sets, then you know that their intersection is not empty, so you might as well test for membership of the third set straight away. Although locality of reference may well be an important factor here, I don't know (last time I looked into these things, memory was fast and processors were slow - a long time ago!).
I first wrote the code for hashtables, then added support for any iterable. What you say is probably correct for many small iterables, but not for few big ones where you would have to go through the tedium of iterating through every element of each iterable (which is done in my implementation anyway because I forgot the short-circuiting!).
No, these are considerations that I should have given more thought to. Because it is the first time I modified a bit of Python code, I think I got bogged down in my problems with technicalities and forgot the bigger picture. -- Arnaud
participants (2)
-
Arnaud Delobelle
-
Terry Jones