[Python-ideas] Uniquify attribute for lists

Andrew Barnert abarnert at yahoo.com
Fri Nov 23 00:57:33 CET 2012


From: Joshua Landau <joshua.landau.ws at gmail.com>
Sent: Thu, November 22, 2012 2:34:30 PM
>
>On 22 November 2012 00:06, Andrew Barnert <abarnert at yahoo.com> wrote:
>
>From: Joshua Landau <joshua.landau.ws at gmail.com>
>>Sent: Sat, November 17, 2012 11:38:22 AM
>>
>>
>>>Surely the best choice is to have *two* caches; one for hashables and another
>>>for the rest.
>>
>>Your implementation does a try: hash() to decide whether to check the set or the
>>list, instead of just doing a try: item in set except: item in list. Is there a
>>reason for this? It's more complicated, and it's measurably slower.

> I did not realise that "[] in set()" raised an error! I'd just assumed it 
> returned False.

I just realized that this doesn't seem to be documented anywhere. It's obvious 
that set.add would have to raise for a non-hashable, but x in set could be 
implemented as either raising or returning False without violating any of the 
requirements at http://docs.python.org/3/library/stdtypes.html#set or anywhere 
else that I can see…
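For concreteness, here's the behavior in question, plus the simpler pattern I was
suggesting (a minimal sketch; the names are mine):

seen_set, seen_list = set(), []

def seen(item):
    # Membership test that falls back to a list scan for unhashable items.
    # This relies on "item in seen_set" raising TypeError, rather than
    # returning False, when item isn't hashable.
    try:
        return item in seen_set
    except TypeError:
        return item in seen_list

seen(1)      # False: hashable, so the set is checked
seen([1])    # False: unhashable, the set test raises, so the list is scanned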

I did a quick test with PyPy and Jython built-in sets, the old sets module, and 
the Python recipe that used to be linked for pre-2.3 compat, and they all do the 
same thing as CPython. (The pure Python versions are all just implemented as a 
dict where all the values are True.) But that still doesn't necessarily 
guarantee that it's safe in all possible future implementations…

Maybe the documentation should be updated to guarantee this—it's a useful thing 
to rely on, all current implementations provide it, and it's hard to think of a 
good reason why breaking it could improve performance or implementation 
simplicity.

> Well, I'd sort-of assumed that this included adding a sorted collection to the
> mix, as it isn't in the standard library.

Yes, as I said later, that's the biggest reason not to consider it as a general 
solution.
 
> You *cannot* assume that a data set has total ordering on the basis that it's 
> working so far.

You're right. I was thinking that a sorted collection should reject adding 
elements that aren't totally ordered with the existing elements… but now that I 
think about it, there's obviously no way to do that in O(log N) time.
 
>> And
>>as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
>>which is obviously better, but probably slower for tiny M, and another 5-10%
>>overhead for inappropriate values.
>>
>
> Well yes... bar the fact that you may be using it on something with a non-even
> distribution of "things" where some types are not comparable to each-other:

I didn't speak very precisely here, because it's hard to be concise, but the 
total performance is O(A) + O(B log M) + O(CN), where A is the number of hashable 
elements, B is the number of non-hashable but sortable elements that are 
comparable to the first non-hashable but sortable element, M is the number of 
unique elements within B, C is the number of elements that are neither hashable 
nor comparable with the elements of B, and N is the number of unique elements 
within C.

The point is that, if a significant subset of the elements are in B, this will 
be better than O(A)+O(CN); otherwise, it'll be the same. Just as O(A)+O(CN) is 
better than O(CN) if a significant subset of the elements are in A, otherwise 
the same. So, it's an improvement in the same way that adding the set is an 
improvement.
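To make that concrete, here's a rough sketch of the three-tier version (my own
code, not a definitive implementation): a set for the A elements, a sorted
structure for the B elements, and a plain list scan for the C elements. I'm using
bisect on an ordinary list as a stand-in for a real sorted collection, so the
middle tier below costs O(M) per insertion rather than O(log M); substituting
something like blist.sortedlist is what gets you the O(B log M) term.

from bisect import bisect_left, insort

def uniquify(iterable):
    seen_set = set()       # hashable elements (the A group)
    seen_sorted = []       # non-hashable but mutually comparable elements (B)
    seen_list = []         # everything else (C)
    for item in iterable:
        try:
            if item in seen_set:
                continue
            seen_set.add(item)
        except TypeError:                    # not hashable
            try:
                i = bisect_left(seen_sorted, item)
                if i < len(seen_sorted) and seen_sorted[i] == item:
                    continue
                insort(seen_sorted, item)
            except TypeError:                # not comparable with B either
                if item in seen_list:
                    continue
                seen_list.append(item)
        yield item

# e.g. list(uniquify([2, [1], 2, [1], {3}, {3}])) -> [2, [1], {3}]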

> *And* then there's the fact that sorted collections have intrinsically more 
> overhead, and so are likely to give large overhead.

I mentioned that later (and you commented on it). When M is very small 
(especially if B is also very large), there's a substantial added cost. Of 
course the same is true for the set, but the crossover for the set happens 
somewhere between 2-10 unique elements instead of 100, and the cost below that 
crossover is much smaller.

>>At any rate, I tried a few different sorted collections. The performance for
>>single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the
>>crossover was around M=100, and you get 10x faster by around M=100K. Deciding
>>whether this is appropriate, and which implementation to use, and so on… well,
>>that's exactly why there's no sorted list in the stdlib in the first place.

> Thank you for the numbers. May I ask what libraries you used?

* blist (PyPI): hybrid B+Tree/array
* pyavl (PyPI): AVL tree
* bintrees (PyPI): AVL tree and red-black tree
* opus7 (http://www.brpreiss.com/books/opus7/): AVL tree
* libc++ std::set (incomplete hacked-up Cython interface): red-black tree
* CHDataStructures (via PyObjC): not sure
* java.util.TreeSet (via jython): red-black tree
* java.util.concurrent.ConcurrentSkipListSet: skip-list
* QtCore.QMap (via PyQt4): skip-list

Some of these are hacked-up implementations that handle just enough of the 
interface I need, in some cases even faking the comparisons, and in some cases 
not complete enough to run the real test (so I just tested the time to 
test-and-insert B values M/B values). So, it's not a scientific test or 
anything, but they were all in the expected ballpark (and the few that weren't 
turned out not to have O(log N) insert time, so I threw them out).

The most thorough tests were with blist; I think I posted the complete numbers 
before, but the short version is: 8.0x with 2 unique values; 1.1x with 64; 0.9x 
with 128; 0.1x with 128K, all with 256K total values.
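
For anyone who wants to reproduce that kind of crossover measurement, a minimal
harness along these lines should do. This is just a sketch, not the code I
actually ran; it assumes the third-party blist package, uses ints purely to
exercise the data structures (the real use case is unhashable values), and uses
smaller sizes than the 256K-value runs above so the quadratic list-scan baseline
stays bearable.

import timeit
from random import randrange
from blist import sortedlist    # third-party: pip install blist

def uniq_scan(values):
    # O(N*M) baseline: linear scan of previously seen values
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen

def uniq_sorted(values):
    # O(N log M): binary search in a sorted collection
    seen = sortedlist()
    out = []
    for v in values:
        i = seen.bisect_left(v)
        if i == len(seen) or seen[i] != v:
            seen.add(v)
            out.append(v)
    return out

total = 32 * 1024
for m in (2, 64, 128, 8 * 1024):
    data = [randrange(m) for _ in range(total)]
    t_scan = timeit.timeit(lambda: uniq_scan(data), number=1)
    t_tree = timeit.timeit(lambda: uniq_sorted(data), number=1)
    print("M=%-6d sorted/list time ratio: %.2f" % (m, t_tree / t_scan))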


