[Python-ideas] Uniquify attribute for lists
Andrew Barnert
abarnert at yahoo.com
Fri Nov 23 00:57:33 CET 2012
From: Joshua Landau <joshua.landau.ws at gmail.com>
Sent: Thu, November 22, 2012 2:34:30 PM
>
>On 22 November 2012 00:06, Andrew Barnert <abarnert at yahoo.com> wrote:
>
>From: Joshua Landau <joshua.landau.ws at gmail.com>
>>Sent: Sat, November 17, 2012 11:38:22 AM
>>
>>
>>>Surely the best choice is to have *two* caches; one for hashables and
>>>another for the rest.
>>
>>Your implementation does a try: hash() to decide whether to check the set or
>>the list, instead of just doing a try: item in set except: item in list. Is
>>there a reason for this? It's more complicated, and it's measurably slower.
> I did not realise that "[] in set()" raised an error! I'd just assumed it
> returned False.
I just realized that this doesn't seem to be documented anywhere. It's obvious
that set.add would have to raise for a non-hashable, but x in set could be
implemented as either raising or returning False without violating any of the
requirements at http://docs.python.org/3/library/stdtypes.html#set or anywhere
else that I can see…
I did a quick test with PyPy and Jython built-in sets, the old sets module, and
the Python recipe that used to be linked for pre-2.3 compat, and they all do the
same thing as CPython. (The pure Python versions are all just implemented as a
dict where all the values are True.) But that still doesn't necessarily
guarantee that it's safe in all possible future implementations…
Maybe the documentation should be updated to guarantee this—it's a useful thing
to rely on, all current implementations provide it, and it's hard to think of a
good reason why breaking it could improve performance or implementation
simplicity.
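For reference, the behavior in question is easy to demonstrate, and the
try/except fallback discussed above is only a few lines. A quick sketch (the
safe_contains helper name is mine, not from the thread):

```python
def safe_contains(item, seen_set, seen_list):
    """Membership test that falls back to a list scan for unhashables.

    In every implementation tested above, `x in some_set` raises
    TypeError for an unhashable x rather than returning False.
    """
    try:
        return item in seen_set
    except TypeError:  # unhashable, e.g. a list or dict
        return item in seen_list

# CPython today: membership on a set raises for unhashable operands.
try:
    [] in {1, 2, 3}
    raised = False
except TypeError:
    raised = True
```
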
> Well, I'd sort-of assumed that this included adding sorted collection to the
> mix, as it isn't in the standard library.
Yes, as I said later, that's the biggest reason not to consider it as a general
solution.
> You *cannot* assume that a data set has total ordering on the basis that it's
> working so far.
You're right. I was thinking that a sorted collection should reject adding
elements that aren't totally ordered with the existing elements… but now that I
think about it, there's obviously no way to do that in O(log N) time.
>> And
>>as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
>>which is obviously better, but probably slower for tiny M, and another 5-10%
>>overhead for inappropriate values.
>>
>
> Well yes... bar the fact that you may be using it on something with a
> non-even distribution of "things" where some types are not comparable to
> each other:
I didn't speak very precisely here, because it's hard to be concise, but the
total performance is O(A) + O(BlogM) + O(CN), where A is the number of hashable
elements, B is the number of non-hashable but sortable elements that are
comparable to the first non-hashable but sortable element, M is the number of
unique elements within B, C is the number of elements that are neither hashable
nor comparable with the elements of B, and N is the number of unique elements
within C.
The point is that, if a significant subset of the elements are in B, this will
be better than O(A)+O(CN); otherwise, it'll be the same. Just as O(A)+O(CN) is
better than O(CN) if a significant subset of the elements are in A, otherwise
the same. So, it's an improvement in the same way that adding the set is an
improvement.
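The three-tier structure behind that analysis can be sketched with stdlib
pieces only; here a bisect-maintained plain list stands in for a real sorted
collection (so insertion is O(M), not the O(log M) assumed above), and all
names are mine:

```python
from bisect import bisect_left, insort

def uniquify(iterable):
    """Yield first occurrences: a set for hashable items, a sorted
    list for unhashable-but-sortable items, a plain list for the rest."""
    seen_set = set()
    seen_sorted = []  # stand-in for an O(log M) sorted collection
    seen_list = []
    for item in iterable:
        try:
            if item in seen_set:
                continue
            seen_set.add(item)
        except TypeError:  # unhashable
            try:
                i = bisect_left(seen_sorted, item)
                if i < len(seen_sorted) and seen_sorted[i] == item:
                    continue
                insort(seen_sorted, item)
            except TypeError:  # not comparable with seen_sorted's elements
                if item in seen_list:
                    continue
                seen_list.append(item)
        yield item
```

Note that the first unhashable item always lands in seen_sorted (no
comparison happens against an empty list), which is exactly why B is defined
relative to the first non-hashable but sortable element.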
> *And* then there's the fact that sorted collections have intrinsically more
> overhead, and so are likely to give large overhead.
I mentioned that later (and you commented on it). When M is very small
(especially if B is also very large), there's a substantial added cost. Of
course the same is true for the set, but the crossover for the set happens
somewhere between 2-10 unique elements instead of 100, and the cost below that
crossover is much smaller.
>>At any rate, I tried a few different sorted collections. The performance for
>>single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist);
>>the crossover was around M=100, and you get 10x faster by around M=100K.
>>Deciding whether this is appropriate, and which implementation to use, and
>>so on… well, that's exactly why there's no sorted list in the stdlib in the
>>first place.
> Thank you for the numbers. May I ask what libraries you used?
* blist (PyPI): hybrid B+Tree/array
* pyavl (PyPI): AVL tree
* bintrees (PyPI): AVL tree and red-black tree
* opus7 (http://www.brpreiss.com/books/opus7/): AVL tree
* libc++ std::set (incomplete hacked-up Cython interface): red-black tree
* CHDataStructures (via PyObjC): not sure
* java.util.TreeSet (via jython): red-black tree
* java.util.concurrent.ConcurrentSkipListSet: skip-list
* QtCore.QMap (via PyQt4): skip-list
Some of these are hacked-up implementations that handle just enough of the
interface I need, in some cases even faking the comparisons, and in some cases
not even complete enough to run the real test (so I just tested the time to
test-and-insert M/B values). So, it's not a scientific test or anything, but
they were all in the expected ballpark (and the few that weren't turned out
not to have O(log N) insert time, so I threw them out).
The most thorough tests were with blist; I think I posted the complete numbers
before, but the short version is: 8.0x with 2 unique values; 1.1x with 64; 0.9x
with 128; 0.1x with 128K, all with 256K total values.
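The shape of a harness like that can be approximated with stdlib-only
stand-ins (a plain bisect-maintained list in place of blist; sizes are
shrunk for a quick run, and this times only the membership structures, not a
full order-preserving uniquify):

```python
import random
import timeit
from bisect import bisect_left, insort

def uniq_list(values):
    """O(NM) baseline: linear scan of previously seen values."""
    seen = []
    for v in values:
        if v not in seen:
            seen.append(v)
    return seen

def uniq_sorted(values):
    """O(NM) insert here, but O(log M) lookup, as a rough stand-in."""
    seen = []
    for v in values:
        i = bisect_left(seen, v)
        if i == len(seen) or seen[i] != v:
            insort(seen, v)
    return seen

N = 1 << 12  # much smaller than the 256K totals quoted above
for m in (2, 64, 1024):
    data = [random.randrange(m) for _ in range(N)]
    t_list = timeit.timeit(lambda: uniq_list(data), number=1)
    t_sorted = timeit.timeit(lambda: uniq_sorted(data), number=1)
    print(m, round(t_list / t_sorted, 1))
```

The crossover point will differ from the blist numbers above, but the trend
(the sorted structure losing at tiny M and winning as M grows) is the same.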