[Python-ideas] Uniquify attribute for lists
abarnert at yahoo.com
Thu Nov 22 01:06:07 CET 2012
From: Joshua Landau <joshua.landau.ws at gmail.com>
Sent: Sat, November 17, 2012 11:38:22 AM
>Surely the best choice is two have *two* caches; one for hashables and another
>for the rest.
Your implementation does a try: hash() to decide whether to check the set or the
list, instead of just doing a try: item in set except: item in list. Is there a
reason for this? It's more complicated, and it's measurably slower.
>This might be improvable with a *third* chache if some non-hashables had total
>ordering, but that could also introduce bugs I think. It'd also be a lot harder
>and likely be slower anyway.
I agree that it's probably not worth adding to something in the standard
library, or a recipe given in the documentation (in fact, I think I already said
as much earlier in the thread), but I think you've got most of those facts
It's not a lot harder. The same 4 lines you have to add to do a
try-set-except-list, you just do again, so it's
try-set-except-try-sortedlist-except-list. And it won't introduce any bugs. And
as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique,
which is obviously better, but probably slower for tiny M, and another 5-10%
overhead for inappropriate values.
The problem is finding an appropriate sortedcollection type. There isn't one in
the standard library. There's a link to an external SortedCollection reference
in the bisect docs page, but that has O(N) insertion time, so it won't help. The
most popular library I know of is blist.sortedlist, and that works, but it has
quite a bit of overhead for searching tiny lists. As I understand it, the reason
there isn't a standard sorted collection is the assumption that people dealing
with huge sequences ought to expect to have some searching, comparing, and
profiling of algorithms in their future, while those people dealing with len<10
sequences shouldn't have to think at all.
At any rate, I tried a few different sorted collections. The performance for
single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the
crossover was around M=100, and you get 10x faster by around M=100K. Deciding
whether this is appropriate, and which implementation to use, and so on… well,
that's exactly why there's no sorted list in the stdlib in the first place.
More information about the Python-ideas