[Python-ideas] Uniquify attribute for lists

Thu Nov 22 01:06:07 CET 2012

From: Joshua Landau <joshua.landau.ws at gmail.com>
Sent: Sat, November 17, 2012 11:38:22 AM

>Surely the best choice is two have *two* caches; one for hashables and another 
>for the rest.

Your implementation does a try: hash() to decide whether to check the set or the 
list, instead of just doing a try: item in set except: item in list. Is there a 
reason for this? It's more complicated, and it's measurably slower.

>This might be improvable with a *third* chache if some non-hashables had total 
>ordering, but that could also introduce bugs I think. It'd also be a lot harder 

>and likely be slower anyway.

I agree that it's probably not worth adding to something in the standard 
library, or a recipe given in the documentation (in fact, I think I already said 
as much earlier in the thread), but I think you've got most of those facts 
wrong.

It's not a lot harder. The same 4 lines you have to add to do a 
try-set-except-list, you just do again, so it's 
try-set-except-try-sortedlist-except-list. And it won't introduce any bugs. And 
as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique, 
which is obviously better, but probably slower for tiny M, and another 5-10% 
overhead for inappropriate values.

The problem is finding an appropriate sortedcollection type. There isn't one in 
the standard library. There's a link to an external SortedCollection reference 
in the bisect docs page, but that has O(N) insertion time, so it won't help. The 
most popular library I know of is blist.sortedlist, and that works, but it has 
quite a bit of overhead for searching tiny lists. As I understand it, the reason 
there isn't a standard sorted collection is the assumption that people dealing 
with huge sequences ought to expect  to have some searching, comparing, and 
profiling of algorithms in their  future, while those people dealing with len<10 
sequences shouldn't have to think at all.

At any rate, I tried a few different sorted collections. The performance for 
single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the 
crossover was around M=100, and you get 10x faster by around M=100K. Deciding 
whether this is appropriate, and which implementation to use, and so on… well, 
that's exactly why there's no sorted list in the stdlib in the first place.