Uniquify attribute for lists

Hello, I just wanted to bring to your attention that an *attribute for removing duplicate elements* for lists would be a nice feature.

    def uniquify(lis):
        seen = set()
        seen_add = seen.add
        return [x for x in lis if x not in seen and not seen_add(x)]

The code is from this post <http://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-...>. Also check out this performance comparison <http://www.peterbe.com/plog/uniqifiers-benchmark> of uniquifying snippets. It would be useful to have a uniquify attribute for containers in general.

Best regards,
Robrecht
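
(A quick illustration of the proposed helper, assuming the definition above: duplicates are dropped and the order of first occurrence is kept. This session is added for clarity and is not part of the original message.)

    >>> uniquify([1, 2, 1, 3, 2, 4])
    [1, 2, 3, 4]
    >>> uniquify("mississippi")
    ['m', 'i', 's', 'p']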

On 16 November 2012 12:39, Paul Moore <p.f.moore@gmail.com> wrote:
This loses order. Both solutions suffer from the problem that they only work with hashable objects. Michael
-- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html

On 16/11/12 23:28, Robrecht De Rouck wrote:
That won't work for a general purpose function, because lists can hold unhashable items, and sets require hashable. Here's an old recipe predating sets that solves the problem in a number of different ways. Read the comments for versions that don't lose order. -- Steven
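
(To make the limitation concrete, here is roughly what the set-based uniquify above does with a list of lists; an illustrative session, not part of the original message.)

    >>> uniquify([[1], [2], [1]])
    Traceback (most recent call last):
      ...
    TypeError: unhashable type: 'list'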

On 17/11/12 05:21, Steven D'Aprano wrote:
Here's an old recipe predating sets that solves the problem in a number of different ways. Read the comments for versions that don't lose order.
And it would help if I actually included the URL. Sorry about that. http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/ -- Steven

I created a fork of more-itertools, with unique_everseen modified to work with non-hashable elements. It starts off assuming everything is hashable, and using a set; if it gets a TypeError, it converts the set to a list and carries on. Very simple, and it's within 5% of the speed of the original in the best case.
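
(A minimal sketch of that set-then-list fallback, to make the idea concrete; this is an illustration, not the actual fork, and the name is invented.)

    def unique_everseen_fallback(iterable):
        # Yield unique elements in order, tolerating unhashable items.
        seen = set()
        use_set = True
        for x in iterable:
            if use_set:
                try:
                    if x in seen:
                        continue
                    seen.add(x)
                except TypeError:
                    # First unhashable element: convert the set to a list and carry on.
                    seen = list(seen)
                    use_set = False
                    if x in seen:
                        continue
                    seen.append(x)
            else:
                if x in seen:
                    continue
                seen.append(x)
            yield x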
From a quick glance at your recipe, it looks like you had the same idea long ago.
Meanwhile, I've been thinking that you don't really have to fall all the way back to a list if things aren't hashable; there are plenty of intermediate steps:

- There are plenty of types that aren't generally hashable, but where you can be sure that your algorithm won't mutate them, and you can devise a hash function that satisfies the hashable requirement. For some easy examples, hash mutable sequences as (type(x), tuple(x)), mutable sets as (type(x), frozenset(x)), mutable mappings as (type(x), tuple(dict(x).items())), mutable buffers as (type(x), bytes(x)), etc. Or, if you have a bunch of your own mutable classes, maybe add a "fakehash" method to them and use that.

- As another option, a set of pickle.dumps(x) would work for many types, and it's still O(N), although with a huge multiplier, so it's not worth it unless len(sequence) >> avg(len(element)). Also, it's not guaranteed that x==y implies dumps(x)==dumps(y), so you'd need to restrict it to types for which this is known to be true.

- There are plenty of types that are not hashable but are fully ordered, and (type(x), x) is fully ordered as long as all of the types are, so in such cases you can use a sorted collection (blist.sortedset?) and get O(N log N) time.

Of course none of these works for all types, so you'd still have to fall back to linear searching through a list in some cases. At any rate, I don't think any of these alternatives needs to be added to a general-purpose uniquifier, but they should all be doable if your use case requires better performance for, e.g., a list of lists or a generator of mutable class objects or a huge collection of quickly-picklable objects.

Surely the best choice is to have *two* caches; one for hashables and another for the rest. This might be improvable with a *third* cache if some non-hashables had total ordering, but that could also introduce bugs I think. It'd also be a lot harder and likely be slower anyway.

Timings (approximate and rough):
"""
Using 16 elements
% Hashable:  100   90   66   33    0
Original:   1.47 1.00 1.06 1.09 1.01
w/ key=str: 3.73 1.91 2.12 3.15 3.00
New:        1.20 1.46 1.81 2.13 3.38
w/ key=str: 1.72 2.00 2.48 2.76 3.01

Using 64 elements
% Hashable:  100   90   66   33    0
Original:   1.15 1.29 1.61 1.64 1.43
w/ key=str: 1.98 2.50 3.09 3.55 3.99
New:        1.00 1.47 2.18 3.01 3.60
w/ key=str: 1.87 2.30 2.79 3.41 3.84

Using 256 elements
% Hashable:  100   90   66   33    0
Original:   2.70 3.66 5.19 5.34 4.41
w/ key=str: 4.06 5.07 6.26 6.93 6.98
New:        1.00 1.65 2.92 5.28 7.62
w/ key=str: 2.28 2.71 3.76 4.36 4.93

Using 1024 elements
% Hashable:  100   90   66   33    0
Original:   9.30 12.4 18.8 21.4 16.9
w/ key=str: 11.1 13.1 16.3 17.5 13.9
New:        1.00 1.84 6.20 13.1 19.8
w/ key=str: 2.31 2.79 3.59 4.50 5.16

Using 4096 elements
% Hashable:  100   90   66   33    0
Original:   33.7 44.3 69.1 79.4 60.5
w/ key=str: 36.7 44.2 59.3 60.1 40.4
New:        1.00 3.73 18.1 42.2 63.7
w/ key=str: 2.23 2.56 3.33 4.19 4.93

Using 16384 elements
% Hashable:  100   90   66   33    0
Original:   126. 173. 265. 313. 243.
w/ key=str: 136. 164. 215. 213. 147.
New:        1.00 12.5 68.6 173. 263.
w/ key=str: 2.24 2.60 3.28 4.14 4.80
"""

Code attached, unless I forget ;). No guarantees that it still works the same way, or works at all, or is the right file. Every item is repeated 5 times on average for any length of the list being unique-ified. I'll try it with this changed later.

Basically, the new version is faster on all but ~100% non-hashable lists when there are more than ~1000 elements, and on more-hashable lists it's quadratically faster. When slower, it's by only about 10% to 20%. Of course, specifying whether or not your list is fully hashable would be more efficient (probably 10%), but that's not as nice to use. Additionally, if you use key="" to uniquify by hashable values you are able to get a good speed-up with the new version. Example of use: iteruniq(<list of set>, key=frozenset)

With really small non-hashable lists, the original is significantly better (3x).

From: Joshua Landau <joshua.landau.ws@gmail.com> Sent: Sat, November 17, 2012 11:38:22 AM
Surely the best choice is to have *two* caches; one for hashables and another for the rest.
Your implementation does a try: hash() to decide whether to check the set or the list, instead of just doing a try: item in set except: item in list. Is there a reason for this? It's more complicated, and it's measurably slower.
This might be improvable with a *third* cache if some non-hashables had total ordering, but that could also introduce bugs I think. It'd also be a lot harder
and likely be slower anyway.
I agree that it's probably not worth adding to something in the standard library, or a recipe given in the documentation (in fact, I think I already said as much earlier in the thread), but I think you've got most of those facts wrong.

It's not a lot harder. The same 4 lines you have to add to do a try-set-except-list, you just do again, so it's try-set-except-try-sortedlist-except-list. And it won't introduce any bugs. And as for speed, it'll be O(NlogM) instead of O(NM) for N elements with M unique, which is obviously better, but probably slower for tiny M, and another 5-10% overhead for inappropriate values.

The problem is finding an appropriate sortedcollection type. There isn't one in the standard library. There's a link to an external SortedCollection reference in the bisect docs page, but that has O(N) insertion time, so it won't help. The most popular library I know of is blist.sortedlist, and that works, but it has quite a bit of overhead for searching tiny lists. As I understand it, the reason there isn't a standard sorted collection is the assumption that people dealing with huge sequences ought to expect to have some searching, comparing, and profiling of algorithms in their future, while those people dealing with len<10 sequences shouldn't have to think at all.

At any rate, I tried a few different sorted collections. The performance for single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist); the crossover was around M=100, and you get 10x faster by around M=100K. Deciding whether this is appropriate, and which implementation to use, and so on… well, that's exactly why there's no sorted list in the stdlib in the first place.
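
(To make the shape of that concrete, a rough sketch of the three-tier membership test, set first, then a sorted collection, then a plain list, might look like the following. blist.sortedlist is used purely as an example of a third-party sorted container, the function name is invented, and, as discussed below, the middle tier quietly assumes the unhashable items are mutually comparable.)

    from blist import sortedlist  # third-party; any container with O(log n) add/contains would do

    def unique_everseen_tiered(iterable):
        seen_set = set()            # hashable elements
        seen_sorted = sortedlist()  # unhashable but (hopefully) orderable elements
        seen_list = []              # everything else: linear search
        for x in iterable:
            try:
                if x in seen_set:
                    continue
                seen_set.add(x)
            except TypeError:          # not hashable
                try:
                    if x in seen_sorted:
                        continue
                    seen_sorted.add(x)
                except TypeError:      # not comparable with what's already in the sorted tier
                    if x in seen_list:
                        continue
                    seen_list.append(x)
            yield x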

On 22 November 2012 00:06, Andrew Barnert <abarnert@yahoo.com> wrote:
I did not realise that "[] in set()" raised an error! I'd just assumed it returned False. Thank you, this does make a small but significant difference.
Well, I'd sort-of assumed that this included adding sorted collection to the mix, as it isn't in the standard library.
And it won't introduce any bugs.
This took me a while to prove, so I'm proud of this:
You *cannot* assume that a data set has total ordering on the basis that it's working so far.
Well yes... bar the fact that you may be using it on something with a non-even distribution of "things" where some types are not comparable to each other:

    [ {1, 2}, [3, 4], [1, 2], [7, 4], [2, 3], (5, 2), [2, 1] ... ]

where you'll get nowhere near O(NlogM). *And* then there's the fact that sorted collections have intrinsically more overhead, and so are likely to give large overhead.

The problem is finding an appropriate sortedcollection type. There isn't
Thank you for the numbers. May I ask what libraries you used?

From: Joshua Landau <joshua.landau.ws@gmail.com> Sent: Thu, November 22, 2012 2:34:30 PM
I did not realise that "[] in set()" raised an error! I'd just assumed it returned False.
I just realized that this doesn't seem to be documented anywhere. It's obvious that set.add would have to raise for a non-hashable, but x in set could be implemented as either raising or returning False without violating any of the requirements at http://docs.python.org/3/library/stdtypes.html#set or anywhere else that I can see… I did a quick test with PyPy and Jython built-in sets, the old sets module, and the Python recipe that used to be linked for pre-2.3 compat, and they all do the same thing as CPython. (The pure Python versions are all just implemented as a dict where all the values are True.) But that still doesn't necessarily guarantee that it's safe in all possible future implementations… Maybe the documentation should be updated to guarantee this—it's a useful thing to rely on, all current implementations provide it, and it's hard to think of a good reason why breaking it could improve performance or implementation simplicity.
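
(For anyone following along, the behaviour in question is easy to check interactively; the exact message shown is CPython's and, as noted, is not formally guaranteed.)

    >>> [] in set()
    Traceback (most recent call last):
      ...
    TypeError: unhashable type: 'list'
    >>> [] in [[1], [2]]   # a plain list just compares with ==, so no error
    False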
Well, I'd sort-of assumed that this included adding sorted collection to the mix, as it isn't in the standard library.
Yes, as I said later, that's the biggest reason not to consider it as a general solution.
You *cannot* assume that a data set has total ordering on the basis that it's working so far.
You're right. I was thinking that a sorted collection should reject adding elements that aren't totally ordered with the existing elements… but now that I think about it, there's obviously no way to do that in O(log N) time.
distribution of "things" where some types are not comparable to each-other:
I didn't speak very precisely here, because it's hard to be concise, but the total performance is O(A) + O(BlogM) + O(CN), where A is the number of hashable elements, B is the number of non-hashable but sortable elements that are comparable to the first non-hashable but sortable element, M is the number of unique elements within B, C is the number of elements that are neither hashable nor comparable with the elements of B, and N is the number of unique elements within C. The point is that, if a significant subset of the elements are in B, this will be better than O(A)+O(CN); otherwise, it'll be the same. Just as O(A)+O(CN) is better than O(CN) if a significant subset of the elements are in A, otherwise the same. So, it's an improvement in the same way that adding the set is an improvement.
*And* then there's the fact that sorted collections have intrinsically more overhead, and so are likely to give large overhead.
At any rate, I tried a few different sorted collections. The performance for single-digit M was anywhere from 2.9x slower to 38x slower (8x with blist);
I mentioned that later (and you commented on it). When M is very small (especially if B is also very large), there's a substantial added cost. Of course the same is true for the set, but the crossover for the set happens somewhere between 2-10 unique elements instead of 100, and the cost below that crossover is much smaller.
Thank you for the numbers. May I ask what libraries you used?
* blist (PyPI): hybrid B+Tree/array
* pyavl (PyPI): AVL tree
* bintrees (PyPI): AVL tree and red-black tree
* opus7 (http://www.brpreiss.com/books/opus7/): AVL tree
* libc++ std::set (incomplete hacked-up Cython interface): red-black tree
* CHDataStructures (via PyObjC): not sure
* java.util.TreeSet (via Jython): red-black tree
* java.util.concurrent.ConcurrentSkipListSet: skip-list
* QtCore.QMap (via PyQt4): skip-list

Some of these are hacked-up implementations that only handle just enough of the interface I need, in some cases even faking the comparisons, and in some cases not even complete enough to run the real test (so I just tested the time to test-and-insert B values M/B values). So, it's not a scientific test or anything, but they were all in the expected ballpark (and the few that weren't turned out not to be O(log N) insert time, so I threw them out). The most thorough tests were with blist; I think I posted the complete numbers before, but the short version is: 8.0x with 2 unique values; 1.1x with 64; 0.9x with 128; 0.1x with 128K, all with 256K total values.

If I understand this right, the problem you want to solve is that there is no obvious way to uniquify lists that's order-preserving and efficient, so you want a good implementation to be added as an attribute of the list type. Right?

As others have pointed out, your implementation only works for lists with hashable elements, so no lists of lists, for example. Also, making it an attribute of list means you can't use it on, say, a tuple, or a dict key iterator, or a file. Why restrict it like that? I'd much rather have an itertools.uniquify(seq) than a list method. (If I'm just misreading your use of the word "attribute", I apologize.) And, once it's a separate function rather than a member of list, why do you want it to return a list rather than a generator?

All that being said, if getting this right is difficult enough that a bunch of people working together on a blog over 6 years didn't come up with a good version that supports non-hashable elements, maybe a good implementation does belong in the standard library itertools.

Sent from my iPhone

On Nov 16, 2012, at 4:28, Robrecht De Rouck <de.rouck.robrecht@gmail.com> wrote:

From: Andrew Barnert <abarnert@yahoo.com> Sent: Fri, November 16, 2012 11:15:18 AM
Actually, it looks like it's already there. The existing unique_everseen function in http://docs.python.org/3/library/itertools.html#itertools-recipes (also available from the more-itertools PyPI module at http://packages.python.org/more-itertools/api.html#more_itertools.unique_eve...) is an improvement on this idea. So, unless someone has done performance tests showing that the suggested implementation is significantly faster than unique_everseen (I suppose the "__contains__" vs. "in" might make a difference?), and this is a critical bottleneck for your app, I think the right way to write this function is: uniquify = more_itertools.unique_everseen Unfortunately, it's still not going to work on non-hashable elements. Maybe itertools (either the module or the documentation recipe list) needs a version that does?
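
(For reference, the unique_everseen recipe being referred to reads roughly like this in the itertools docs of that era; reproduced approximately, note the seen.__contains__ fast path alluded to above.)

    from itertools import filterfalse

    def unique_everseen(iterable, key=None):
        "List unique elements, preserving order. Remember all elements ever seen."
        # unique_everseen('AAAABBBCCDAABBB') --> A B C D
        # unique_everseen('ABBCcAD', str.lower) --> A B C D
        seen = set()
        seen_add = seen.add
        if key is None:
            for element in filterfalse(seen.__contains__, iterable):
                seen_add(element)
                yield element
        else:
            for element in iterable:
                k = key(element)
                if k not in seen:
                    seen_add(k)
                    yield element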

Here is my quick'n'dirty implementation of different variants (with/without key, preserving/not preserving order, accepting hashable-only/any items): http://wklej.org/id/872623/

Timeit-ed on my machine:

    $ python3.3 iteruniq.py
    test1_nokey_noorder_hashonly [0.08257626800332218, 0.08304202905856073, 0.08718552498612553]
    test2_nokey_noorder_universal [2.48601198696997, 2.4620621589710936, 2.453364996938035]
    test3_nokey_withorder_hashonly [0.3661507030483335, 0.3646505419164896, 0.36500189593061805]
    test4_nokey_withorder_universal [7.532308181049302, 7.397191203082912, 7.316833758028224]
    test5_withkey_noorder_hashonly [0.9567891559563577, 0.9690931889927015, 0.9598639439791441]
    test6_withkey_noorder_universal [3.142076837946661, 3.144917198107578, 3.150129645015113]
    test7_withkey_withorder_hashonly [0.9493958179373294, 0.9514245060272515, 0.9517305289627984]
    test8_withkey_withorder_universal [10.233501984039322, 10.404869885998778, 10.786898656049743]

Cheers.
*j

Comparing your test3 and test7 to the equivalent calls with the itertools recipe, it's about 32% slower in 2.7 and 46% slower in 3.3. But the added flexibility might easily make up for the cost—it certainly does if you need it… More interestingly, with this:

    if hashable_only is None:
        try:
            return iteruniq(iterable, key, preserve_order, True)
        except TypeError:
            return iteruniq(iterable, key, preserve_order, False)

… hashable_only=None is only 8% slower than hashable_only=False when you have non-hashables, and 91% faster when you don't. (And trying unique_everseen instead of iteruniq if hashable_only is None and preserve_order makes that 7% and 94%.)

The current unique_everseen is already by far the longest recipe on the itertools docs page, but it still might be worth updating with a synthesized best-of-all-options version, or actually adding to itertools instead of leaving as a recipe.

participants (10):
- Andrew Barnert
- Jan Kaliszewski
- Jasper St. Pierre
- Joshua Landau
- Masklinn
- Michael Foord
- Paul Moore
- Robrecht De Rouck
- Serhiy Storchaka
- Steven D'Aprano