[Python-ideas] Uniquify attribute for lists

Andrew Barnert abarnert at yahoo.com
Sat Nov 17 04:17:59 CET 2012


I created a fork of more-itertools, with unique_everseen modified to work with 
non-hashable elements. It starts off assuming everything is hashable, and using 
a set; if it gets a TypeError, it converts the set to a list and carries on. 
Very simple, and it's within 5% of the speed of the original in the best case.

>From a quick glance at your recipe, it looks like you had the same idea long 
ago.

Meanwhile, I've been thinking that you don't really have to fall all the way 
back to a list if things aren't hashable; there are plenty of intermediate 
steps:

There are plenty of types that aren't generally hashable, but you can be sure 
that your algorithm won't mutate them, and you can devise a hash function that 
guarantees the hashable requirement. For some easy examples, hash mutable 
sequences as (type(x), tuple(x)), mutable sets as (type(x), frozenset(x)), 
mutable mappings as (type(x), tuple(dict(x).items())), mutable buffers as 
(type(x), bytes(x)), etc. Or, if you have a bunch of your own mutable classes, 
maybe add a "fakehash" method to them and use that.

As another option, a set of pickle.dumps(x) would work for many types, and it's 
still O(N), although with a huge multiplier, so it's not worth it 
unless len(sequence) >> avg(len(element)). Also, it's not guaranteed that x==y 
implies dumps(x)==dumps(y), so you'd need to restrict it to types for which this 
is known to be true.

There are plenty of types that are not hashable, but are fully ordered, and 
(type(x), x) is fully ordered as long as all of the types are, so in such cases 
you can use a sorted collection (blist.sortedset?) and get O(N log N) time.

Of course none of these works for all types, so you'd still have to fall back to 
linear searching through a list in some cases.

At any rate, I don't think any of these alternatives needs to be added to a 
general-purpose uniquifier, but they should all be doable if your use case 
requires better performance for, e.g., a list of lists or a generator of mutable 
class objects or a huge collection of quickly-picklable objects.


----- Original Message ----
> From: Steven D'Aprano <steve at pearwood.info>
> To: python-ideas at python.org
> Sent: Fri, November 16, 2012 5:18:02 PM
> Subject: Re: [Python-ideas] Uniquify attribute for lists
> 
> On 17/11/12 05:21, Steven D'Aprano wrote:
> 
> > Here's an old recipe  predating sets that solves the problem in a number
> > of different ways.  Read the comments for versions that don't lose order.
> 
> And it would help  if I actually included the URL. Sorry about  that.
> 
> 
> http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
> 
> 
> 
> 
> --  Steven
> _______________________________________________
> Python-ideas  mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
> 



More information about the Python-ideas mailing list