[python-win32] Identify unique data from sequence array

otrov dejan.org at gmail.com
Wed Dec 22 18:21:27 CET 2010


> I'm a unix guy.  That's what we call a sort-uniq operation, after the
> pipeline we'd use: sort datafile | uniq > uniq-lines.txt.  So I google that
> with python and ....

> As Jason Petrone wrote when he withdrew PEP 270  in
> http://www.python.org/dev/peps/pep-0270/:


> "creating a sequence without duplicates is just a matter of
> choosing a different data structure: a set instead of a list."


> At the time, sets.py was a nifty new thing.  Since then, the set datatype
> has
> been added to python's base.

> set() can consume a list of tuples, but not a list of lists, like the X you
> showed us.  You're job will be getting your massive list of lists into a
> list of tuples.

> This works, but for your very large arrays, may take large time:

> X = [[1,2], [1,2], [3,4], [3,4]]

> Y = set( [tuple(x) for x in X] )


> There may be faster methods.  The map() function might help, but I really
> don't know.  Here's something to try:

> Y = set( map(tuple, X )


> Or you can go old school route, from before the days of set(), that is:

> http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/


> Best,
> Mike

Thanks for your reply, but perhaps there is misunderstanding:

I don't want unique values, but unique sequence (block) of data that is repeated in array:

A B C D D D A B C D D D A B C D D D
|_________| |_________| |_________|
     |           |           |
   unique      unique      unique
  sequence    sequence    sequence
    data        data        data

I tested your approach and won't say it's slow. It works great but that's not what I'm after. Thanks anyway

Cheers



More information about the python-win32 mailing list