Hi, Numpy's unique(x) returns an array x with repetitions removed. However, since it returns asarray(dict.keys()), the resulting array is not sorted; worse, the original order may not be conserved. I think that unique() should return a sorted array, like its matlab homonym. Regards, David Huard
I agree. Currently the order of the output of unique is undefined. Defining it in such a way that it produces a sorted array will not break any compatibility.

My idea would be something like:

def unique(arr, sort=True):
    if hasattr(arr, 'flatten'):
        tmp = arr.flatten()
        tmp.sort()
        idx = concatenate(([True], tmp[1:] != tmp[:-1]))
        return tmp[idx]
    else:  # for compatibility:
        set = {}
        for item in arr:
            set[item] = None
        if sort:
            return asarray(sorted(set.keys()))
        else:
            return asarray(set.keys())

Does anybody know about the internals of the python "set"? How is .keys() implemented? I somehow have real doubts about the efficiency of this method.

David Huard wrote:
Hi,
Numpy's unique(x) returns an array x with repetitions removed. However, since it returns asarray(dict.keys()), the resulting array is not sorted; worse, the original order may not be conserved. I think that unique() should return a sorted array, like its matlab homonym.
Regards,
David Huard 
_______________________________________________ Numpydiscussion mailing list Numpydiscussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpydiscussion
Norbert Nemec wrote:
I agree.
Currently the order of the output of unique is undefined. Defining it in such a way that it produces a sorted array will not break any compatibility.
My idea would be something like
def unique(arr, sort=True):
    if hasattr(arr, 'flatten'):
        tmp = arr.flatten()
        tmp.sort()
        idx = concatenate(([True], tmp[1:] != tmp[:-1]))
        return tmp[idx]
    else:  # for compatibility:
        set = {}
        for item in arr:
            set[item] = None
        if sort:
            return asarray(sorted(set.keys()))
        else:
            return asarray(set.keys())
Does anybody know about the internals of the python "set"? How is .keys() implemented? I somehow have real doubts about the efficiency of this method.
Well, that's a dictionary, not a set, but they both use the same algorithm. They are both hash tables. If you need more specific details about how the hash tables are implemented, the source (Objects/dictobject.c) is the best place for them.

Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 7/2/06, Norbert Nemec <Norbert.Nemec.list@gmx.de> wrote:
... Does anybody know about the internals of the python "set"? How is .keys() implemented? I somehow have real doubts about the efficiency of this method.
Set implementation (Objects/setobject.c) is a copy-and-paste job from dictobject with values removed. As a result it is heavily optimized for the case of string-valued keys - a case that is almost irrelevant for numpy. I think something like the following (untested, 1d only) will probably be much faster and sorted:

def unique(x):
    s = sort(x)
    r = empty_like(s)
    r[:-1] = s[1:]
    r[-1] = s[0]
    return s[r != s]
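[A runnable reconstruction of Sasha's shift-and-compare sketch, with the minus signs that the archive garbled restored; the name unique_shift is mine, for illustration only. It also exhibits the all-identical edge case reported later in the thread.]

```python
import numpy as np

def unique_shift(x):
    # Sort, then compare each element with its successor; for sorted data,
    # an element whose successor differs is the last of its run of duplicates.
    s = np.sort(x)
    r = np.empty_like(s)
    r[:-1] = s[1:]   # shift left by one position
    r[-1] = s[0]     # wrap the first element around to the end
    return s[r != s]

print(unique_shift(np.array([3, 1, 2, 1, 3])))  # -> [1 2 3]
print(unique_shift(np.array([7, 7, 7])))        # -> [] (the all-identical bug)
```

The wrap-around last element is why an array of identical values comes back empty: r equals s everywhere, so nothing survives the mask.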
Sasha wrote:
On 7/2/06, Norbert Nemec <Norbert.Nemec.list@gmx.de> wrote:
... Does anybody know about the internals of the python "set"? How is .keys() implemented? I somehow have real doubts about the efficiency of this method.
Set implementation (Objects/setobject.c) is a copy-and-paste job from dictobject with values removed. As a result it is heavily optimized for the case of string-valued keys - a case that is almost irrelevant for numpy.
I think something like the following (untested, 1d only) will probably be much faster and sorted:
def unique(x):
    s = sort(x)
    r = empty_like(s)
    r[:-1] = s[1:]
    r[-1] = s[0]
    return s[r != s]
There are 1d array set operations like this already in numpy (numpy/lib/arraysetops.py: unique1d, ...)

r.
Here is a quick benchmark between numpy's unique, unique1d and Sasha's unique:

x = rand(100000)*100
x = x.astype('i')

%timeit unique(x)
10 loops, best of 3: 525 ms per loop

%timeit unique_sasha(x)
100 loops, best of 3: 10.7 ms per loop

%timeit unique1d(x)
100 loops, best of 3: 12.6 ms per loop

So I wonder what is the added value of unique? Could unique1d simply become unique?

Cheers,

David

P.S. I modified Sasha's version to account for the case where all elements are identical, which returned an empty array.

def unique_sasha(x):
    s = sort(x)
    r = empty(s.shape, float)
    r[:-1] = s[1:]
    r[-1] = NaN
    return s[r != s]

2006/7/3, Robert Cimrman <cimrman3@ntc.zcu.cz>:
Sasha wrote:
On 7/2/06, Norbert Nemec <Norbert.Nemec.list@gmx.de> wrote:
... Does anybody know about the internals of the python "set"? How is .keys() implemented? I somehow have real doubts about the efficiency of this method.
Set implementation (Objects/setobject.c) is a copy-and-paste job from dictobject with values removed. As a result it is heavily optimized for the case of string-valued keys - a case that is almost irrelevant for numpy.
I think something like the following (untested, 1d only) will probably be much faster and sorted:
def unique(x):
    s = sort(x)
    r = empty_like(s)
    r[:-1] = s[1:]
    r[-1] = s[0]
    return s[r != s]
There are 1d array set operations like this already in numpy (numpy/lib/arraysetops.py: unique1d, ...)
r.
David Huard wrote:
Here is a quick benchmark between numpy's unique, unique1d and Sasha's unique:

x = rand(100000)*100
x = x.astype('i')

%timeit unique(x)
10 loops, best of 3: 525 ms per loop

%timeit unique_sasha(x)
100 loops, best of 3: 10.7 ms per loop

%timeit unique1d(x)
100 loops, best of 3: 12.6 ms per loop

So I wonder what is the added value of unique? Could unique1d simply become unique?
It looks like unique1d and friends could use some facelifting with new numpy features like boolean indexing :)

r.
unique1d is based on ediff1d, so it really calculates many differences and compares those to 0.0. This is inefficient, even though it is hidden by the general inefficiency of Python (it might be the reason for the two milliseconds, though).

What is more: subtraction works only for numbers, while the various proposed versions use only comparison, which works for any data type (as long as it can be sorted).

My own version tried to capture all possible cases that the current unique captures. Sasha's version only works for numpy arrays and has a problem for arrays with all identical entries. David's version only works for numpy arrays of types that can be converted to float.

I would once more propose to use my own version as given before:

def unique(arr, sort=True):
    if hasattr(arr, 'flatten'):
        tmp = arr.flatten()
        tmp.sort()
        idx = concatenate(([True], tmp[1:] != tmp[:-1]))
        return tmp[idx]
    else:  # for compatibility:
        set = {}
        for item in arr:
            set[item] = None
        if sort:
            return asarray(sorted(set.keys()))
        else:
            return asarray(set.keys())

Greetings, Norbert
On Tue, 11 Jul 2006, Norbert Nemec apparently wrote:
else:  # for compatibility:
    set = {}
    for item in arr:
        set[item] = None
    if sort:
        return asarray(sorted(set.keys()))
    else:
        return asarray(set.keys())
I'm currently in major caffeine deficit, but aside from backward compatibility, how is this better than:

else:  # for compatibility
    items = list(set(inseq))
    if sort:
        items.sort()
    return asarray(items)

Alan Isaac

PS Actually, making a list of a set may already sort? No time to check now ...
PPS For 2.3, need set = sets.Set
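[Alan's fragment, fleshed out into a self-contained sketch; the name unique_fallback is mine, not anything in numpy. On his PS: list(set(...)) comes back in hash-table order, not sorted, so the explicit sort is still needed.]

```python
import numpy as np

def unique_fallback(inseq, sort=True):
    # Collapse duplicates via a set, then optionally sort the survivors.
    items = list(set(inseq))
    if sort:
        items.sort()
    return np.asarray(items)

print(unique_fallback([3, 1, 2, 1, 3]))  # -> [1 2 3]
```

On Python 2.3 one would first do `from sets import Set as set`, as the PPS notes.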
Norbert Nemec wrote:
unique1d is based on ediff1d, so it really calculates many differences and compares those to 0.0
This is inefficient, even though this is hidden by the general inefficiency of Python (It might be the reason for the two milliseconds, though)
What is more: subtraction works only for numbers, while the various proposed versions use only comparison which works for any data type (as long as it can be sorted)
My first question is: why? What's the attraction in returning a sorted answer here? Returning an unsorted array is potentially faster, depending on the algorithm chosen, and sorting after the fact is trivial. If one were going to spend extra complexity on something, I'd think it would be better spent on preserving the input order.

Second, some objects can be compared for equality and hashed, but not sorted (Python's complex numbers come to mind). If one is going to worry about subtraction so as to keep things general, it makes sense to also avoid sorting as well, Sasha's slick algorithm notwithstanding.

Third, whatever the outcome of the sorting issue, I would propose that unique have the same interface as the other structural array operations. That is:

unique(anarray, axis=0): ...

The default axis=0 is for compatibility with the other, somewhat similar functions. axis=None would return the flattened, uniquified data; axis=# would uniquify the result along that axis.

Regards,

tim
My own version tried to capture all possible cases that the current unique captures.
Sasha's version only works for numpy arrays and has a problem for arrays with all identical entries.
David's version only works for numpy arrays of types that can be converted to float.
I would once more propose to use my own version as given before:
def unique(arr, sort=True):
    if hasattr(arr, 'flatten'):
        tmp = arr.flatten()
        tmp.sort()
        idx = concatenate(([True], tmp[1:] != tmp[:-1]))
        return tmp[idx]
    else:  # for compatibility:
        set = {}
        for item in arr:
            set[item] = None
        if sort:
            return asarray(sorted(set.keys()))
        else:
            return asarray(set.keys())
Greetings, Norbert
Tim Hochberg wrote:
Norbert Nemec wrote:
unique1d is based on ediff1d, so it really calculates many differences and compares those to 0.0
This is inefficient, even though this is hidden by the general inefficiency of Python (It might be the reason for the two milliseconds, though)
What is more: subtraction works only for numbers, while the various proposed versions use only comparison which works for any data type (as long as it can be sorted)
My first question is: why? What's the attraction in returning a sorted answer here? Returning an unsorted array is potentially faster, depending on the algorithm chosen, and sorting after the fact is trivial. If one was going to spend extra complexity on something, I'd think it would be better spent on preserving the input order.
Second, some objects can be compared for equality and hashed, but not sorted (Python's complex numbers come to mind). If one is going to worry about subtraction so as to keep things general, it makes sense to also avoid sorting as well, Sasha's slick algorithm notwithstanding.
Third, I propose that whatever the outcome of the sorting issue, I would propose that unique have the same interface as the other structural array operations. That is:
unique(anarray, axis=0): ...
The default axis=0 is for compatibility with the other, somewhat similar functions. Axis=None would return the flattened, uniquified data, axis=# would uniquify the result along that axis.
Hmmm. Of course that precludes it returning an actual array for axis!=None. That might be considered suboptimal... tim
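[The axis=0 case Tim describes - dropping duplicate rows - does still yield a rectangular array; a rough sketch follows. unique_axis0 and the lexsort-plus-adjacent-comparison approach are my own illustration, not an existing numpy function of the time.]

```python
import numpy as np

def unique_axis0(a):
    # Uniquify along axis 0, i.e. drop duplicate rows (hypothetical sketch).
    a = np.asarray(a)
    order = np.lexsort(a.T[::-1])        # sort rows lexicographically
    s = a[order]
    keep = np.ones(len(s), dtype=bool)
    keep[1:] = np.any(s[1:] != s[:-1], axis=1)  # row differs from predecessor
    return s[keep]

print(unique_axis0([[1, 2], [1, 2], [0, 3]]))
```

For a general axis=#, though, different slices can contain different numbers of unique elements, which is exactly the non-rectangular-result problem raised above.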
Regards,
tim
My own version tried to capture all possible cases that the current unique captures.
Sasha's version only works for numpy arrays and has a problem for arrays with all identical entries.
David's version only works for numpy arrays of types that can be converted to float.
I would once more propose to use my own version as given before:
def unique(arr, sort=True):
    if hasattr(arr, 'flatten'):
        tmp = arr.flatten()
        tmp.sort()
        idx = concatenate(([True], tmp[1:] != tmp[:-1]))
        return tmp[idx]
    else:  # for compatibility:
        set = {}
        for item in arr:
            set[item] = None
        if sort:
            return asarray(sorted(set.keys()))
        else:
            return asarray(set.keys())
Greetings, Norbert
Tim Hochberg wrote:
Norbert Nemec wrote:
unique1d is based on ediff1d, so it really calculates many differences and compares those to 0.0
This is inefficient, even though this is hidden by the general inefficiency of Python (It might be the reason for the two milliseconds, though)
What is more: subtraction works only for numbers, while the various proposed versions use only comparison which works for any data type (as long as it can be sorted)
My first question is: why? What's the attraction in returning a sorted answer here? Returning an unsorted array is potentially faster, depending on the algorithm chosen, and sorting after the fact is trivial. If one was going to spend extra complexity on something, I'd think it would be better spent on preserving the input order.
One issue is that uniquifying numpy arrays using Python dicts, while expected O(n) in terms of complexity, is really slow compared to sorting because of the overhead in getting the elements out of the numpy arrays and into Python objects. For the cases where sorting works (your caveat below is quite correct), it's really quite good for numpy arrays. OTOH, if one were to implement a hash table in C, that might potentially be faster and more robust, but that is spending a *large* amount of extra complexity.
Second, some objects can be compared for equality and hashed, but not sorted (Python's complex numbers come to mind). If one is going to worry about subtraction so as to keep things general, it makes sense to also avoid sorting as well, Sasha's slick algorithm notwithstanding.
 Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."  Umberto Eco
Norbert Nemec wrote:
unique1d is based on ediff1d, so it really calculates many differences and compares those to 0.0
This is inefficient, even though this is hidden by the general inefficiency of Python (It might be the reason for the two milliseconds, though)
What is more: subtraction works only for numbers, while the various proposed versions use only comparison which works for any data type (as long as it can be sorted)
I agree that unique1d works only for numbers, but that is what it was meant for... well, for integers only, in fact - everyone here surely knows that comparing floats with != does not work well. Note also that it was written before logical indexing and other neat stuff were possible in numpy - every improvement is welcome! (btw. I cannot recall why I used subtraction and testing for zero instead of just comparisons - maybe remnants from my old matlab days and its logical arrays - ediff1d should disappear from the other functions in arraysetops)
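[The facelift mentioned here - comparisons plus boolean indexing in place of ediff1d's subtraction - might look roughly like this. This is a sketch of the idea, not the arraysetops implementation; the name unique1d_cmp is mine.]

```python
import numpy as np

def unique1d_cmp(arr):
    # Sort, then keep the first element of each run of equal values.
    # Only comparisons are used, so any sortable dtype works (strings too),
    # unlike the subtraction in ediff1d, which needs numeric types.
    s = np.sort(np.asarray(arr).ravel())
    if s.size == 0:
        return s
    mask = np.empty(s.shape, dtype=bool)
    mask[0] = True                 # the first element always starts a run
    mask[1:] = s[1:] != s[:-1]     # True where a new value begins
    return s[mask]

print(unique1d_cmp(np.array(['b', 'a', 'b'])))  # works on strings as well
```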
My own version tried to capture all possible cases that the current unique captures.
Sasha's version only works for numpy arrays and has a problem for arrays with all identical entries.
David's version only works for numpy arrays of types that can be converted to float.
comparing floats...
I would once more propose to use my own version as given before:
def unique(arr, sort=True):
    if hasattr(arr, 'flatten'):
        tmp = arr.flatten()
        tmp.sort()
        idx = concatenate(([True], tmp[1:] != tmp[:-1]))
        return tmp[idx]
    else:  # for compatibility:
        set = {}
        for item in arr:
            set[item] = None
        if sort:
            return asarray(sorted(set.keys()))
        else:
            return asarray(set.keys())
Have you considered using set instead of dict? Just curious :) r.
Tim Hochberg wrote:
My first question is: why? What's the attraction in returning a sorted answer here? Returning an unsorted array is potentially faster, depending on the algorithm chosen, and sorting after the fact is trivial. If one was going to spend extra complexity on something, I'd think it would be better spent on preserving the input order.
There is a unique function in matlab that returns a sorted vector. I think a lot of people will expect numpy and matlab functions with identical names to behave similarly. If we want to preserve the input order, we'd have to choose a convention about which value's order is retained: do we keep the order of the first value found or the last one?

Here is the benchmark. Sorry, Norbert, for not including your code the first time; it turns out that with Alan's suggestion, it's the fastest one both for lists and arrays.

x = rand(100000)*100
x = x.astype('i')
l = list(x)

For array x:

In [166]: timeit unique_alan(x)  # with set instead of dict
100 loops, best of 3: 8.8 ms per loop
In [167]: timeit unique_norbert(x)
100 loops, best of 3: 8.8 ms per loop
In [168]: timeit unique_sasha(x)
100 loops, best of 3: 10.8 ms per loop
In [169]: timeit unique(x)
10 loops, best of 3: 50.4 ms per loop
In [170]: timeit unique1d(x)
10 loops, best of 3: 13.2 ms per loop

For list l:

In [196]: timeit unique_norbert(l)
10 loops, best of 3: 29 ms per loop
In [197]: timeit unique_alan(l)  # with set instead of dict
10 loops, best of 3: 14.5 ms per loop
In [193]: timeit unique(l)
10 loops, best of 3: 29.6 ms per loop

Note: in Norbert's function, setting sort=False for flattenable objects returns a sorted array anyway. So I'd suggest removing the sort keyword: sort if the datatype is sortable, and don't sort if it's not.

David
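[If the sort keyword goes away as suggested here, the combined function might look roughly like the following. This is my guess at the shape of the thread's consensus (try the sort path, fall back on a set for unsortable input) - not the actual patch, and unique_patched is a hypothetical name.]

```python
import numpy as np

def unique_patched(inseq):
    try:
        # Sortable input: sort, then keep the first element of each run.
        tmp = np.asarray(inseq).flatten()
        tmp.sort()
        idx = np.concatenate(([True], tmp[1:] != tmp[:-1]))
        return tmp[idx]
    except TypeError:
        # Elements that cannot be ordered: fall back on a set, unsorted.
        return np.asarray(list(set(inseq)))

print(unique_patched([3, 1, 2, 1, 3]))  # -> [1 2 3]
```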
Hi, I attached a patch for unique (with a test case) based on Norbert's suggestion. I removed the sort keyword since iterable arrays would be sorted anyway. The function uses a python set, and I was wondering if it is ok to assume that everyone running numpy has a python version >= 2.3 ? David
Hey David
-----Original Message-----
From: numpy-discussion-bounces@lists.sourceforge.net [mailto:numpy-discussion-bounces@lists.sourceforge.net] On Behalf Of David Huard
Sent: 17 July 2006 16:11
To: numpy-discussion@lists.sourceforge.net
Subject: Re: [Numpy-discussion] unique() should return a sorted array
Hi,
I attached a patch for unique (with a test case) based on Norbert's suggestion. I removed the sort keyword since iterable arrays would be sorted anyway. The function uses a python set, and I was wondering if it is ok to assume that everyone running numpy has a python version >= 2.3 ?
I think it has been discussed on the list that Python >= 2.3 is assumed. However, according to the Python documentation, the builtin set type is new in Python 2.4. Regards, Albert
Albert Strasheim wrote:
I think it has been discussed on the list that Python >= 2.3 is assumed. However, according to the Python documentation, the builtin set type is new in Python 2.4.
Sets are available in Python 2.3 as part of the sets module, so it is possible. However, AFAICT the provided patch does not use the module and so will need to be adapted for use in 2.3.

"You see stars that clear have been dead for years / But the idea just lives on..." -- Bright Eyes
James> Sets are available in python 2.3 as part of the sets module so it
James> is possible. However afaict the provided patch does not use the
James> module and so will need to be adapted for use in 2.3.

I got so tired of the ugly test for the set builtin during our 2.3-to-2.4 transition that I finally just added

if sys.hexversion < 0x2040000:
    from sets import Set as set
    import __builtin__
    __builtin__.set = set

to sitecustomize.py. I'm not suggesting that scipy's installer should do the same, but the change worked for us.

Skip
participants (10)
- Alan G Isaac
- Albert Strasheim
- David Huard
- James Graham
- Norbert Nemec
- Robert Cimrman
- Robert Kern
- Sasha
- skip＠pobox.com
- Tim Hochberg