[Numpy-discussion] Robust Sorting of Points

Brett Olsen brett.olsen at gmail.com
Mon Oct 28 15:21:15 EDT 2013


Here's some code implementing the "replace similar values with an
arbitrarily chosen one" approach (in this case the smallest of the similar
values).  I didn't see any way to do this cleverly with strides, so I just
did a simple loop.  It's about 100 times slower than a plain np.lexsort in
pure Python, or a bit under 10 times slower if you're willing to use a bit
of Cython.  Not sure if this is good enough for your purposes.  I imagine
you could go a bit faster if you were willing to combine the per-column
sorts into the final lexicographic order by hand (since you've already done
the separate sorting of each subarray for value replacement purposes)
instead of passing that off to np.lexsort.

Note that this approach will only work if your points are not only
well-separated in space but also either well-separated or identical in each
dimension.  It's OK for points to share the same, say, x value, but if
points have close x values before the noise is added, the noise can move
intermediate points around in the sort order.  It works well with the
gridded data I used as a sample, but if you're, say, generating random
points, this could be a problem:

point 1 is (1, 0, 1e-12)
point 2 is (0, 1, 0)

These are well separated.  The algorithm will pool those z values, the tie
then falls through to the y values, and point 1 is reported as coming
before point 2.  Unless you get jitter like this:

point 1: (1, 0, 1.5e-12)
point 2: (0, 1, -0.5e-12)

Now they won't be pooled any more and we'll get 2 as coming before 1.
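
To make that failure mode concrete, here's a minimal self-contained sketch
(separate from the code below) that pools each coordinate of the two example
points and then lexsorts them; it treats a gap of exactly epsilon as
"similar" so that the borderline 1e-12 gap above pools as described:

import numpy as np

def pool(col, epsilon=1e-12):
    #Replace values whose gap is at most epsilon with the smaller one
    idx = np.argsort(col)
    out = col.copy()
    for i in range(1, len(out)):
        if out[idx[i]] - out[idx[i-1]] <= epsilon:
            out[idx[i]] = out[idx[i-1]]
    return out

def order(points):
    #Pool each coordinate, then lexsort; the last key (z) is primary
    return np.lexsort((pool(points[:,0]),
                       pool(points[:,1]),
                       pool(points[:,2])))

clean = np.array([[1.0, 0.0, 1e-12],
                  [0.0, 1.0, 0.0]])
jittered = np.array([[1.0, 0.0, 1.5e-12],
                     [0.0, 1.0, -0.5e-12]])

print order(clean)     #[0 1]: z values pool, so the tie falls through to y
print order(jittered)  #[1 0]: the z gap is now 2e-12, so no pooling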

Anyway, here's the code:

In [1]:

import numpy as np
In [2]:

def gen_grid(n, d):
    #Generate a bunch of grid points, n in each dimension of spacing d
    vals = np.linspace(0, (n-1)*d, n)
    x, y, z = np.meshgrid(vals, vals, vals)
    grid = np.empty((n**3, 3))
    grid[:,0] = x.flatten()
    grid[:,1] = y.flatten()
    grid[:,2] = z.flatten()
    return grid

def jitter(array, epsilon=1e-12):
    #Add random jitter from a uniform distribution of width epsilon
    return array + np.random.random(array.shape) * epsilon - epsilon / 2
In [3]:

grid = gen_grid(4, 0.1)
print np.lexsort(grid.T)
print np.lexsort(jitter(grid.T))
[ 0  4  8 12 16 20 24 28 32 36 40 44 48 52 56 60  1  5  9 13 17 21 25 29 33
 37 41 45 49 53 57 61  2  6 10 14 18 22 26 30 34 38 42 46 50 54 58 62  3  7
 11 15 19 23 27 31 35 39 43 47 51 55 59 63]
[60  4 48 32 40 12 36 28 44 56 16  8 24  0 52 20 45 25 49  1 53 29  9 33  5
 61 41 37 17 13 21 57 22 50 18 10  2 62 58 54  6 34 26 42 38 46 14 30  3 11
 55 63 27 15 35 43 31 39  7 59 47 23 51 19]
In [4]:

def pool_values(A, epsilon=1e-12):
    #Replace values that fall within epsilon of each other with the
    #smallest value in the group (modifies A in place)
    idx = np.argsort(A)
    for i in range(1, len(A)):
        if A[idx[i]] - A[idx[i-1]] < epsilon:
            A[idx[i]] = A[idx[i-1]]
    return A

def stable_sort(grid):
    #Pool each coordinate separately, then lexsort; the last key (z)
    #is the primary sort key
    return np.lexsort((pool_values(grid[:,0]),
                       pool_values(grid[:,1]),
                       pool_values(grid[:,2])))
In [5]:

print stable_sort(grid)
print stable_sort(jitter(grid))
[ 0  4  8 12 16 20 24 28 32 36 40 44 48 52 56 60  1  5  9 13 17 21 25 29 33
 37 41 45 49 53 57 61  2  6 10 14 18 22 26 30 34 38 42 46 50 54 58 62  3  7
 11 15 19 23 27 31 35 39 43 47 51 55 59 63]
[ 0  4  8 12 16 20 24 28 32 36 40 44 48 52 56 60  1  5  9 13 17 21 25 29 33
 37 41 45 49 53 57 61  2  6 10 14 18 22 26 30 34 38 42 46 50 54 58 62  3  7
 11 15 19 23 27 31 35 39 43 47 51 55 59 63]
In [6]:

%timeit np.lexsort(jitter(grid.T))
100000 loops, best of 3: 10.4 µs per loop
In [7]:

%timeit stable_sort(jitter(grid))
1000 loops, best of 3: 1.39 ms per loop
In [8]:

%load_ext cythonmagic
In [12]:

%%cython
import numpy as np
cimport numpy as np

cdef fast_pool_values(double[:] A, double epsilon=1e-12):
    #Same pooling as pool_values above, with typed memoryviews for speed
    cdef long[:] idx = np.argsort(A)
    cdef int i
    for i in range(1, len(A)):
        if A[idx[i]] - A[idx[i-1]] < epsilon:
            A[idx[i]] = A[idx[i-1]]
    return A

def fast_stable_sort(grid):
    #Pool each coordinate with the Cython helper, then lexsort as before
    return np.lexsort((fast_pool_values(grid[:,0]),
                       fast_pool_values(grid[:,1]),
                       fast_pool_values(grid[:,2])))
In [10]:

%timeit np.lexsort(jitter(grid.T))
10000 loops, best of 3: 38.5 µs per loop
In [13]:

%timeit fast_stable_sort(jitter(grid))
1000 loops, best of 3: 309 µs per loop
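
Incidentally, the pooling step itself could be vectorized by grouping on the
gaps between consecutive sorted values.  The sketch below is only equivalent
to pool_values under the stated assumption (coordinates either identical or
well separated), since it groups on consecutive gaps rather than on the
distance to each group's minimum, and it returns a copy instead of modifying
the input in place:

import numpy as np

def pool_values_vec(A, epsilon=1e-12):
    #Sort, mark where the gap to the previous sorted value reaches epsilon
    #(the start of a new run), and broadcast each run's first (smallest)
    #value back to every member of the run
    idx = np.argsort(A)
    s = A[idx]
    starts = np.concatenate(([True], np.diff(s) >= epsilon))
    pooled = s[starts][np.cumsum(starts) - 1]
    out = np.empty_like(A)
    out[idx] = pooled
    return out

Swapping this in for pool_values in stable_sort should give the same ordering
on the gridded example and avoids the Python-level loop, though I haven't
timed it here against the Cython version.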


On Sun, Oct 27, 2013 at 5:41 PM, Freddie Witherden <freddie at witherden.org> wrote:

> On 27/10/13 21:05, Jonathan March wrote:
> > If an "almost always works" solution is good enough, then sort on the
> > distance to some fixed random point that is in the vicinity of your N
> > points.
>
> I had considered this.  Unfortunately I need a solution which really
> does always work.
>
> The only pure-Python solution I can envision -- at the moment anyway --
> is to do some cleverness with the output of np.unique to identify
> similar values and replace them with an arbitrarily chosen one.  This
> should permit the output to be passed to np.lexsort without issue.
>
> Regards, Freddie.
>