On Wed, Aug 25, 2010 at 9:00 AM, John Hunter <jdh2358@gmail.com> wrote:
Suppose I have an ordered list/array of numbers, and I want to split
them into N chunks such that no value appears in more than one chunk
and the data are split as evenly as possible (e.g., the standard
deviation of the chunk lengths is minimized, or some similar
criterion).  Context: I am trying to do a quintile analysis on some
data, and np.percentile doesn't behave like I want because more than
20% of my data equals 1, so 1 lands in both the first and second
quintiles.  I want to avoid this -- I'd rather have uneven counts in
my quintiles than have the same value show up in multiple quintiles,
but I'd like the counts to be as even as possible.

Here is some sample code that illustrates my problem:

In [178]: run ~/test

tile i=1 range=[1.00, 1.00), count=0
tile i=2 range=[1.00, 3.00), count=79
tile i=3 range=[3.00, 4.60), count=42
tile i=4 range=[4.60, 11.00), count=39
tile i=5 range=[11.00, 43.00), count=41


import numpy as np

x = np.array([  2.,   3.,   4.,   5.,   1.,   2.,   1.,   1.,   1.,   2.,   3.,
        1.,   2.,   3.,   1.,   2.,   3.,   1.,   2.,   3.,   4.,   1.,
        1.,   2.,   3.,   2.,   2.,   3.,   4.,   5.,   1.,   2.,   3.,
        4.,   5.,   6.,   7.,   1.,   1.,   2.,   3.,   4.,   5.,   6.,
        7.,   1.,   2.,   3.,   1.,   2.,   1.,   2.,   3.,   1.,   2.,
        4.,   1.,   2.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,
        3.,   1.,   1.,   1.,   1.,   1.,   1.,   2.,   1.,   2.,   3.,
        1.,   2.,   3.,   1.,   1.,   1.,   2.,   3.,   4.,   5.,   6.,
        7.,   8.,   9.,  10.,   1.,   1.,   2.,   3.,   1.,   2.,   3.,
        4.,   5.,   6.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,
        9.,  10.,  11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,
       20.,  21.,  22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,
       31.,  32.,  33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,
       42.,  43.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   1.,   2.,
        3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,
       14.,  15.,  16.,  17.,  18.,  19.,   1.,   2.,   1.,   2.,   3.,
        4.,   5.,   6.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,
        3.,   4.,   1.,   2.,   3.,   4.,   5.,   6.,   1.,   2.,   3.,
        1.,   2.,   1.,   2.])


# the 0th and 20th percentiles are both 1.0, so the first tile is empty
tiles = np.percentile(x, (0, 20, 40, 60, 80, 100))

print
for i in range(1, len(tiles)):
    xmin, xmax = tiles[i-1], tiles[i]
    # half-open intervals; note the final tile excludes the maximum
    print 'tile i=%d range=[%.2f, %.2f), count=%d' % (
        i, xmin, xmax, ((x >= xmin) & (x < xmax)).sum())
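
One greedy workaround (a minimal sketch, not a polished solution, and
the even_disjoint_tiles helper name is just illustrative): sort the
data, locate the boundaries between runs of equal values -- cutting
only at such boundaries guarantees disjoint tiles -- and snap each
ideal quintile cut to the nearest boundary:

import numpy as np

def even_disjoint_tiles(x, n_tiles=5):
    # illustrative helper, not a numpy function; assumes x has at
    # least two distinct values
    xs = np.sort(x)
    n = len(xs)
    # indices where the sorted value changes; cutting only here can
    # never split a run of equal values across two tiles
    boundaries = np.nonzero(np.diff(xs))[0] + 1
    edges = [0]
    for k in range(1, n_tiles):
        target = k * n // n_tiles   # ideal cut for perfectly even tiles
        j = boundaries[np.argmin(np.abs(boundaries - target))]
        if j > edges[-1]:           # skip duplicate cuts; a dominant
            edges.append(j)         # value may leave us fewer tiles
    edges.append(n)
    return xs, edges

xs, edges = even_disjoint_tiles(x)
for i in range(1, len(edges)):
    lo, hi = edges[i - 1], edges[i]
    print('tile i=%d range=[%.2f, %.2f], count=%d'
          % (i, xs[lo], xs[hi - 1], hi - lo))

A globally optimal split (e.g., minimizing the std dev of the counts)
would take a small dynamic program over the run lengths, but the
greedy snap is usually close for data like this.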

Just a crazy thought, but maybe k-means clustering is what you are looking for?  If you know the number of bins ahead of time, you can let k-means group the values automatically.  The ones will all fall into the same cluster (and so will any other duplicated values).  If you sort the data first, the behavior should be consistent.
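
Something along these lines might work (a rough sketch using
scipy.cluster.vq.kmeans2; the reshape and the cluster count of 5 are
just assumptions for the quintile case, and I haven't run it against
the data above):

import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(0)            # make the initialization reproducible
data = x.reshape(-1, 1)      # kmeans2 wants a 2-D observations array
centroids, labels = kmeans2(data, 5, minit='points')

for i in range(5):
    members = x[labels == i]
    if members.size:         # a cluster can occasionally come up empty
        print('cluster %d: range=[%.2f, %.2f], count=%d'
              % (i, members.min(), members.max(), len(members)))

Since every copy of a value maps to the same nearest centroid, ties
can never straddle two clusters, though the cluster counts can end up
quite uneven.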

I once used k-means to help "snap" height data from multiple observations onto a common set of heights.  The observations had many zero-height values, but the rest of the values had few repeats.  The approach worked great in our particular application.

Ben Root