Splitting an ordered list as evenly as possible
Suppose I have an ordered list/array of numbers, and I want to split them into N chunks, such that the intersection of any chunk with any other is empty and the data is split as evenly as possible (e.g. the std dev of the lengths of the chunks is minimized, or some other such criterion).

Context: I am trying to do a quintile analysis on some data, and np.percentile doesn't behave like I want because more than 20% of my data equals 1, so 1 is in the first and second quintiles. I want to avoid this -- I'd rather have uneven counts in my quintiles than have the same value show up in multiple quintiles, but I'd like the counts to be as even as possible.

Here is some sample code that illustrates my problem:

In [178]: run ~/test

tile i=1 range=[1.00, 1.00), count=0
tile i=2 range=[1.00, 3.00), count=79
tile i=3 range=[3.00, 4.60), count=42
tile i=4 range=[4.60, 11.00), count=39
tile i=5 range=[11.00, 43.00), count=41

import numpy as np

x = np.array([
    2., 3., 4., 5., 1., 2., 1., 1., 1., 2., 3., 1., 2., 3.,
    1., 2., 3., 1., 2., 3., 4., 1., 1., 2., 3., 2., 2., 3.,
    4., 5., 1., 2., 3., 4., 5., 6., 7., 1., 1., 2., 3., 4.,
    5., 6., 7., 1., 2., 3., 1., 2., 1., 2., 3., 1., 2., 4.,
    1., 2., 1., 2., 3., 4., 5., 6., 1., 2., 3., 1., 1., 1.,
    1., 1., 1., 2., 1., 2., 3., 1., 2., 3., 1., 1., 1., 2.,
    3., 4., 5., 6., 7., 8., 9., 10., 1., 1., 2., 3., 1., 2.,
    3., 4., 5., 6., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
    11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24.,
    25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38.,
    39., 40., 41., 42., 43., 1., 2., 3., 4., 5., 6., 7., 1., 2.,
    3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16.,
    17., 18., 19., 1., 2., 1., 2., 3., 4., 5., 6., 1., 2., 3.,
    4., 5., 6., 1., 2., 3., 4., 1., 2., 3., 4., 5., 6., 1.,
    2., 3., 1., 2., 1., 2.])

tiles = np.percentile(x, (0, 20, 40, 60, 80, 100))

print
for i in range(1, len(tiles)):
    xmin, xmax = tiles[i-1], tiles[i]
    print 'tile i=%d range=[%.2f, %.2f), count=%d' % (
        i, xmin, xmax, ((x >= xmin) & (x < xmax)).sum())
On Wed, Aug 25, 2010 at 7:00 AM, John Hunter <jdh2358@gmail.com> wrote:
Suppose I have an ordered list/array of numbers, and I want to split them into N chunks, such that the intersection of any chunk with each other is empty and the data is split as evenly as possible (eg the std dev of the lengths of the chunks is minimized or some other such criterion).
How about using the percentiles of np.unique(x)? That takes care of the first constraint (no overlap) but ignores the second constraint (min std of cluster size).
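A minimal sketch of this suggestion in modern NumPy (the helper name is mine, not from the thread): taking percentiles of np.unique(x) collapses the duplicates to single points, so the resulting edges are distinct and no value can straddle two bins.

```python
import numpy as np

def unique_percentile_edges(x, n=5):
    # Quintile edges computed from the unique values: duplicated data
    # points collapse to one entry, so edges cannot repeat and a value
    # cannot appear in two bins. (Hypothetical helper, for illustration.)
    u = np.unique(x)  # sorted, duplicates removed
    return np.percentile(u, np.linspace(0, 100, n + 1))

# Half the data equals 1, yet the edges stay strictly increasing:
x = np.array([1.0] * 50 + list(range(2, 52)), dtype=float)
edges = unique_percentile_edges(x)
```

As Keith notes, this ignores the evenness of the resulting bin counts: the bin containing a heavily repeated value can be much larger than the others.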
On Wed, Aug 25, 2010 at 7:19 AM, John Hunter <jdh2358@gmail.com> wrote:
On Wed, Aug 25, 2010 at 9:10 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
How about using the percentiles of np.unique(x)? That takes care of the first constraint (no overlap) but ignores the second constraint (min std of cluster size).
Well, I need the 2nd constraint....
Both can't be hard constraints, so I guess the first step is to define a utility function that quantifies the trade-off between the two. Would it make sense to start from the percentile(unique(x), ...) solution and come up with a heuristic that moves an item with lots of repeats from a long quintile to a short quintile, accepting the move if it improves the utility? Or try moving each item to each of the other 4 quintiles and make the move that improves the utility the most, then repeat until the utility doesn't improve. But I guess I'm just stating the obvious and you are looking for something less obvious and more clever.
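For ordered data the groups must stay contiguous, so the only legal "moves" shift a group boundary by one unique value. A plain-NumPy sketch of that greedy scheme, with the std dev of the group counts as the utility (all names here are hypothetical, not from the thread):

```python
import numpy as np

def even_partition(x, n=5):
    # Greedy sketch of the heuristic above: start from equal-sized runs
    # of unique values, then shift interior boundaries one unique value
    # at a time whenever that lowers the std dev of the group counts.
    vals, counts = np.unique(x, return_counts=True)
    bounds = list(np.linspace(0, len(vals), n + 1).astype(int))

    def sizes(b):
        return [counts[b[i]:b[i + 1]].sum() for i in range(n)]

    improved = True
    while improved:
        improved = False
        for j in range(1, n):          # try shifting each interior boundary
            for step in (-1, 1):
                b = list(bounds)
                b[j] += step
                if b[j - 1] < b[j] < b[j + 1]:   # keep every group non-empty
                    if np.std(sizes(b)) < np.std(sizes(bounds)):
                        bounds = b
                        improved = True
    return bounds, sizes(bounds)

# value 1 appears ten times, value 2 nine times, ..., value 10 once
x = np.repeat(np.arange(1, 11), np.arange(10, 0, -1))
bounds, group_sizes = even_partition(x, 5)
```

Each accepted move strictly lowers the utility, so the loop terminates; like any greedy search it can stop at a local optimum rather than the global one.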
On Wed, Aug 25, 2010 at 10:32 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
Here's what I'm doing for some statistical analysis, e.g. a chisquare test with integer data (a discrete random variable): use np.bincount to get the full count (or use the theoretical pdf), then loop over the integers (raw bins) and merge them to satisfy the constraints. The constraints I'm using are equal bin sizes in one version and minimum bin sizes in the second version. I haven't found anything other than the loop over the uniques, but I think there was some discussion of this a while ago on a mailing list.

Josef
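A compact sketch of the merge step in the minimum-bin-size version (the function name is mine, not from the thread): sweep the raw bincount bins left to right and close a merged bin whenever it reaches the minimum.

```python
import numpy as np

def merge_bins_min_size(x, min_size):
    # Count each integer value, then merge adjacent raw bins until
    # every merged bin holds at least min_size items. (Hypothetical
    # helper sketching the approach described above.)
    counts = np.bincount(x)
    bins, acc, start = [], 0, 0
    for i, c in enumerate(counts):
        acc += c
        if acc >= min_size:
            bins.append((start, i, acc))  # (low value, high value, count)
            start, acc = i + 1, 0
    if acc:  # fold any undersized leftover into the last bin
        if bins:
            lo, hi, n = bins.pop()
            bins.append((lo, len(counts) - 1, n + acc))
        else:
            bins.append((0, len(counts) - 1, acc))
    return bins

bins = merge_bins_min_size(np.array([0, 0, 0, 1, 2, 2, 3, 3, 3, 3]), min_size=3)
```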
On 08/25/2010 09:44 AM, josef.pktd@gmail.com wrote:
As others have indicated, you have to work with the unique values as well as the frequencies.
Hopefully you can determine what I mean from the code below and modify it as needed. It is brute force, but it provides a couple of options, as the following output indicates:

3 [(2, 44), (3, 35), (5, 42), (13, 43), (43, 38)]
4 [(2, 44), (3, 35), (5, 42), (14, 45), (43, 36)]
5 [(2, 44), (3, 35), (5, 42), (14, 45), (43, 36)]
6 [(2, 44), (3, 35), (5, 42), (15, 47), (43, 34)]
7 [(2, 44), (3, 35), (5, 42), (15, 47), (43, 34)]
8 [(2, 44), (3, 35), (5, 42), (16, 49), (43, 32)]
9 [(2, 44), (3, 35), (5, 42), (16, 49), (43, 32)]

Some notes:
1) For this example, you need an average of 41 per group (202 elements divided by 5). But that will be impossible because the value '1' has a frequency of 44, and the sum of the frequencies of '2' and '3' is 61. This means we need some way to allow slight increases in sizes -- I use the variable eval, which is the expected count plus some threshold (berror).
2) If you have floats then you cannot use np.bincount directly. So if these are integers, use them directly; otherwise use some function to map them into a suitable integer range (such as np.ceil, or work with 10*x, etc.).

Bruce

binx = np.bincount(x.astype(int))
for berror in range(10):  # loop over a range of possible variations in the counts
    eval = berror + np.ceil(binx.sum()/5.0)  # find a count threshold
    count = 0
    quintile = []
    for i in range(binx.shape[0]):  # loop over the frequencies to determine which bin
        if count + binx[i] > eval:  # if the bin overflows then start a new one
            quintile.append((i, count))
            count = binx[i]
        else:  # otherwise keep adding into the current bin
            count += binx[i]
    quintile.append((i, count))  # add the last bin
    if len(quintile) == 5:  # we must have five bins, otherwise that pass is useless;
        # you can also apply other criteria here
        print berror, quintile
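The float restriction can also be sidestepped with np.unique(return_counts=True), which works for any dtype. A sketch of the same greedy sweep in that form (all names here are mine; eval is renamed because it shadows a Python built-in):

```python
import numpy as np

def greedy_groups(x, n=5, slack=0):
    # Same greedy sweep as the loop above, but np.unique with
    # return_counts replaces np.bincount, so x may contain floats.
    vals, counts = np.unique(x, return_counts=True)
    threshold = slack + np.ceil(counts.sum() / float(n))  # renamed from eval
    groups, count, upper = [], 0, vals[0]
    for v, c in zip(vals, counts):
        if count + c > threshold and count > 0:  # group is full: close it
            groups.append((upper, count))        # (last value in group, size)
            count = c
        else:
            count += c
        upper = v
    groups.append((upper, count))  # close the final group
    return groups

groups = greedy_groups(np.array([1.5] * 4 + [2.5] * 4 + [3.5] * 4), n=3)
```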
On Wed, Aug 25, 2010 at 11:44 AM, Bruce Southey <bsouthey@gmail.com> wrote:
I don't think you can assume anything about the minimum number of bins in general. For example, my similar code needed to work also for binary distributions with at most two unique values and bins; a degenerate distribution would have only a single value and bin. Also, eval is a built-in function, so it's better not to use it as a variable name.

Josef
_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev
On 8/25/10 8:00 AM, John Hunter wrote:
John: This is a problem we have quite often analyzing precip data in arid regions -- most of the time it just doesn't rain, so the distribution has a delta-function peak at zero. There is no good way around it. Sometimes people split the sample into rain and no-rain, and treat the two distributions separately.

-Jeff

--
Jeffrey S. Whitaker
Meteorologist
NOAA/OAR/PSD R/PSD1
325 Broadway
Boulder, CO, USA 80303-3328
Phone:  (303) 497-6313
FAX:    (303) 497-6449
Email:  Jeffrey.S.Whitaker@noaa.gov
Office: Skaggs Research Cntr 1D-113
Web:    http://tinyurl.com/5telg
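The split Jeff describes takes only a few lines (modern NumPy; the function name is mine): peel off the point mass at zero and quintile only the nonzero part.

```python
import numpy as np

def split_zero_inflated(x):
    # Treat the point mass at zero separately, as described above, and
    # compute quintile edges from the nonzero ("rain") part only.
    wet = x[x > 0]
    dry_count = int((x == 0).sum())
    edges = np.percentile(wet, (0, 20, 40, 60, 80, 100))
    return dry_count, edges

dry, edges = split_zero_inflated(np.array([0.0] * 10 + [1.0, 2.0, 3.0, 4.0, 5.0]))
```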
On Wed, Aug 25, 2010 at 9:00 AM, John Hunter <jdh2358@gmail.com> wrote:
Just a crazy thought, but maybe kmeans clustering is what you are looking for? If you know the number of bins ahead of time, you can let kmeans group things automatically. The ones will all fall into the same cluster (and any other duplicated values will, too). If you sort the data first, the behavior should be consistent.

I once used kmeans to help "snap" height data from multiple observations onto a common set of heights. The obs would have many zero height values, but the rest of the values would not have many repeats. This approach worked great in our particular application.

Ben Root
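A plain-NumPy sketch of the k-means idea (hypothetical names; scipy.cluster.vq or scikit-learn would do the same job): because assignment depends only on a point's value, all copies of a duplicated value always land in the same cluster.

```python
import numpy as np

def kmeans_1d(x, k=5, iters=100, seed=0):
    # Lloyd's algorithm in one dimension. Equal values are always
    # assigned to the same nearest centroid, so a repeated value can
    # never be split across two groups.
    rng = np.random.default_rng(seed)
    centers = np.sort(rng.choice(np.unique(x), size=k, replace=False))
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == j].mean() if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

# many repeated ones plus two small clumps of distinct values
x = np.array([1.0] * 10 + [5.0, 6.0, 7.0, 20.0, 21.0, 22.0])
labels, centers = kmeans_1d(x, k=3)
```

Note that the final grouping depends on the initialization, so for reproducible bins you would fix the seed (or seed from sorted data, as Ben suggests).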
participants (6)
- Benjamin Root
- Bruce Southey
- Jeff Whitaker
- John Hunter
- josef.pktd@gmail.com
- Keith Goodman