On Wed, Aug 25, 2010 at 10:32 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
On Wed, Aug 25, 2010 at 7:19 AM, John Hunter <jdh2358@gmail.com> wrote:
On Wed, Aug 25, 2010 at 9:10 AM, Keith Goodman <kwgoodman@gmail.com> wrote:
How about using the percentiles of np.unique(x)? That takes care of the first constraint (no overlap) but ignores the second constraint (min std of cluster size).
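A minimal sketch of the percentile-of-uniques idea, for illustration only. The function name `unique_percentile_bins`, the `n_bins` parameter, and the sample data are assumptions, not something from the thread. Bin edges are taken from percentiles of `np.unique(x)`, so equal values can never land in different bins, but nothing controls how many items each bin ends up holding.

```python
import numpy as np

def unique_percentile_bins(x, n_bins=5):
    """Label each element of x with a bin index in 0..n_bins-1.

    Edges are percentiles of the unique values, not of x itself,
    so repeated values do not pull the edges toward themselves.
    """
    x = np.asarray(x)
    u = np.unique(x)
    # Interior percentiles only, e.g. 20, 40, 60, 80 for n_bins=5.
    q = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.percentile(u, q)
    # searchsorted maps equal values to the same label (no overlap).
    labels = np.searchsorted(edges, x, side="left")
    return labels, edges

# Hypothetical data with repeats at both ends.
x = np.array([1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10])
labels, edges = unique_percentile_bins(x)
sizes = np.bincount(labels, minlength=5)
print(edges)
print(sizes)  # bin sizes are uneven because repeats are ignored
```

The uneven `sizes` are the second constraint being ignored, as noted above.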
Well, I need the 2nd constraint....
Both can't be hard constraints, so I guess the first step is to define a utility function that quantifies the trade-off between the two. Would it make sense to then start from the percentile(unique(x), ...) solution and come up with a heuristic that moves an item with lots of repeats from a large-length quintile to a short-length quintile, and then accept the move if it improves the utility? Or try moving each item to each of the other 4 quintiles and make the move that improves the utility the most. Then repeat until the utility doesn't improve. But I guess I'm just stating the obvious and you are looking for something less obvious and more clever.
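One way the greedy idea could look, as a sketch under stated assumptions rather than a worked-out solution. Assumptions not in the thread: the utility is simply the negative standard deviation of the bin sizes, bins stay contiguous in sorted order, and a "move" shifts one boundary by a single unique value (which moves that value, with all its repeats, into the neighbouring bin). This is narrower than trying every item in every other quintile. The names `size_std` and `greedy_cuts` are hypothetical.

```python
import numpy as np

def size_std(counts, cuts):
    """Std of bin sizes when per-unique-value counts are split at cuts."""
    sizes = np.add.reduceat(counts, np.concatenate(([0], cuts)))
    return sizes.std()

def greedy_cuts(x, n_bins=5, max_iter=1000):
    u, counts = np.unique(x, return_counts=True)
    if len(u) < n_bins:
        raise ValueError("need at least n_bins unique values")
    # Start with roughly equal numbers of unique values per bin,
    # which approximates the percentile(unique(x), ...) solution.
    cuts = (np.arange(1, n_bins) * len(u)) // n_bins
    best = size_std(counts, cuts)
    for _ in range(max_iter):
        best_move = None
        for i in range(len(cuts)):
            for step in (-1, 1):
                trial = cuts.copy()
                trial[i] += step
                lo = trial[i - 1] if i > 0 else 0
                hi = trial[i + 1] if i < len(cuts) - 1 else len(u)
                if not (lo < trial[i] < hi):
                    continue  # would leave a bin empty
                score = size_std(counts, trial)
                if score < best - 1e-12:
                    best, best_move = score, trial
        if best_move is None:
            break  # no single move improves the utility
        cuts = best_move
    return u, counts, cuts, best

# Hypothetical data: heavy repeats at both ends.
x = np.repeat(np.arange(10), [40, 5, 5, 5, 5, 5, 5, 5, 5, 20])
u, counts, cuts, final_std = greedy_cuts(x)
start_cuts = (np.arange(1, 5) * len(u)) // 5
initial_std = size_std(counts, start_cuts)
# Bin label per element: number of cuts at or below its unique-value index.
labels = np.searchsorted(cuts, np.searchsorted(u, x), side="right")
print(initial_std, final_std, np.bincount(labels, minlength=5))
```

Because each accepted move strictly lowers the standard deviation, the loop stops at a local optimum. It is not guaranteed to find the best possible split.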
What I'm doing for some statistical analysis, e.g. a chisquare test with integer data (discrete random variable): use np.bincount to get the full count, or use the theoretical pdf, then loop over the integers (raw bins) and merge them to satisfy the constraints. The constraints that I'm using are equal bin sizes in one version and minimum bin sizes in the second version. I haven't found anything other than the loop over the uniques, but I think there was some discussion on this some time ago on a mailing list.

Josef
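A rough sketch of the minimum-bin-size variant of the merge loop described above, assuming non-negative integer data. The function name `merge_min_count`, the left-to-right merge order, and the choice to fold any leftover tail into the last bin are illustrative assumptions, not the code actually used.

```python
import numpy as np

def merge_min_count(x, min_count):
    """Merge adjacent integer bins until each has at least min_count items.

    Returns (edges, sizes): merged bin k covers the integers
    edges[k] <= value < edges[k + 1].
    """
    counts = np.bincount(x)  # raw bins: one per integer 0..max(x)
    edges = [0]
    sizes = []
    running = 0
    for value, c in enumerate(counts):
        running += c
        if running >= min_count:
            # Close the current merged bin after this integer.
            sizes.append(running)
            edges.append(value + 1)
            running = 0
    if running > 0:
        if sizes:
            # Leftover tail is too small: fold it into the last bin.
            sizes[-1] += running
            edges[-1] = len(counts)
        else:
            # Not enough data for even one bin of min_count.
            sizes.append(running)
            edges.append(len(counts))
    return np.array(edges), np.array(sizes)

# Hypothetical sample of a discrete random variable.
x = np.array([0, 0, 0, 1, 2, 2, 3, 5, 5, 5, 5, 6])
edges, sizes = merge_min_count(x, min_count=4)
print(edges)  # [0 2 7]
print(sizes)  # [4 8]
```

For a chisquare test, the same edges would need to be applied to the expected counts from the theoretical pdf so that observed and expected bins match.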