sampling based on running sums - NumPy-Discussion

27 Jun 2008

I would like to find the sample points where the running sum of some
vector reaches or exceeds some threshold. At those points I want to
collect all the data in the vector since the last time the criterion
was met and compute some stats on it.  For example, in Python:

    tot = 0.
    xs = []
    ys = []

    samples1 = []
    for thisx, thisy in zip(x, y):
        tot += thisx
        xs.append(thisx)
        ys.append(thisy)
        if tot>=threshold:
            samples1.append(func(xs,ys))
            tot = 0.
            xs = []
            ys = []

The following is close in NumPy:

    sx = np.cumsum(x)
    n = (sx/threshold).astype(int)
    ind = np.nonzero(np.diff(n)>0)[0]+1

    lasti = 0
    samples2 = []
    for i in ind:
        xs = x[lasti:i+1]
        ys = y[lasti:i+1]
        samples2.append(func(xs, ys))
        lasti = i + 1  # start the next window after the crossing sample

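But the sample points in ind do not guarantee that the values of x
between consecutive sample points sum to at least threshold, due to
truncation error: the cumulative sum is never reset, so any overshoot
at one sample point counts toward the next.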
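
To make the mismatch concrete, here is a small sketch with made-up data
(the values of x and threshold, and the helper names loop_cut_points,
cumsum_cut_points and segment_sums, are chosen only for illustration).
It compares the cut points from the loop with reset against those from
the cumsum version:

```python
import numpy as np

# Hypothetical data, chosen only to expose the mismatch.
x = np.array([3.0, 3.0, 3.0, 3.0, 3.0, 3.0])
threshold = 5.0

def loop_cut_points(x, threshold):
    """Indices where the running sum, reset after each hit, reaches threshold."""
    cuts = []
    tot = 0.0
    for i, xi in enumerate(x):
        tot += xi
        if tot >= threshold:
            cuts.append(i)
            tot = 0.0
    return np.array(cuts, dtype=int)

def cumsum_cut_points(x, threshold):
    """Indices where int(cumsum / threshold) increases (no reset)."""
    sx = np.cumsum(x)
    n = (sx / threshold).astype(int)
    return np.nonzero(np.diff(n) > 0)[0] + 1

def segment_sums(x, ind):
    """Sum of x over each window ending (inclusively) at a cut point."""
    starts = np.concatenate(([0], ind[:-1] + 1))
    return np.array([x[s:e + 1].sum() for s, e in zip(starts, ind)])

loop_ind = loop_cut_points(x, threshold)
cumsum_ind = cumsum_cut_points(x, threshold)

print(loop_ind)                       # [1 3 5]
print(cumsum_ind)                     # [1 3 4]
print(segment_sums(x, loop_ind))      # [6. 6. 6.]
print(segment_sums(x, cumsum_ind))    # [6. 6. 3.]
```

With these numbers the last window from the cumsum version sums to 3,
which is below the threshold of 5, while every window from the loop
sums to 6.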

What is a good numpy way to do this?

Thanks,
JDH

John Hunter
