indices of consecutive elements
I'm trying to figure out a way to return the indices of the start and end of a run of consecutive elements that match some condition, but only if there are more than a certain number. For example, take the array (with indices in comment for clarity): #0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0] I want to find the start and end indices of all runs of 1s with length of 4 or longer; so here the answer would be: [[2,5], [15,18]] Is there a reasonable way to do this without looping? I've been playing around with diff() and where() but without too much progress. Thanks, dan
Daniel, I coded a generic class that does what you want. It's not optimize, but at least should get you started. Let me know if you find it useful and if you find ways to tweak it... Cheers, P. _____ class Cluster(object): """ Groups consecutive data from an array according to a clustering condition. A cluster is defined as a group of consecutive values differing by at most the increment value. Missing values are **not** handled: the input sequence must therefore be free of missing values. Parameters ---------- darray : ndarray Input data array to clusterize. increment : {float}, optional Increment between two consecutive values to group. By default, use a value of 1. operator : {function}, optional Comparison operator for the definition of clusters. By default, use :func:`numpy.less_equal`. Attributes ---------- inishape Shape of the argument array (stored for resizing). inisize Size of the argument array. uniques : sequence List of unique cluster values, as they appear in chronological order. slices : sequence List of the slices corresponding to each cluster of data. starts : ndarray Array of the indices at which the clusters start. clustered : list List of clustered data. Examples -------- >>> A = [0, 0, 1, 2, 2, 2, 3, 4, 3, 4, 4, 4] >>> klust = cluster(A,0) >>> [list(_) for _ in klust.clustered] [[0, 0], [1], [2, 2, 2], [3], [4], [3], [4, 4, 4]] >>> klust.uniques array([0, 1, 2, 3, 4, 3, 4]) >>> x = [ 1.8, 1.3, 2.4, 1.2, 2.5, 3.9, 1. , 3.8, 4.2, 3.3, ... 1.2, 0.2, 0.9, 2.7, 2.4, 2.8, 2.7, 4.7, 4.2, 0.4] >>> Cluster(x,1).starts array([ 0, 2, 3, 4, 5, 6, 7, 10, 11, 13, 17, 19]) >>> Cluster(x,1.5).starts array([ 0, 6, 7, 10, 13, 17, 19]) >>> Cluster(x,2.5).starts array([ 0, 6, 7, 19]) >>> Cluster(x,2.5,greater).starts array([ 0, 1, 2, 3, 4, 5, 8, 9, 10, ... 11, 12, 13, 14, 15, 16, 17, 18]) >>> y = [ 0, -1, 0, 0, 0, 1, 1, -1, -1, -1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0] >>> Cluster(y,1).starts array([ 0, 1, 2, 5, 7, 10, 12, 16, 18]) """ def __init__(self,darray,increment=1,operator=np.less_equal): """ Initializes instance. Parameters ---------- darray : ndarray Input data array to clusterize. increment : {float}, optional Increment between two consecutive values to group. By default, use a value of 1. operator : {function}, optional Comparison operator for the definition of clusters. By default, use :func:`np.less_equal` """ if hasattr(darray,'mask') and darray.mask.any(): raise ma.MAError("Masked arrays should be filled prior clustering.") else: darray = np.asanyarray(darray) n = darray.size self.inishape = darray.shape self.inisize = darray.size clustercond = 1 - operator(np.absolute(np.diff(darray.ravel())), increment) sid = np.r_[[0,], np.arange(1,n).compress(clustercond), [n,]] slobj = np.asarray([slice(i,d) for (i,d) in np.broadcast(sid[:-1],sid[1:])]) # self.uniques = darray.ravel()[sid[:-1]] self.clustered = [darray[k] for k in slobj] self.sizes = np.asarray(np.diff(sid)) self.slices = slobj self.starts = sid[:-1] def markonsize(self,operator,sizethresh): """ Creates a **mask** for the clusters that do not meet a size requirement. Thus, outputs ``False`` if the size requirement is met, ``True`` otherwise. Parameters ---------- operator : function Comparison operator sizethresh : float Requirement for the sizes of the clusters """ resmask = np.empty(self.inisize, dtype=bool) resmask[:] = True # for k in self.slices.compress(operator(self.sizes,sizethresh)): for k in self.slices[operator(self.sizes,sizethresh)]: resmask[k] = False return resmask.reshape(self.inishape) def mark_greaterthan(self,sizemin): """ Shortcut for :meth:`markonsize(greater_equal,sizemin)`. Thus, the command outputs ``False`` for clusters larger than ``sizemin``, and ``True`` for clusters smaller than ``sizemin``. Parameters ---------- sizemin : int Minimum size of the clusters. See Also -------- :meth:`markonsize` Creates a **mask** for the clusters that do not meet a size requirement. """ return self.markonsize(np.greater_equal,sizemin) def grouped_slices(self): """ Returns a dictionary with the unique values of ``self`` as keys, and a list of slices for the corresponding values. See Also -------- :meth:`~Cluster.grouped_limits` that does the same thing """ # output = dict([(k,[]) for k in np.unique1d(self.uniques)]) for (k,v) in zip(self.uniques, self.slices): output[k].append(v) return output def grouped_limits(self): """ Returns a dictionary with the unique values of ``self`` as keys, and a list of tuples (starting index, ending index) for the corresponding values. See Also -------- :meth:`~Cluster.grouped_slices` """ output = dict([(k,[]) for k in np.unique1d(self.uniques)]) for (k,v) in zip(self.uniques, self.slices): output[k].append((v.start, v.stop)) for k in output: output[k] = np.array(output[k]) return output _____ On Dec 2, 2008, at 11:43 AM, Daniel Ashbrook wrote:
I'm trying to figure out a way to return the indices of the start and end of a run of consecutive elements that match some condition, but only if there are more than a certain number.
For example, take the array (with indices in comment for clarity):
#0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
I want to find the start and end indices of all runs of 1s with length of 4 or longer; so here the answer would be:
[[2,5], [15,18]]
Is there a reasonable way to do this without looping? I've been playing around with diff() and where() but without too much progress.
Thanks,
dan _______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
Wow, ask and ye shall receive way more than you expected! Thanks so much, Pierre - it's what I need: a=[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0] c = Cluster(a,0) z = zip(list(c.starts), [list(i) for i in c.clustered]) r = [(i,i+len(j)-1) for i,j in z if j[0] == 1 and len(j) >= 4] print(r) [(2, 5), (15, 18)] dan Pierre GM wrote:
Daniel, I coded a generic class that does what you want. It's not optimize, but at least should get you started. Let me know if you find it useful and if you find ways to tweak it... Cheers, P.
_____
class Cluster(object): """ Groups consecutive data from an array according to a clustering condition. A cluster is defined as a group of consecutive values differing by at most the increment value.
Missing values are **not** handled: the input sequence must therefore be free of missing values.
Parameters ---------- darray : ndarray Input data array to clusterize. increment : {float}, optional Increment between two consecutive values to group. By default, use a value of 1. operator : {function}, optional Comparison operator for the definition of clusters. By default, use :func:`numpy.less_equal`.
Attributes ---------- inishape Shape of the argument array (stored for resizing). inisize Size of the argument array. uniques : sequence List of unique cluster values, as they appear in chronological order. slices : sequence List of the slices corresponding to each cluster of data. starts : ndarray Array of the indices at which the clusters start. clustered : list List of clustered data.
Examples -------- >>> A = [0, 0, 1, 2, 2, 2, 3, 4, 3, 4, 4, 4] >>> klust = cluster(A,0) >>> [list(_) for _ in klust.clustered] [[0, 0], [1], [2, 2, 2], [3], [4], [3], [4, 4, 4]] >>> klust.uniques array([0, 1, 2, 3, 4, 3, 4])
>>> x = [ 1.8, 1.3, 2.4, 1.2, 2.5, 3.9, 1. , 3.8, 4.2, 3.3, ... 1.2, 0.2, 0.9, 2.7, 2.4, 2.8, 2.7, 4.7, 4.2, 0.4] >>> Cluster(x,1).starts array([ 0, 2, 3, 4, 5, 6, 7, 10, 11, 13, 17, 19]) >>> Cluster(x,1.5).starts array([ 0, 6, 7, 10, 13, 17, 19]) >>> Cluster(x,2.5).starts array([ 0, 6, 7, 19]) >>> Cluster(x,2.5,greater).starts array([ 0, 1, 2, 3, 4, 5, 8, 9, 10, ... 11, 12, 13, 14, 15, 16, 17, 18]) >>> y = [ 0, -1, 0, 0, 0, 1, 1, -1, -1, -1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0] >>> Cluster(y,1).starts array([ 0, 1, 2, 5, 7, 10, 12, 16, 18])
""" def __init__(self,darray,increment=1,operator=np.less_equal): """ Initializes instance.
Parameters ---------- darray : ndarray Input data array to clusterize. increment : {float}, optional Increment between two consecutive values to group. By default, use a value of 1. operator : {function}, optional Comparison operator for the definition of clusters. By default, use :func:`np.less_equal`
""" if hasattr(darray,'mask') and darray.mask.any(): raise ma.MAError("Masked arrays should be filled prior clustering.") else: darray = np.asanyarray(darray) n = darray.size self.inishape = darray.shape self.inisize = darray.size clustercond = 1 - operator(np.absolute(np.diff(darray.ravel())), increment) sid = np.r_[[0,], np.arange(1,n).compress(clustercond), [n,]] slobj = np.asarray([slice(i,d) for (i,d) in np.broadcast(sid[:-1],sid[1:])]) # self.uniques = darray.ravel()[sid[:-1]] self.clustered = [darray[k] for k in slobj] self.sizes = np.asarray(np.diff(sid)) self.slices = slobj self.starts = sid[:-1]
def markonsize(self,operator,sizethresh): """ Creates a **mask** for the clusters that do not meet a size requirement. Thus, outputs ``False`` if the size requirement is met, ``True`` otherwise.
Parameters ---------- operator : function Comparison operator sizethresh : float Requirement for the sizes of the clusters
""" resmask = np.empty(self.inisize, dtype=bool) resmask[:] = True # for k in self.slices.compress(operator(self.sizes,sizethresh)): for k in self.slices[operator(self.sizes,sizethresh)]: resmask[k] = False return resmask.reshape(self.inishape)
def mark_greaterthan(self,sizemin): """ Shortcut for :meth:`markonsize(greater_equal,sizemin)`. Thus, the command outputs ``False`` for clusters larger than ``sizemin``, and ``True`` for clusters smaller than ``sizemin``.
Parameters ---------- sizemin : int Minimum size of the clusters.
See Also -------- :meth:`markonsize` Creates a **mask** for the clusters that do not meet a size requirement. """ return self.markonsize(np.greater_equal,sizemin)
def grouped_slices(self): """ Returns a dictionary with the unique values of ``self`` as keys, and a list of slices for the corresponding values.
See Also -------- :meth:`~Cluster.grouped_limits` that does the same thing """ # output = dict([(k,[]) for k in np.unique1d(self.uniques)]) for (k,v) in zip(self.uniques, self.slices): output[k].append(v) return output
def grouped_limits(self): """ Returns a dictionary with the unique values of ``self`` as keys, and a list of tuples (starting index, ending index) for the corresponding values.
See Also -------- :meth:`~Cluster.grouped_slices` """ output = dict([(k,[]) for k in np.unique1d(self.uniques)]) for (k,v) in zip(self.uniques, self.slices): output[k].append((v.start, v.stop)) for k in output: output[k] = np.array(output[k]) return output
_____
On Dec 2, 2008, at 11:43 AM, Daniel Ashbrook wrote:
I'm trying to figure out a way to return the indices of the start and end of a run of consecutive elements that match some condition, but only if there are more than a certain number.
For example, take the array (with indices in comment for clarity):
#0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
I want to find the start and end indices of all runs of 1s with length of 4 or longer; so here the answer would be:
[[2,5], [15,18]]
Is there a reasonable way to do this without looping? I've been playing around with diff() and where() but without too much progress.
Thanks,
dan _______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
_______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
On Dec 2, 2008, at 1:37 PM, Daniel Ashbrook wrote:
Wow, ask and ye shall receive way more than you expected! Thanks so much, Pierre - it's what I need:
a=[0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0] c = Cluster(a,0) z = zip(list(c.starts), [list(i) for i in c.clustered]) r = [(i,i+len(j)-1) for i,j in z if j[0] == 1 and len(j) >= 4] print(r)
[(2, 5), (15, 18)]
There's simpler:
[(_.start,_.stop-1) for _ in c.slices[(c.sizes>=4) & (c.uniques==1)]]
as c.slices is an array, you can directly select its elements that satisfy some conditions: here, we take the clusters that cover at least 4 element, provided the unique element is 1...
Aaah, so much better! I'm still getting used to how arrays operate differently than python lists. That's really cool - thanks a lot! This will make my work so much easier. dan Pierre GM wrote:
There's simpler:
[(_.start,_.stop-1) for _ in c.slices[(c.sizes>=4) & (c.uniques==1)]]
as c.slices is an array, you can directly select its elements that satisfy some conditions: here, we take the clusters that cover at least 4 element, provided the unique element is 1...
On Dec 2, 2008, at 2:12 PM, Daniel Ashbrook wrote:
Aaah, so much better! I'm still getting used to how arrays operate differently than python lists.
I know... It's so easy to select specific elements of an array that it's frustrating not to be able to do the same thing w/ lists...
That's really cool - thanks a lot! This will make my work so much easier.
Glad I could help. Your feedback is needed, let me know if you run into some bugs or other problems, so that I can update my code (it's part of a set of tools I'm writing to help the analysis of climate data...).
Here's a way that uses ufuncs. from numpy import * v = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0] myufunc = frompyfunc(lambda a, b: (a+b)*b, 2, 1) where(diff(myufunc.accumulate(v)) <= -4) This gives (array([5,18]),) where 5 and 18 are the right hand indices; the left hand indices can be found similarly. Alex Daniel Ashbrook wrote:
I'm trying to figure out a way to return the indices of the start and end of a run of consecutive elements that match some condition, but only if there are more than a certain number.
For example, take the array (with indices in comment for clarity):
#0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
I want to find the start and end indices of all runs of 1s with length of 4 or longer; so here the answer would be:
[[2,5], [15,18]]
Is there a reasonable way to do this without looping? I've been playing around with diff() and where() but without too much progress.
Thanks,
dan _______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
Here's a way that uses convolution. from numpy import * v = array([0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]) n=4 c = convolve(v, ones(n+1))[:len(v)] where((c==n) & (v==0)) This gives (array([ 6, 19]),) where 6 and 19 are off by one from the correct end indices. Maybe there's a more efficient way to do a convolution with this kind of rectangular window, using the cumulative sum for example. Alex Daniel Ashbrook wrote:
I'm trying to figure out a way to return the indices of the start and end of a run of consecutive elements that match some condition, but only if there are more than a certain number.
For example, take the array (with indices in comment for clarity):
#0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
I want to find the start and end indices of all runs of 1s with length of 4 or longer; so here the answer would be:
[[2,5], [15,18]]
Is there a reasonable way to do this without looping? I've been playing around with diff() and where() but without too much progress.
Thanks,
dan _______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
participants (3)
-
alex
-
Daniel Ashbrook
-
Pierre GM