[SciPy-User] Identify unique sequence data from array
Robert Kern
robert.kern at gmail.com
Wed Dec 22 15:57:07 EST 2010
On Wed, Dec 22, 2010 at 15:27, <josef.pktd at gmail.com> wrote:
> On Wed, Dec 22, 2010 at 3:18 PM, otrov <dejan.org at gmail.com> wrote:
>>>> The problem:
>>
>>>> I have 2D data sets (scipy/numpy arrays) of 10^7 to 10^8 rows, each consisting of repetitions of one unique sequence, usually ~10^5 rows long, though the scale may differ. The period is the same for both columns, so it makes no real difference whether we consider a 2D or a 1D array.
>>>> I want to track this data block.
>>
>>> for i in range(1, len(X)-1):
>>>     if (X[i:] == X[:-i]).all():
>>>         break
>
> I don't see how this works, isn't it
>
> (X[:i] == X[-i:]).all():
Not if the repeated subsequence is [1, 2, 3, 1, 2]. That said, my
method probably also has a counterexample.
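
The two comparisons can be checked against that example. This is only a minimal sketch: the helper names shift_period and head_tail_period are invented here for illustration and do not come from the thread or from NumPy.

```python
import numpy as np

def shift_period(X):
    """Smallest i with X[i:] == X[:-i], or None. Tolerates an incomplete tail."""
    for i in range(1, len(X) - 1):
        if (X[i:] == X[:-i]).all():
            return i
    return None

def head_tail_period(X):
    """Smallest i with X[:i] == X[-i:], or None."""
    for i in range(1, len(X) - 1):
        if (X[:i] == X[-i:]).all():
            return i
    return None

# Four repeats of the subsequence [1, 2, 3, 1, 2].
X = np.tile([1, 2, 3, 1, 2], 4)
print(shift_period(X))       # 5
print(head_tail_period(X))   # 2, because the block starts and ends with [1, 2]
print(shift_period(X[:18]))  # 5, even with the last repeat cut short
```

The head/tail comparison stops at 2 because the block happens to begin and end with the same two values, while the shifted comparison has to match along the whole array.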
> with an integer repeat, there should also be a restriction that n/i is
> an int, otherwise the repeat is not possible.
>
> if n//i != n/float(i): continue
>
> or mod == 0
I allowed for the sequence to have some incomplete part of the
repeated section at the tail end. If the sequence is a perfect
multiple of the period, then you can skip the expensive comparison
whenever (n % i) != 0, where n = len(X).

For a 1D sequence, you can also try reshaping:
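n = len(X)
for i in range(2, n//2 + 1):  # + 1 so that i == n//2 (exactly two repeats) is tested
    if (n % i) != 0:
        continue  # i must divide n for the reshape to work
    Y = X.reshape((-1, i))  # one candidate block per row
    if (Y == Y[0]).all():  # every row equals the first row
        break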
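
Wrapped up as a function, the reshape test might look like the sketch below. The name reshape_period is made up for illustration, and two things are my own choices rather than part of the snippet above: the search starts at i = 1, and the function returns None when no period is found.

```python
import numpy as np

def reshape_period(X):
    """Smallest i dividing len(X) such that X is one length-i block repeated.

    Returns None if no such i <= len(X) // 2 exists.
    """
    X = np.asarray(X)
    n = len(X)
    for i in range(1, n // 2 + 1):
        if n % i != 0:
            continue               # reshape needs a whole number of blocks
        Y = X.reshape((-1, i))     # one candidate block per row
        if (Y == Y[0]).all():      # broadcasting compares each row with row 0
            return i
    return None

X = np.tile(np.arange(6), 5)       # 0..5 repeated five times
print(reshape_period(X))           # 6
print(reshape_period(np.arange(10)))  # None, nothing repeats
```

Unlike the shifted comparison, this only works when the array holds a whole number of repeats.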
For 2D sequences (probably):
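n, rowlen = X.shape
for i in range(2, n//2 + 1):  # + 1 so that i == n//2 (exactly two repeats) is tested
    if (n % i) != 0:
        continue  # i must divide the number of rows
    Y = X.reshape((-1, i, rowlen))  # one candidate block of i rows per entry
    if (Y == Y[0]).all():  # every block equals the first block
        break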
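
Since the original question was about getting hold of the repeated block itself, here is a hedged sketch that returns the first i rows once the 2D test succeeds. The function name find_block_2d and the small test array are hypothetical, and the sketch assumes the array holds a whole number of repeats.

```python
import numpy as np

def find_block_2d(X):
    """Return the shortest block of rows that tiles X exactly, or None."""
    X = np.asarray(X)
    n, rowlen = X.shape
    for i in range(1, n // 2 + 1):
        if n % i != 0:
            continue                        # i must divide the row count
        Y = X.reshape((-1, i, rowlen))      # shape: (repeats, i, rowlen)
        if (Y == Y[0]).all():               # every block equals the first
            return X[:i]
    return None

# Hypothetical data: a 4-row, 2-column block repeated three times.
block = np.column_stack([np.arange(4), np.arange(4) * 10])
X = np.tile(block, (3, 1))
found = find_block_2d(X)
print(found.shape)                   # (4, 2)
print(np.array_equal(found, block))  # True
```

For arrays of 10^7 to 10^8 rows each comparison touches the whole array, so skipping the values of i that do not divide n saves most of the work.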
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
-- Umberto Eco