[SciPy-User] Identify unique sequence data from array

Wed Dec 22 15:57:07 EST 2010

On Wed, Dec 22, 2010 at 15:27,  <josef.pktd at gmail.com> wrote:
> On Wed, Dec 22, 2010 at 3:18 PM, otrov <dejan.org at gmail.com> wrote:
>>>> The problem:
>>
>>>> I have 2D data sets (scipy/numpy arrays) of 10^7 to 10^8 rows, which consist of repetitions of one unique sequence, usually ~10^5 rows long, though the length may vary. The period is the same for both columns, so it makes no real difference whether we treat the array as 2D or 1D.
>>>> I want to identify this repeating data block.
>>
>>> for i in range(1, len(X)-1):
>>>     if (X[i:] == X[:-i]).all():
>>>         break
>
> I don't see how this works, isn't it
>
> (X[:i] == X[-i:]).all():

Not if the repeated subsequence is [1, 2, 3, 1, 2]. That said, my
method probably also has a counterexample.
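
To make the difference concrete, here is a small sketch that runs both
tests on a sequence built from that block. The data and the helper names
(shift_test, ends_test) are made up for illustration; they are not from
the thread.

```python
import numpy as np

# Hypothetical data: the block [1, 2, 3, 1, 2] repeated three times.
block = np.array([1, 2, 3, 1, 2])
X = np.tile(block, 3)

def shift_test(X, i):
    # Compare the whole sequence with itself shifted by i elements.
    return bool((X[i:] == X[:-i]).all())

def ends_test(X, i):
    # Compare only the first i elements with the last i elements.
    return bool((X[:i] == X[-i:]).all())

shift_hits = [i for i in range(1, len(X) - 1) if shift_test(X, i)]
ends_hits = [i for i in range(1, len(X) - 1) if ends_test(X, i)]

# First shift accepted by each test.
print(shift_hits[0], ends_hits[0])  # prints: 5 2
```

The ends-only comparison stops at i = 2 because the block itself starts
and ends with [1, 2]; the shifted comparison rejects every shift below 5.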

> with an integer repeat, there should also be a restriction that n/i is
> an int, otherwise the repeat is not possible.
>
> if n//i != n/float(i): continue
>
> or mod == 0

I allowed for the sequence to have some incomplete part of the
repeated section at the tail end. If the sequence is a perfect
multiple, then you can avoid doing the expensive test if (n % i) != 0.
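
A minimal sketch of that idea, wrapping the shifted comparison in a
function. The name find_period and the exact flag are assumptions made
for this example, and the loop bound differs slightly from the quoted
loop; with exact=True the cheap divisibility check runs before the
expensive array comparison.

```python
import numpy as np

def find_period(X, exact=False):
    """Return the smallest shift i at which X matches itself.

    Returns len(X) if no shift matches. With exact=True (an assumed
    option), shifts that do not divide len(X) are skipped.
    """
    n = len(X)
    for i in range(1, n):
        if exact and n % i != 0:
            continue  # cheap check; avoids the full comparison
        if (X[i:] == X[:-i]).all():
            return i
    return n

# Hypothetical data: a 4-element block, with and without a partial tail.
full = np.tile([4, 7, 7, 1], 5)   # 20 elements, 5 complete repeats
cut = full[:18]                   # 18 elements, last repeat incomplete

print(find_period(cut))               # prints: 4
print(find_period(cut, exact=True))   # prints: 18
print(find_period(full, exact=True))  # prints: 4
```

With the incomplete tail, the exact variant finds no divisor of 18 that
works and falls back to the full length, which is why the check is only
safe when the data really is a whole number of repeats.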

For a 1D sequence, you can also try reshaping:

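n = len(X)
for i in range(1, n//2 + 1):
    if (n % i) != 0:
        # i cannot be the length of an exactly repeated block
        continue
    Y = X.reshape((-1, i))
    if (Y == Y[0]).all():
        break
else:
    i = n  # no shorter repeated block found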

For 2D sequences (probably):

n, rowlen = X.shape
for i in range(1, n//2 + 1):
    if (n % i) != 0:
        # i cannot be the length of an exactly repeated block
        continue
    Y = X.reshape((-1, i, rowlen))
    if (Y == Y[0]).all():
        break
else:
    i = n  # no shorter repeated block found
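
The two reshape loops can be folded into one helper. This is only a
sketch: the name find_exact_period is invented here, and it assumes the
data is a whole number of repeats with the rows of a 2D array playing
the role of sequence elements.

```python
import numpy as np

def find_exact_period(X):
    """Smallest block length i such that X is that block repeated n // i times.

    Returns len(X) if no shorter block is found. Accepts 1D arrays, or
    2D arrays in which each row is one element of the sequence.
    """
    X = np.asarray(X)
    n = len(X)
    for i in range(1, n // 2 + 1):
        if n % i != 0:
            continue
        # Shape (n // i, i) for 1D input, (n // i, i, rowlen) for 2D input.
        Y = X.reshape((-1, i) + X.shape[1:])
        # Broadcasting compares every block against the first one.
        if (Y == Y[0]).all():
            return i
    return n

# Hypothetical 2D data: a 3-row block repeated 4 times.
block = np.array([[0, 1], [2, 3], [4, 5]])
X2 = np.tile(block, (4, 1))
print(find_exact_period(X2))                             # prints: 3
print(find_exact_period(np.tile([1, 2, 3, 1, 2], 2)))    # prints: 5
```

Note that the comparison Y == Y[0] allocates a boolean array as large as
X, so for 10^8 rows the memory cost of each attempted block length is
worth keeping in mind.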

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco