[SciPy-User] Identify unique sequence data from array
otrov
dejan.org at gmail.com
Wed Dec 22 12:47:19 EST 2010
Hi,
I have sought help on three other lists, but since this problem apparently can't be solved easily in MATLAB/Octave(!?), I thought I would try SciPy/NumPy and perhaps benefit from Python being a more feature-rich, descriptive language.
The problem:
I have 2D data sets (SciPy/NumPy arrays) of 10^7 to 10^8 rows, each consisting of repetitions of one unique sequence, usually ~10^5 rows long, though its length may differ in scale. The period is the same for both columns, so it makes no real difference whether we treat the data as a 2D or a 1D array.
I want to identify this repeating data block.
Simplified problem:
X = array([[1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3],
           [1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3],
           [1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3],
           ...,
           [1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3]])
I would like to extract the repeated sequence data as a result:
Y = array([[1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3]])
Or presented more visually:
I want to identify unique sequence data:
A B C D D D A B C D D D A B C D D D
|_________| |_________| |_________|
     |           |           |
  unique      unique      unique
 sequence    sequence    sequence
   data        data        data
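To make the goal concrete, here is a minimal NumPy sketch of the extraction I have in mind. It is only an illustration under some assumptions: the data starts at the beginning of a repetition, the repeats are exact (noisy floating-point data would need a tolerance-based comparison instead), and the helper names smallest_period and repeated_block are made up for this example. It tests only offsets whose row matches the first row, which may still be slow on 10^8 rows if the first row occurs very often.

```python
import numpy as np


def smallest_period(X):
    """Return the smallest p such that X[i] == X[i + p] for every valid i.

    Hypothetical helper: assumes exact repeats that start at row 0.
    """
    X = np.asarray(X)
    n = len(X)
    if n < 2:
        return n
    # Only offsets whose row equals the first row can be a period.
    matches_first = (X[1:] == X[0]).reshape(n - 1, -1).all(axis=1)
    candidates = np.flatnonzero(matches_first) + 1
    for p in candidates:
        # p is a period if the array equals itself shifted by p rows.
        if np.array_equal(X[p:], X[:-p]):
            return int(p)
    return n  # no repetition found: the whole array is the sequence


def repeated_block(X):
    """Return the first full period of X (hypothetical helper)."""
    X = np.asarray(X)
    return X[:smallest_period(X)]


# Simplified problem from above: one 5-row block repeated 4 times.
block = np.array([[1, 2], [1, 2], [2, 2], [3, 1], [2, 3]])
X = np.tile(block, (4, 1))
Y = repeated_block(X)
print(Y)

# The 1D picture "A B C D D D" repeated three times works the same way.
labels = np.array(list("ABCDDD" * 3))
print(smallest_period(labels))
```

The shift comparison X[p:] == X[:-p] also accepts a truncated final repeat, so the total number of rows does not have to be a multiple of the period.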
Thanks for your time