[Numpy-discussion] Help to process a large data file

orionbelt2 at gmail.com
Thu Oct 2 11:43:37 EDT 2008


Frank,

I would imagine that you cannot get much better performance in Python 
than this, which avoids any string-to-number conversion:

c = []      # counts of ones between consecutive '1 1' rows
count = 0
with open('foo') as f:
    for line in f:
        if line == '1 1\n':
            # Hit a '1 1' marker: record the running count and restart.
            c.append(count)
            count = 0
        elif '1' in line:
            # Any other line containing a one ('1 0' or '0 1').
            count += 1
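
For reference, here is the same logic run on the sample data from your 
message, as a self-contained check (io.StringIO stands in for the file):

import io

sample = (
    '1 0\n0 0\n1 1\n0 0\n0 1\n0 1\n0 0\n'
    '0 1\n1 1\n0 0\n0 1\n0 1\n1 1\n'
)

c = []
count = 0
for line in io.StringIO(sample):
    if line == '1 1\n':
        c.append(count)
        count = 0
    elif '1' in line:
        count += 1

print(c)      # [1, 3, 2] -- c[0] counts the ones before the first '1 1'
print(c[1:])  # [3, 2], the counts you asked for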

One could try a numpy trick like:

import numpy as np

a = np.loadtxt('foo', dtype=int)
a = np.sum(a, axis=1)    # Add the two columns row-wise: each sum is 0, 1, or 2
b = np.where(a == 2)[0]  # Indices of the '1 1' rows (sum == 2)
count = []
for i, j in zip(b[:-1], b[1:]):
    count.append(a[i+1:j].sum())  # Number of ones between consecutive markers

but on my machine the numpy version takes about 20 sec for a 'foo' file 
of 2,500,000 lines versus 1.2 sec for the pure Python version...
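
If the per-interval loop ever became the bottleneck, the same counts can 
be obtained from a cumulative sum with no Python loop at all. A minimal 
sketch (my assumption, unverified, is that most of the 20 sec is 
np.loadtxt parsing the text, so this may not buy much):

import numpy as np

a = np.loadtxt('foo', dtype=int).sum(axis=1)  # Row sums: 0, 1, or 2
b = np.flatnonzero(a == 2)                    # Indices of the '1 1' rows
cs = np.concatenate(([0], np.cumsum(a)))      # cs[k] == a[:k].sum()
count = cs[b[1:]] - cs[b[:-1] + 1]            # a[i+1:j].sum() per marker pair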

As a side note, if I replace "line == '1 1\n'" with 
"line.startswith('1 1')", the pure Python version goes up to 1.8 sec... 
Isn't this a bit weird? I'd think startswith() would be faster...
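
My guess (an assumption, I have not profiled the interpreter here) is 
that == on two str objects takes a fast path in CPython's comparison 
machinery, while startswith() pays for an attribute lookup and a method 
call on every line. A quick way to time the two in isolation:

import timeit

setup = "line = '1 1\\n'"
print(timeit.timeit("line == '1 1\\n'", setup=setup))        # plain equality
print(timeit.timeit("line.startswith('1 1')", setup=setup))  # method call each time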

Chris

On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:

>    Hi,
> 
>    I have a large data file which contains 2 columns of data. The two 
>    columns contain only zeros and ones. Now I want to count how many 
>    ones there are in between the rows where both columns are one. For 
>    example, if my data is:
> 
>    1 0
>    0 0
>    1 1
>    0 0
>    0 1    x
>    0 1    x
>    0 0
>    0 1    x
>    1 1
>    0 0
>    0 1    x
>    0 1    x
>    1 1
> 
>    Then my counts will be 3 and 2 (the rows marked with x).
> 
>    Is there an efficient way to do this? My data file is pretty big.
> 
>    Thanks
> 
>    Frank


