[Numpy-discussion] Help to process a large data file
orionbelt2 at gmail.com
Thu Oct 2 11:43:37 EDT 2008
Frank,
I would imagine that you cannot get much better performance in Python
than this, which avoids string conversions:
c = []
count = 0
for line in open('foo'):
    if line == '1 1\n':
        c.append(count)
        count = 0
    elif '1' in line:
        count += 1
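(For the sample data quoted below this gives c == [1, 3, 2]: the leading
1 counts the '1 0' line before the first '1 1', so you may want to
ignore c[0].)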
One could do some numpy trick like:
import numpy as np

a = np.loadtxt('foo', dtype=int)
a = np.sum(a, axis=1)    # Add the two columns horizontally
b = np.where(a == 2)[0]  # Find rows whose sum == 2 (1 + 1)
count = []
for i, j in zip(b[:-1], b[1:]):
    count.append(a[i+1:j].sum())  # Number of ones between consecutive "1 1" rows
but on my machine the numpy version takes about 20 sec for a 'foo' file
of 2,500,000 lines versus 1.2 sec for the pure Python version...
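For what it's worth, the Python-level loop over zip(b[:-1], b[1:]) can
be replaced by a cumulative sum. A rough, untimed sketch on the same
'foo' file (though np.loadtxt itself probably accounts for most of
those 20 sec):

import numpy as np

a = np.loadtxt('foo', dtype=int)
s = a.sum(axis=1)                       # row sums: 2 marks a "1 1" line
csum = np.cumsum(s >= 1)                # running count of lines containing a 1
b = np.where(s == 2)[0]                 # indices of the "1 1" lines
count = csum[b[1:] - 1] - csum[b[:-1]]  # ones-lines strictly between markers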
As a side note, if I replace "line == '1 1\n'" with
"line.startswith('1 1')", the pure Python version goes up to 1.8 sec...
Isn't this a bit weird? I'd think startswith() would be faster...
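A plausible explanation: string equality on a whole line is a single
built-in comparison, whereas startswith() pays for an attribute lookup
and a method call on every iteration. Hoisting the method lookup out of
the loop might recover part of the difference (an untested sketch):

startswith = str.startswith   # look the method up once, not per line
c = []
count = 0
for line in open('foo'):
    if startswith(line, '1 1'):
        c.append(count)
        count = 0
    elif '1' in line:
        count += 1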
Chris
On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:
> Hi,
>
> I have a large data file which contains 2 columns of data. The two
> columns contain only zeros and ones. Now I want to count how many ones
> occur in between the lines where both columns are one. For example, if
> my data is:
>
> 1 0
> 0 0
> 1 1
> 0 0
> 0 1 x
> 0 1 x
> 0 0
> 0 1 x
> 1 1
> 0 0
> 0 1 x
> 0 1 x
> 1 1
>
> Then my count will be 3 and 2 (the lines marked with x).
>
> Is there an efficient way to do this? My data file is pretty big.
>
> Thanks
>
> Frank