[Numpy-discussion] Help to process a large data file

frank wang f.yw at hotmail.com
Thu Oct 2 15:20:09 EDT 2008


Thanks, David and Chris, for providing the nice solutions.

Both methods work great. I could not tell the speed difference between the two solutions. My data size is 1048577 lines.

I did not try the second (numpy) solution from Chris, since it is too slow, as Chris stated.
 
Frank
 
> Date: Thu, 2 Oct 2008 17:43:37 +0200
> From: orionbelt2 at gmail.com
> To: numpy-discussion at scipy.org
> CC: orionbelt2 at gmail.com
> Subject: Re: [Numpy-discussion] Help to process a large data file
>
> Frank,
>
> I would imagine that you cannot get a much better performance in python
> than this, which avoids string conversions:
>
> c = []
> count = 0
> for line in open('foo'):
>     if line == '1 1\n':
>         c.append(count)
>         count = 0
>     else:
>         if '1' in line: count += 1
>
> One could do some numpy trick like:
>
> import numpy as np
> a = np.loadtxt('foo',dtype=int)
> a = np.sum(a,axis=1) # Add the two columns horizontally
> b = np.where(a==2)[0] # Find rows with sum == 2 (1 + 1)
> count = []
> for i,j in zip(b[:-1],b[1:]):
>     count.append( a[i+1:j].sum() ) # Calculate number of lines with 1
>
> but on my machine the numpy version takes about 20 sec for a 'foo' file
> of 2,500,000 lines versus 1.2 sec for the pure python version...
>
> As a side note, if i replace "line == '1 1\n'" with
> "line.startswith('1 1')", the pure python version goes up to 1.8 sec...
> Isn't this a bit weird, i'd think startswith() should be faster...
>
> Chris
>
> On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:
> >
> > Hi,
> >
> > I have a large data file which contains 2 columns of data. The two
> > columns only have zero and one. Now I want to count how many ones are
> > in between the rows where both columns are one. For example, if my
> > data is:
> >
> > 1 0
> > 0 0
> > 1 1
> > 0 0
> > 0 1 x
> > 0 1 x
> > 0 0
> > 0 1 x
> > 1 1
> > 0 0
> > 0 1 x
> > 0 1 x
> > 1 1
> >
> > Then my count will be 3 and 2 (the numbers with x).
> >
> > Is there an efficient way to do this? My data file is pretty big.
> >
> > Thanks
> >
> > Frank
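
For reference, here is a minimal sketch of the counting described in the quoted thread. It is a variation on the pure Python loop, not the code either poster ran. Run on the example data, the quoted loop would also record the ones seen before the first '1 1' row (giving 1, 3, 2), so this version only starts counting after the first '1 1' row. The function name counts_between_markers is made up for illustration, and splitting each line into fields is an assumption that trades some of the speed of the plain string comparison for tolerance of extra whitespace.

```python
import io


def counts_between_markers(lines):
    """Count rows containing a 1 between consecutive '1 1' marker rows.

    `lines` is any iterable of text lines (an open file works).
    Rows before the first marker and after the last one are ignored.
    """
    counts = []
    count = None  # stays None until the first '1 1' row has been seen
    for line in lines:
        fields = line.split()
        if fields == ['1', '1']:
            # A marker row closes the current segment, if one is open.
            if count is not None:
                counts.append(count)
            count = 0
        elif count is not None and '1' in fields:
            count += 1
    return counts


# The example data from the original question, without the 'x' annotations.
sample = io.StringIO(
    "1 0\n0 0\n1 1\n0 0\n0 1\n0 1\n0 0\n0 1\n1 1\n0 0\n0 1\n0 1\n1 1\n"
)
print(counts_between_markers(sample))  # prints [3, 2]
```

To process a file, pass the open file object directly, for example counts_between_markers(open('foo')), where 'foo' is the file name used in the quoted message.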