<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=GB2312" http-equiv="Content-Type">
<title></title>
</head>
<body text="#000000" bgcolor="#ffffff">
On 6/20/2011 7:59 PM, <a class="moz-txt-link-abbreviated" href="mailto:king6cong@gmail.com">king6cong@gmail.com</a> wrote:
<blockquote
cite="mid:BANLkTikaUciL5X2o9BsnQDDBJCyq3KJ34A@mail.gmail.com"
type="cite">
<div>Hi£¬</div>
<div> I have two large files,each has more than 200000000
lines,and each line consists of two fields,one is the id and the
other a value,</div>
<div>the ids are sorted.</div>
<div><br>
</div>
<div>for example:</div>
<div><br>
</div>
<div>file1</div>
<div>(uin_a y)</div>
<div>1 10000245</div>
<div>2 12333</div>
<div>3 324543</div>
<div>5 3464565</div>
<div>....</div>
<div><br>
</div>
<div><br>
</div>
<div>file2 </div>
<div>(uin_b gift)</div>
<div>1 34545</div>
<div>3 6436466</div>
<div>4 35345646</div>
<div>5 463626</div>
<div>....</div>
<div><br>
</div>
<div>I want to merge them and get a file,the lines of which
consists of an id and the sum of the two values in file1 and
file2¡£</div>
<div>the codes are as below: </div>
<div><br>
</div>
<div>uin_y=open('file1')</div>
<div>uin_gift=open(file2')</div>
<div><br>
</div>
<div>y_line=uin_y.next()</div>
<div>gift_line=uin_gift.next()</div>
<div><br>
</div>
<div>
while 1:</div>
<div> try:</div>
<div> uin_a,y=[int(i) for i in y_line.split()]</div>
<div> uin_b,gift=[int(i) for i in gift_line.split()]</div>
<div> if uin_a==uin_b:</div>
<div> score=y+gift</div>
<div> print uin_a,score</div>
<div> y_line=uin_y.next()</div>
<div> gift_line=uin_gift.next()</div>
<div> if uin_a<uin_b:</div>
<div> print uin_a,y</div>
<div> y_line=uin_y.next()</div>
<div> if uin_a>uin_b:</div>
<div> print uin_b,gift</div>
<div> gift_line=uin_gift.next()</div>
<div> except StopIteration:</div>
<div> break</div>
<div><br>
</div>
<div><br>
</div>
<div>the question is that those code runs 40+ minutes on a
server(16 core,32G mem),</div>
<div>the <span class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;"><span class="failedUrl" style="word-wrap: break-word;">time
complexity is O(n),and there are not too much operations,</span></span></div>
<div><span class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;"><span class="failedUrl" style="word-wrap: break-word;">I
think it should be faster.So I want to ask which part costs
so much.</span></span></div>
<div><span class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;"><span class="failedUrl" style="word-wrap: break-word;">I
tried the cProfile module </span></span><span
class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;">but didn't get too much.</span></div>
<div><span class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;">I guess maybe it is the int() operation that cost so
much,but I'm not sure </span></div>
<div>
<span class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;"><span class="failedUrl" style="word-wrap: break-word;">and
don't know how to solve this.</span></span></div>
<div><span class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;"><span class="failedUrl" style="word-wrap: break-word;">Is
there a way to avoid type convertion in Python such as scanf
in C?</span></span></div>
<div><span class="Apple-style-span" style="font-family:
Helvetica,Arial,sans-serif; font-size: 13px; line-height:
18px;"><span class="failedUrl" style="word-wrap: break-word;">Thanks
for your help £º£©</span></span></div>
</blockquote>
<br>
Unfortunately python does not have a scanf equivalent AFAIK. Most
use cases for scanf can be handled by regular expressions, but that
would clearly useless for you, and just slow you down more since it
does not perform the int conversion for you.<br>
<br>
Your code appears to have a bug: I would expect that the last entry
will be lost unless both files end with the same index value. Be
sure to test your code on a few short test files.<br>
<br>
I recommend psyco to make the whole thing faster.<br>
<br>
Regards,<br>
Ken Seehart<br>
<br>
</body>
</html>