something about performance
king6cong at gmail.com
Mon Jun 20 22:59:48 EDT 2011
Hi,
I have two large files, each with more than 200,000,000 lines. Each line
consists of two fields: an id and a value, and the ids are sorted.
For example:
file1
(uin_a y)
1 10000245
2 12333
3 324543
5 3464565
....
file2
(uin_b gift)
1 34545
3 6436466
4 35345646
5 463626
....
I want to merge them into a single file, each line of which consists of an id
and the sum of the two values from file1 and file2.
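Concretely, for the sample rows above the desired result can be checked with a small Python 3 sketch (Counter addition sums the values of ids present in both files and keeps ids present in only one; this only illustrates the expected output, since loading 200,000,000 lines into dicts would not fit the real problem):

```python
from collections import Counter

# Sample rows from file1 and file2 as {id: value} mappings
file1 = {1: 10000245, 2: 12333, 3: 324543, 5: 3464565}
file2 = {1: 34545, 3: 6436466, 4: 35345646, 5: 463626}

# Counter addition sums values for shared ids and keeps unmatched ids
merged = Counter(file1) + Counter(file2)
for uin in sorted(merged):
    print(uin, merged[uin])
```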
The code is as follows:
uin_y = open('file1')
uin_gift = open('file2')
y_line = uin_y.next()
gift_line = uin_gift.next()
while 1:
    try:
        uin_a, y = [int(i) for i in y_line.split()]
        uin_b, gift = [int(i) for i in gift_line.split()]
        if uin_a == uin_b:
            score = y + gift
            print uin_a, score
            y_line = uin_y.next()
            gift_line = uin_gift.next()
        elif uin_a < uin_b:
            print uin_a, y
            y_line = uin_y.next()
        else:
            print uin_b, gift
            gift_line = uin_gift.next()
    except StopIteration:
        break
The problem is that this code runs for 40+ minutes on a server (16 cores,
32 GB RAM). The time complexity is O(n) and there are not many operations
per line, so I think it should be faster, and I want to know which part
costs so much. I tried the cProfile module but didn't learn much from it.
My guess is that the int() conversion is what costs so much, but I'm not
sure and don't know how to avoid it.
Is there a way to avoid type conversion in Python, like scanf in C?
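For comparison, here is a Python 3 sketch of the same streaming merge (merge_sums is an illustrative name, not a library function). It caches the output write method, converts the value fields to int only when both ids match, and also drains whichever file still has lines left, which the loop above silently drops after the first StopIteration:

```python
import sys

def merge_sums(lines_a, lines_b, out=sys.stdout):
    """Merge two id-sorted streams of 'id value' lines.

    Ids present in both streams get the sum of their values; ids
    present in only one stream pass through unchanged.
    """
    it_a, it_b = iter(lines_a), iter(lines_b)
    a, b = next(it_a, None), next(it_b, None)
    write = out.write  # cache the bound method for the tight loop
    while a is not None and b is not None:
        sa, ya = a.split()
        sb, yb = b.split()
        id_a, id_b = int(sa), int(sb)
        if id_a == id_b:
            # only matched lines need both values converted
            write("%d %d\n" % (id_a, int(ya) + int(yb)))
            a, b = next(it_a, None), next(it_b, None)
        elif id_a < id_b:
            write("%s %s\n" % (sa, ya))  # value passes through as a string
            a = next(it_a, None)
        else:
            write("%s %s\n" % (sb, yb))
            b = next(it_b, None)
    # drain the remainder of whichever stream is longer; the loop in
    # the post above loses these lines after the first StopIteration
    for rest, it in ((a, it_a), (b, it_b)):
        while rest is not None:
            write(rest.rstrip("\n") + "\n")
            rest = next(it, None)
```

One common cost in loops like this is the per-line output rather than the int() calls; writing through a buffered file object, or batching lines before writing, often helps more than avoiding the conversion.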
Thanks for your help :)