something about performance
Vijay Murthy
vijay.murthy at gmail.com
Tue Jun 21 05:31:18 EDT 2011
I just wrote something. I could not run a profiler or analyze the timing, but
I felt it was efficient. Have a look and see if it helps:

from itertools import ifilter, izip

def sep_add(line1, line2):
    # True when both lines have two fields and share the same index
    if line1 and line2:
        val1 = line1.split()
        val2 = line2.split()
        if (val1 and val2) and (len(val1) == len(val2) == 2):
            return (val1[0] == val2[0])

def add_col_files(file1, file2):
    fs1 = open(file1, "r")
    fs2 = open(file2, "r")
    fsn = open("new_sample.txt", "w")
    # Zip the files together and find the ones that match the index
    # process the tuple accordingly
    # output should be what you want to write to the file
    for k in ifilter(lambda (i,j): (sep_add(i,j)), izip(fs1, fs2)):
        if k:
            output = k[0] + " " + k[1] + "\n"  # sample output
            fsn.write(output)
    fsn.close()
    fs1.close()
    fs2.close()

if __name__ == "__main__":
    import time
    start = time.localtime(time.time())
    print time.asctime(start)
    #add_col_files("sample1.txt", "sample3.txt")
    end = time.localtime(time.time())
    print time.asctime(end)

It took about a minute on my computer to compare files of about 50-100M in size.
As I said, I have not tested this code much.
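
For the merge described in the quoted question, where an id can be missing from either file, pairing lines by position is not enough: the two inputs have to be walked like a sorted merge, and the leftover tail of the longer input has to be flushed at the end. Here is a minimal sketch of that idea, written for Python 3 rather than the Python 2 used elsewhere in this thread. The name merge_sum and the choice to skip malformed lines are my own assumptions, not something from the original code.

```python
def merge_sum(lines_a, lines_b):
    """Yield (id, total) pairs from two iterables of 'id value' lines.

    Assumes both inputs are sorted by id in ascending order.
    merge_sum is a hypothetical helper name used for illustration.
    """
    def parse(lines):
        for line in lines:
            fields = line.split()
            if len(fields) == 2:  # assumption: skip blank or malformed lines
                yield int(fields[0]), int(fields[1])

    it_a, it_b = parse(lines_a), parse(lines_b)
    a = next(it_a, None)
    b = next(it_b, None)

    # Walk both inputs in step while each still has a pending record.
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1] + b[1]
            a, b = next(it_a, None), next(it_b, None)
        elif a[0] < b[0]:
            yield a
            a = next(it_a, None)
        else:
            yield b
            b = next(it_b, None)

    # Flush whatever remains in the input that has not run out yet.
    while a is not None:
        yield a
        a = next(it_a, None)
    while b is not None:
        yield b
        b = next(it_b, None)


if __name__ == "__main__":
    file1 = ["1 10000245\n", "2 12333\n", "3 324543\n", "5 3464565\n"]
    file2 = ["1 34545\n", "3 6436466\n", "4 35345646\n", "5 463626\n"]
    for uin, score in merge_sum(file1, file2):
        print(uin, score)
```

Open file objects can be passed in place of the lists, since the function only iterates over lines. I have not timed this against the 200000000-line inputs.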
2011/6/21 Ken Seehart <ken at seehart.com>
> On 6/20/2011 10:31 PM, Ken Seehart wrote:
>
> On 6/20/2011 7:59 PM, king6cong at gmail.com wrote:
>
> Hi,
> I have two large files. Each has more than 200000000 lines, and each line
> consists of two fields: an id and a value.
> The ids are sorted.
>
> for example:
>
> file1
> (uin_a y)
> 1 10000245
> 2 12333
> 3 324543
> 5 3464565
> ....
>
>
> file2
> (uin_b gift)
> 1 34545
> 3 6436466
> 4 35345646
> 5 463626
> ....
>
> I want to merge them into one file in which each line consists of an id
> and the sum of the two values from file1 and file2.
> The code is below:
>
> uin_y=open('file1')
> uin_gift=open('file2')
>
> y_line=uin_y.next()
> gift_line=uin_gift.next()
>
> while 1:
>     try:
>         uin_a,y=[int(i) for i in y_line.split()]
>         uin_b,gift=[int(i) for i in gift_line.split()]
>         if uin_a==uin_b:
>             score=y+gift
>             print uin_a,score
>             y_line=uin_y.next()
>             gift_line=uin_gift.next()
>         if uin_a<uin_b:
>             print uin_a,y
>             y_line=uin_y.next()
>         if uin_a>uin_b:
>             print uin_b,gift
>             gift_line=uin_gift.next()
>     except StopIteration:
>         break
>
>
> The problem is that this code runs for 40+ minutes on a server (16 cores, 32G
> mem).
> The time complexity is O(n) and there are not many operations, so
> I think it should be faster. I want to ask which part costs so much.
> I tried the cProfile module but didn't learn much from it.
> I guess it may be the int() conversion that costs so much, but I'm not sure
> and don't know how to solve this.
> Is there a way to avoid type conversion in Python, like scanf in C?
> Thanks for your help :)
>
>
> Unfortunately Python does not have a scanf equivalent AFAIK. Most use cases
> for scanf can be handled by regular expressions, but that would clearly be
> useless for you, and would just slow you down more, since it does not perform
> the int conversion for you.
>
> Your code appears to have a bug: I would expect that the last entry will be
> lost unless both files end with the same index value. Be sure to test your
> code on a few short test files.
>
> I recommend psyco to make the whole thing faster.
>
> Regards,
> Ken Seehart
>
> Another thought (a bit of extra work, but you might find it worthwhile if
> psyco doesn't yield a sufficient boost):
>
> Write a couple filter programs to convert to and from binary data (pairs of
> 32 or 64 bit integers depending on your requirements).
>
> Modify your program to use the subprocess module to open two instances of
> the binary conversion process with the two input files. Then pipe the output
> of that program into the binary to text filter.
>
> This might turn out to be faster since each process would make use of a
> core. Also it gives you other options, such as keeping your data in binary
> form for processing, and converting to text only as needed.
>
> Ken Seehart
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>
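
The binary-filter idea quoted above could look roughly like the sketch below, using the standard struct module in Python 3. It is only an illustration under my own assumptions: each record is packed as two little-endian signed 64-bit integers, and the names PAIR, text_to_binary and binary_to_pairs are hypothetical. It leaves out the subprocess and pipe plumbing and shows only the conversion in each direction.

```python
import struct

# Assumption: one record is two little-endian signed 64-bit integers (id, value).
PAIR = struct.Struct("<qq")


def text_to_binary(lines):
    """Pack 'id value' text lines into fixed-width binary records."""
    chunks = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        uin, value = line.split()  # raises ValueError unless there are two fields
        chunks.append(PAIR.pack(int(uin), int(value)))
    return b"".join(chunks)


def binary_to_pairs(data):
    """Unpack binary records back into a list of (id, value) tuples."""
    return list(PAIR.iter_unpack(data))


if __name__ == "__main__":
    blob = text_to_binary(["1 10000245\n", "2 12333\n"])
    for uin, value in binary_to_pairs(blob):
        print(uin, value)
```

With fixed-width records the merge loop can read integers directly and skip split() and int() on every line. Whether that pays for the extra conversion pass is something that would have to be measured.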