something about performance

Tue Jun 21 05:31:18 EDT 2011

I just wrote something. I could not run a profiler or analyze the timing, but
I felt it was efficient. Have a look and see if it helps:

from itertools import ifilter, izip

def sep_add(line1, line2):
    # True only when both lines have exactly two fields and share the same id
    if line1 and line2:
        val1 = line1.split()
        val2 = line2.split()
        if len(val1) == len(val2) == 2:
            return val1[0] == val2[0]
    return False

def add_col_files(file1, file2):
    fs1 = open(file1, "r")
    fs2 = open(file2, "r")

    fsn = open("new_sample.txt", "w")

    # Zip the files together and find the ones that match the index
    # process the tuple accordingly
    # output should be what you want to write to the file

    for k in ifilter(lambda (i, j): sep_add(i, j), izip(fs1, fs2)):
        # strip the newline each input line carries before joining them
        output = k[0].rstrip("\n") + " " + k[1].rstrip("\n") + "\n" # sample output
        fsn.write(output)

    fsn.close()
    fs1.close()
    fs2.close()

if __name__ == "__main__":
    import time
    start = time.localtime(time.time())
    print time.asctime(start)

    #add_col_files("sample1.txt", "sample3.txt")

    end = time.localtime(time.time())
    print time.asctime(end)

It took about a minute on my computer to compare files of about 50-100M in
size. As I said, I have not done much testing on this code.
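
If you want a number rather than two timestamps to compare by eye, a small
helper along these lines might do. This is only a sketch: timed is a name I
made up for illustration, and it measures wall-clock time, not CPU time.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) in wall-clock time."""
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    return result, elapsed
```

For example, timed(add_col_files, "sample1.txt", "sample3.txt") would return
the function's result together with the seconds it took.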

2011/6/21 Ken Seehart <ken at seehart.com>

> On 6/20/2011 10:31 PM, Ken Seehart wrote:
>
> On 6/20/2011 7:59 PM, king6cong at gmail.com wrote:
>
> Hi,
>    I have two large files. Each has more than 200000000 lines, and each line
> consists of two fields: an id and a value.
> The ids are sorted.
>
>  for example:
>
>  file1
> (uin_a y)
> 1 10000245
> 2  12333
> 3 324543
> 5 3464565
> ....
>
>
>  file2
> (uin_b gift)
> 1 34545
> 3 6436466
> 4 35345646
> 5 463626
> ....
>
>  I want to merge them into one file whose lines each consist of an id
> and the sum of the two values from file1 and file2.
> The code is below:
>
>  uin_y=open('file1')
> uin_gift=open('file2')
>
>  y_line=uin_y.next()
> gift_line=uin_gift.next()
>
>  while 1:
>     try:
>         uin_a,y=[int(i) for i in y_line.split()]
>         uin_b,gift=[int(i) for i in gift_line.split()]
>         if uin_a==uin_b:
>             score=y+gift
>             print uin_a,score
>             y_line=uin_y.next()
>             gift_line=uin_gift.next()
>         if uin_a<uin_b:
>             print uin_a,y
>             y_line=uin_y.next()
>         if uin_a>uin_b:
>             print uin_b,gift
>             gift_line=uin_gift.next()
>     except StopIteration:
>         break
>
>
>  The question is that this code runs for 40+ minutes on a server (16 cores,
> 32G mem).
> The time complexity is O(n), and there are not many operations,
> so I think it should be faster. I want to ask which part costs so much.
> I tried the cProfile module but didn't learn much from it.
> I guess it may be the int() operation that costs so much, but I'm not sure
>  and don't know how to solve this.
> Is there a way to avoid type conversion in Python, such as scanf in C?
> Thanks for your help :)
>
>
> Unfortunately python does not have a scanf equivalent AFAIK. Most use cases
> for scanf can be handled by regular expressions, but that would clearly be
> useless for you, and would just slow you down more since it does not perform
> the int conversion for you.
>
> Your code appears to have a bug: I would expect that the last entry will be
> lost unless both files end with the same index value. Be sure to test your
> code on a few short test files.
>
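
One way the lost-tail problem described above could be avoided is to flush
whatever remains once either input runs out. The following is a rough sketch,
not the original poster's code: merge_sums is a hypothetical name, and it
assumes every data line is "id value" with integer fields, sorted by id, with
no header lines.

```python
def merge_sums(lines_a, lines_b):
    """Yield (id, total) from two iterables of 'id value' lines sorted by id."""
    def parse(lines):
        for line in lines:
            fields = line.split()
            if len(fields) == 2:  # skip blank or malformed lines
                yield int(fields[0]), int(fields[1])

    it_a, it_b = parse(lines_a), parse(lines_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            # same id in both inputs: emit the sum and advance both
            yield a[0], a[1] + b[1]
            a, b = next(it_a, None), next(it_b, None)
        elif a[0] < b[0]:
            yield a
            a = next(it_a, None)
        else:
            yield b
            b = next(it_b, None)
    # One side is exhausted; emit whatever is left on the other side.
    while a is not None:
        yield a
        a = next(it_a, None)
    while b is not None:
        yield b
        b = next(it_b, None)
```

Since it accepts any iterable of lines, two open file objects can be passed
directly, and short lists of strings are enough for testing.
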
> I recommend psyco to make the whole thing faster.
>
> Regards,
> Ken Seehart
>
>  Another thought (a bit of extra work, but you might find it worthwhile if
> psyco doesn't yield a sufficient boost):
>
> Write a couple filter programs to convert to and from binary data (pairs of
> 32 or 64 bit integers depending on your requirements).
>
> Modify your program to use the subprocess module to open two instances of
> the binary conversion process with the two input files. Then pipe the output
> of that program into the binary to text filter.
>
> This might turn out to be faster since each process would make use of a
> core. Also it gives you other options, such as keeping your data in binary
> form for processing, and converting to text only as needed.
>
> Ken Seehart
>
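
The text/binary filter pair suggested above could be sketched with the struct
module roughly as follows. The record layout (two signed 64-bit little-endian
integers per line) and the names RECORD, text_to_binary and binary_to_text
are my assumptions for illustration; the subprocess piping is left out.

```python
import struct

# Assumption: one record = (id, value) as two signed 64-bit little-endian ints.
RECORD = struct.Struct("<qq")

def text_to_binary(lines):
    """Yield one packed 16-byte record per 'id value' text line."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            yield RECORD.pack(int(fields[0]), int(fields[1]))

def binary_to_text(data):
    """Yield 'id value' lines from a bytes object holding packed records."""
    for offset in range(0, len(data) - RECORD.size + 1, RECORD.size):
        uin, value = RECORD.unpack_from(data, offset)
        yield "%d %d\n" % (uin, value)
```

Whether this actually beats calling int() on every line would have to be
measured on the real data.
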