<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=GB2312" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    On 6/20/2011 10:31 PM, Ken Seehart wrote:

    <blockquote cite="mid:4E002C94.2000604@seehart.com" type="cite">

      <meta content="text/html; charset=GB2312"

        http-equiv="Content-Type">

      <title></title>

      On 6/20/2011 7:59 PM, <a moz-do-not-send="true"

        class="moz-txt-link-abbreviated"

        href="mailto:king6cong@gmail.com">king6cong@gmail.com</a> wrote:

      <blockquote

        cite="mid:BANLkTikaUciL5X2o9BsnQDDBJCyq3KJ34A@mail.gmail.com"

        type="cite">

        <div>Hi，</div>

        <div>   I have two large files,each has more than 200000000

          lines,and each line consists of two fields,one is the id and

          the other a value,</div>

        <div>the ids are sorted.</div>

        <div><br>

        </div>

        <div>for example:</div>

        <div><br>

        </div>

        <div>file1</div>

        <div>(uin_a y)</div>

        <div>1 10000245</div>

        <div>2  12333</div>

        <div>3 324543</div>

        <div>5 3464565</div>

        <div>....</div>

        <div><br>

        </div>

        <div><br>

        </div>

        <div>file2 </div>

        <div>(uin_b gift)</div>

        <div>1 34545</div>

        <div>3 6436466</div>

        <div>4 35345646</div>

        <div>5 463626</div>

        <div>....</div>

        <div><br>

        </div>

        <div>I want to merge them and get a file,the lines of which

          consists of an id and the sum of the two values in file1 and

          file2。</div>

        <div>the codes are as below: </div>

        <div><br>

        </div>

        <div>uin_y=open('file1')</div>

        <div>uin_gift=open(file2')</div>

        <div><br>

        </div>

        <div>y_line=uin_y.next()</div>

        <div>gift_line=uin_gift.next()</div>

        <div><br>

        </div>

        <div> while 1:</div>

        <div>    try:</div>

        <div>        uin_a,y=[int(i) for i in y_line.split()]</div>

        <div>        uin_b,gift=[int(i) for i in gift_line.split()]</div>

        <div>        if uin_a==uin_b:</div>

        <div>            score=y+gift</div>

        <div>            print uin_a,score</div>

        <div>            y_line=uin_y.next()</div>

        <div>            gift_line=uin_gift.next()</div>

        <div>        if uin_a<uin_b:</div>

        <div>            print uin_a,y</div>

        <div>            y_line=uin_y.next()</div>

        <div>        if uin_a>uin_b:</div>

        <div>            print uin_b,gift</div>

        <div>            gift_line=uin_gift.next()</div>

        <div>    except StopIteration:</div>

        <div>        break</div>

        <div><br>

        </div>

        <div><br>

        </div>

        <div>the question is that those code runs 40+ minutes on a

          server(16 core,32G mem),</div>

        <div>the <span class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;"><span class="failedUrl" style="word-wrap:

              break-word;">time complexity is O(n),and there are not too

              much operations,</span></span></div>

        <div><span class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;"><span class="failedUrl" style="word-wrap:

              break-word;">I think it should be faster.So I want to ask

              which part costs so much.</span></span></div>

        <div><span class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;"><span class="failedUrl" style="word-wrap:

              break-word;">I tried the cProfile module </span></span><span

            class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;">but didn't get too much.</span></div>

        <div><span class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;">I guess maybe it is the int() operation that cost so

            much,but I'm not sure </span></div>

        <div> <span class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;"><span class="failedUrl" style="word-wrap:

              break-word;">and don't know how to solve this.</span></span></div>

        <div><span class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;"><span class="failedUrl" style="word-wrap:

              break-word;">Is there a way to avoid type convertion in

              Python such as scanf in C?</span></span></div>

        <div><span class="Apple-style-span" style="font-family:

            Helvetica,Arial,sans-serif; font-size: 13px; line-height:

            18px;"><span class="failedUrl" style="word-wrap:

              break-word;">Thanks for your help ：）</span></span></div>

      </blockquote>

      <br>

      Unfortunately python does not have a scanf equivalent AFAIK. Most

      use cases for scanf can be handled by regular expressions, but

      that would clearly useless for you, and just slow you down more

      since it does not perform the int conversion for you.<br>

      <br>

      Your code appears to have a bug: I would expect that the last

      entry will be lost unless both files end with the same index

      value. Be sure to test your code on a few short test files.<br>

      <br>

      I recommend psyco to make the whole thing faster.<br>

      <br>

      Regards,<br>

      Ken Seehart<br>

      <br>

    </blockquote>

    Another thought (a bit of extra work, but you might find it

    worthwhile if psyco doesn't yield a sufficient boost):<br>

    <br>

    Write a couple filter programs to convert to and from binary data

    (pairs of 32 or 64 bit integers depending on your requirements).<br>

    <br>

    Modify your program to use the subprocess module to open two

    instances of the binary conversion process with the two input files.

    Then pipe the output of that program into the binary to text filter.<br>

    <br>

    This might turn out to be faster since each process would make use

    of a core. Also it gives you other options, such as keeping your

    data in binary form for processing, and converting to text only as

    needed.<br>

    <br>

    Ken Seehart<br>

    <br>

  </body>

</html>