<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

  <head>

    <meta content="text/html; charset=GB2312" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#ffffff">

    On 6/20/2011 11:56 PM, <a class="moz-txt-link-abbreviated" href="mailto:king6cong@gmail.com">king6cong@gmail.com</a> wrote:

    <blockquote

      cite="mid:BANLkTik-v7sugoas0eV4ZwyUQY5+zcoFTQ@mail.gmail.com"

      type="cite">Thanks for your reply,Ken :)

      <div>I found the two files ends with the same id……so I am lazy^-^<br>

        <div>I tried psyco,and unfortunately it costs nearly the same

          time as before.

          <div>Is it true that we can only get str from files in Python?</div>

          <div><br>

          </div>

        </div>

      </div>

    </blockquote>

    Nope<sup><sub>*</sub></sup>. There are many applications such as

    image processing that involve working with binary data.<br>

    <big><br>

    </big><small>(<sup><sub>*</sub></sup>well, technically yes actually:

      read() does in fact return str, but the str can contain binary

      data)</small><br>

    <br>

    But in order to do this, you need to use any of several modules that

    allow python to operate on flat data.<br>

    <br>

    Two standard modules exist for this purpose: <b>array </b>and <b>struct</b>. 

    In addition there are others such as <b>numpy </b>(for

    mathematical applications) and <b>ctypes </b>(for interoperability

    between python and C/C++).<br>

    <br>

    For your application, the <b>struct </b>module is sufficient.<br>

    <br>

    <tt>>>> fout = open('junk.dat', 'wb')   # open for writing

      binary<br>

      >>> fout.write(struct.pack('LL', 123,234))<br>

      >>> fout.write(struct.pack('LL', 123,234))<br>

      >>> fout.write(struct.pack('LL', 3,4))<br>

      >>> fout.close()<br>

      <br>

      >>> fin = open('junk.dat', 'rb')   # open for reading

      binary<br>

      >>> print struct.unpack('LL', fin.read(8))<br>

      (123, 234)<br>

      >>> print struct.unpack('LL', fin.read(8))<br>

      (123, 234)<br>

      >>> print struct.unpack('LL', fin.read(8))<br>

      (3, 4)<br>

      >>> print struct.unpack('LL', fin.read(8))   # raises

      struct.error at end of file (because 0 bytes were read)<br>

      Traceback (most recent call last):<br>

        File "<string>", line 1, in <fragment><br>

      struct.error: unpack requires a string argument of length 8<br>

      >>> </tt><br>

    <br>

    <br>

    <blockquote

      cite="mid:BANLkTik-v7sugoas0eV4ZwyUQY5+zcoFTQ@mail.gmail.com"

      type="cite">

      <div>

        <div>

          <div><br>

            <div class="gmail_quote">在 2011年6月21日 下午1:50，Ken Seehart <span

                dir="ltr"><<a moz-do-not-send="true"

                  href="mailto:ken@seehart.com">ken@seehart.com</a>></span>写

              道：<br>

              <blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt

                0.8ex; border-left: 1px solid rgb(204, 204, 204);

                padding-left: 1ex;">

                <div text="#000000" bgcolor="#ffffff">

                  <div>

                    <div class="h5"> On 6/20/2011 10:31 PM, Ken Seehart

                      wrote:

                      <blockquote type="cite"> On 6/20/2011 7:59 PM, <a

                          moz-do-not-send="true"

                          href="mailto:king6cong@gmail.com"

                          target="_blank">king6cong@gmail.com</a> wrote:

                        <blockquote type="cite">

                          <div>Hi，</div>

                          <div>   I have two large files,each has more

                            than 200000000 lines,and each line consists

                            of two fields,one is the id and the other a

                            value,</div>

                          <div>the ids are sorted.</div>

                          <div><br>

                          </div>

                          <div>for example:</div>

                          <div><br>

                          </div>

                          <div>file1</div>

                          <div>(uin_a y)</div>

                          <div>1 10000245</div>

                          <div>2  12333</div>

                          <div>3 324543</div>

                          <div>5 3464565</div>

                          <div>....</div>

                          <div><br>

                          </div>

                          <div><br>

                          </div>

                          <div>file2 </div>

                          <div>(uin_b gift)</div>

                          <div>1 34545</div>

                          <div>3 6436466</div>

                          <div>4 35345646</div>

                          <div>5 463626</div>

                          <div>....</div>

                          <div><br>

                          </div>

                          <div>I want to merge them and get a file,the

                            lines of which consists of an id and the sum

                            of the two values in file1 and file2。</div>

                          <div>the codes are as below: </div>

                          <div><br>

                          </div>

                          <div>uin_y=open('file1')</div>

                          <div>uin_gift=open(file2')</div>

                          <div><br>

                          </div>

                          <div>y_line=uin_y.next()</div>

                          <div>gift_line=uin_gift.next()</div>

                          <div><br>

                          </div>

                          <div> while 1:</div>

                          <div>    try:</div>

                          <div>        uin_a,y=[int(i) for i in

                            y_line.split()]</div>

                          <div>        uin_b,gift=[int(i) for i in

                            gift_line.split()]</div>

                          <div>        if uin_a==uin_b:</div>

                          <div>            score=y+gift</div>

                          <div>            print uin_a,score</div>

                          <div>            y_line=uin_y.next()</div>

                          <div>            gift_line=uin_gift.next()</div>

                          <div>        if uin_a<uin_b:</div>

                          <div>            print uin_a,y</div>

                          <div>            y_line=uin_y.next()</div>

                          <div>        if uin_a>uin_b:</div>

                          <div>            print uin_b,gift</div>

                          <div>            gift_line=uin_gift.next()</div>

                          <div>    except StopIteration:</div>

                          <div>        break</div>

                          <div><br>

                          </div>

                          <div><br>

                          </div>

                          <div>the question is that those code runs 40+

                            minutes on a server(16 core,32G mem),</div>

                          <div>the <span style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;"><span

                                style="word-wrap: break-word;">time

                                complexity is O(n),and there are not too

                                much operations,</span></span></div>

                          <div><span style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;"><span

                                style="word-wrap: break-word;">I think

                                it should be faster.So I want to ask

                                which part costs so much.</span></span></div>

                          <div><span style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;"><span

                                style="word-wrap: break-word;">I tried

                                the cProfile module </span></span><span

                              style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;">but didn't get

                              too much.</span></div>

                          <div><span style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;">I guess maybe it

                              is the int() operation that cost so

                              much,but I'm not sure </span></div>

                          <div> <span style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;"><span

                                style="word-wrap: break-word;">and don't

                                know how to solve this.</span></span></div>

                          <div><span style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;"><span

                                style="word-wrap: break-word;">Is there

                                a way to avoid type convertion in Python

                                such as scanf in C?</span></span></div>

                          <div><span style="font-family:

                              Helvetica,Arial,sans-serif; font-size:

                              13px; line-height: 18px;"><span

                                style="word-wrap: break-word;">Thanks

                                for your help ：）</span></span></div>

                        </blockquote>

                        <br>

                        Unfortunately python does not have a scanf

                        equivalent AFAIK. Most use cases for scanf can

                        be handled by regular expressions, but that

                        would clearly useless for you, and just slow you

                        down more since it does not perform the int

                        conversion for you.<br>

                        <br>

                        Your code appears to have a bug: I would expect

                        that the last entry will be lost unless both

                        files end with the same index value. Be sure to

                        test your code on a few short test files.<br>

                        <br>

                        I recommend psyco to make the whole thing

                        faster.<br>

                        <br>

                        Regards,<br>

                        Ken Seehart<br>

                        <br>

                      </blockquote>

                    </div>

                  </div>

                  Another thought (a bit of extra work, but you might

                  find it worthwhile if psyco doesn't yield a sufficient

                  boost):<br>

                  <br>

                  Write a couple filter programs to convert to and from

                  binary data (pairs of 32 or 64 bit integers depending

                  on your requirements).<br>

                  <br>

                  Modify your program to use the subprocess module to

                  open two instances of the binary conversion process

                  with the two input files. Then pipe the output of that

                  program into the binary to text filter.<br>

                  <br>

                  This might turn out to be faster since each process

                  would make use of a core. Also it gives you other

                  options, such as keeping your data in binary form for

                  processing, and converting to text only as needed.<br>

                  <br>

                  Ken Seehart<br>

                  <br>

                </div>

                <br>

                --<br>

                <a moz-do-not-send="true"

                  href="http://mail.python.org/mailman/listinfo/python-list"

                  target="_blank">http://mail.python.org/mailman/listinfo/python-list</a><br>

                <br>

              </blockquote>

            </div>

            <br>

          </div>

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>