<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<blockquote type="cite">
<pre wrap="">I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file. You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way. </pre>
</blockquote>
I did some back-of-the-envelope calcs which more or less agreed with
heapy. Per line of the file, the code stores one string, which is
about 50 chars on average, plus one 32-char MD5 hex string. Per
sys.getsizeof(), each string carries about 40 bytes of overhead. I'm
also storing an int (24 bytes) and a &lt;10 char string in an object
with __slots__ set. Per heapy, each such object takes 64 bytes plus
instance variable storage (this is one area where I might be
underestimating things), so per line:<br>
<br>
50 + 32 + 10 + 3 * 40 + 24 + 64 = 300 bytes per line * 2M lines =
~600MB plus some memory for the dicts, which is about what heapy is
reporting (note I'm currently not actually running all 2M lines, I'm
just running subsets for my tests).<br>
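<br>
In case it helps, here's roughly what that per-line structure looks
like, with sys.getsizeof() checking the numbers above (a simplified
sketch; LineRecord and its field names are stand-ins, not my actual
class):<br>
<pre wrap="">import sys

class LineRecord(object):
    # Stand-in for the real per-line object: an int plus a short
    # (under-10-char) string, held in __slots__ as described above.
    __slots__ = ('count', 'tag')
    def __init__(self, count, tag):
        self.count = count
        self.tag = tag

line = 'x' * 50      # ~50-char line content
digest = 'f' * 32    # 32-char MD5 hex digest
rec = LineRecord(1, 'subset')

print(sys.getsizeof(line))    # ~50 chars + ~40 bytes overhead
print(sys.getsizeof(digest))  # ~32 chars + ~40 bytes overhead
print(sys.getsizeof(rec))     # compare with heapy's 64 bytes/object
                              # (exact numbers vary by build)
</pre>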
<br>
Is there something I'm missing? Here's the heapy output after
loading ~300k lines:<br>
<br>
<pre wrap="">Partition of a set of 1199849 objects. Total size = 89965376 bytes.
 Index  Count   %     Size   % Cumulative  % Kind
     0 599999  50 38399920  43   38399920  43 str
     1      5   0 25167224  28   63567144  71 dict
     2 299998  25 19199872  21   82767016  92 0xa13330
     3 299836  25  7196064   8   89963080 100 int
     4      4   0     1152   0   89964232 100 collections.defaultdict
</pre>
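(For reference, that partition comes from guppy's heapy, invoked
roughly like this:)<br>
<pre wrap="">from guppy import hpy

hp = hpy()
# ... load the ~300k-line subset here ...
print(hp.heap())   # prints the partition table shown above
</pre>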
Note that 3 of the dicts are empty. I assume that 0xa13330 is the
address of the class of my __slots__ objects (the count matches the
number of lines loaded). I'd actually expect to see 900k strings, but
the &lt;10 char string is always the same in this case, so perhaps
the runtime is reusing a single object...? At this point, top reports
python as using 1.1 GB of virtual memory and 1.0 GB resident.<br>
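<br>
One way to test that shared-object guess is to compare identities
directly; intern() would force the sharing if the runtime isn't
already doing it (a sketch; note intern() moved to sys.intern in
Python 3):<br>
<pre wrap="">a = 'subset'                   # stand-in for the repeated short string
b = ''.join(['subs', 'et'])    # equal value, but built at runtime
print(a == b)                  # True: same value
print(a is b)                  # typically False: two separate objects

# intern() guarantees one shared object per distinct value:
print(intern(a) is intern(b))  # True: a single shared object
</pre>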
<br>
<blockquote type="cite">
<pre wrap="">I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.</pre>
</blockquote>
That I don't know, but that would only explain, at most, a 2x
increase in memory over the heapy report, wouldn't it? Not the ~10x
I'm seeing.<br>
<br>
<blockquote type="cite">
<pre wrap="">Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.</pre>
</blockquote>
That's certainly the way the code is written, and heapy seems to
confirm that the strings aren't duplicated in memory.<br>
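<br>
For what it's worth, the pattern that guarantees this is roughly the
following (a simplified sketch; canon() and the map names are made
up, not my actual code):<br>
<pre wrap="">import hashlib

_canonical = {}   # digest -&gt; the one shared string instance

def canon(s):
    # Return a single shared object per distinct value, so both
    # maps hold the same string object rather than two copies.
    return _canonical.setdefault(s, s)

md5_to_lines = {}
file_to_md5s = {}

def record(line, filename):
    digest = canon(hashlib.md5(line.encode('utf-8')).hexdigest())
    md5_to_lines.setdefault(digest, []).append(line)
    file_to_md5s.setdefault(filename, []).append(digest)
</pre>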
<br>
Thanks for sticking with me on this,<br>
<br>
MrsE<br>
<br>
On 9/25/2012 4:06 AM, Dave Angel wrote:
<blockquote cite="mid:50619035.3080106@davea.name" type="cite">
<pre wrap="">On 09/25/2012 12:21 AM, Junkshops wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">Just curious; which is it, two million lines, or half a million bytes?
</pre>
</blockquote>
</blockquote>
<pre wrap=""><snip>
</pre>
<blockquote type="cite">
<pre wrap="">
Sorry, that should've been a 500MB, 2M line file.
</pre>
<blockquote type="cite">
<pre wrap="">which machine is 2gb, the Windows machine, or the VM?
</pre>
</blockquote>
<pre wrap="">VM. Winders is 4gb.
</pre>
<blockquote type="cite">
<pre wrap="">...but I would point out that just because
you free up the memory from the Python doesn't mean it gets released
back to the system. The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained. It's not normally a
problem, since most small blocks are reused. But it can get
fragmented. And i have no idea how well Virtual Box maps the Linux
memory map into the Windows one.
</pre>
</blockquote>
<pre wrap="">Right, I understand that - but what's confusing me is that, given the
memory use is (I assume) monotonically increasing, the code should never
use more than what's reported by heapy once all the data is loaded into
memory, given that memory released by the code to the Python runtime is
reused. To the best of my ability to tell I'm not storing anything I
shouldn't, so the only thing I can think of is that all the object
creation and destruction, for some reason, is preventing reuse of
memory. I'm at a bit of a loss regarding what to try next.
</pre>
</blockquote>
<pre wrap="">
I'm not familiar with heapy, but perhaps it's missing something there.
I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file. You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way. I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.
Perhaps one way to save space would be to use a long to store those md5
values. You'd have to measure it, but I suspect it'd help (at the cost
of lots of extra hexlify-type calls). Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.
</pre>
</blockquote>
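<br>
P.S. If I do try the long idea, I'd expect it to look something like
this (a sketch; the saving comes from dropping the 32-char hex
string's payload and per-string overhead):<br>
<pre wrap="">import hashlib

hexdigest = hashlib.md5(b'some line').hexdigest()

# Store the 128-bit value as a long instead of a 32-char string:
as_long = int(hexdigest, 16)

# Hexlify back only when needed for display:
print('%032x' % as_long == hexdigest)   # True
</pre>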
</body>
</html>