<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<blockquote type="cite">
<pre wrap="">I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file. You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way. </pre>
</blockquote>
I did some back-of-the-envelope calcs which more or less agreed with
heapy. Per line of the file, the code stores one string, which is
about 50 chars on average, plus one 32-char MD5 hex string. Per
sys.getsizeof(), each string carries about 40 bytes of overhead. I'm
also storing an int (24 bytes) and a &lt;10 char string in an object
with __slots__ set. Per heapy, each such object takes 64 bytes plus
instance variable storage (this is one area where I might be
underestimating things), so per line:<br>
<br>
50 + 32 + 10 + 3 * 40 + 24 + 64 = 300 bytes per line * 2M lines =
~600MB plus some memory for the dicts, which is about what heapy is
reporting (note I'm currently not actually running all 2M lines, I'm
just running subsets for my tests).<br>
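<br>
In case it helps, here's roughly what that per-line structure looks
like, with sys.getsizeof() checking the numbers above (a simplified
sketch; LineRecord and its field names are stand-ins, not my actual
class):<br>
<pre wrap="">import sys

class LineRecord(object):
    # Stand-in for the real per-line object: an int plus a short
    # (under-10-char) string, held in __slots__ as described above.
    __slots__ = ('count', 'tag')
    def __init__(self, count, tag):
        self.count = count
        self.tag = tag

line = 'x' * 50      # ~50-char line content
digest = 'f' * 32    # 32-char MD5 hex digest
rec = LineRecord(1, 'subset')

print(sys.getsizeof(line))    # ~50 chars + ~40 bytes overhead
print(sys.getsizeof(digest))  # ~32 chars + ~40 bytes overhead
print(sys.getsizeof(rec))     # compare with heapy's 64 bytes/object
                              # (exact numbers vary by build)
</pre>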
<br>
Is there something I'm missing? Here's the heapy output after
loading ~300k lines:<br>
<br>
<pre wrap="">Partition of a set of 1199849 objects. Total size = 89965376 bytes.
 Index  Count   %     Size   % Cumulative  % Kind
     0 599999  50 38399920  43   38399920  43 str
     1      5   0 25167224  28   63567144  71 dict
     2 299998  25 19199872  21   82767016  92 0xa13330
     3 299836  25  7196064   8   89963080 100 int
     4      4   0     1152   0   89964232 100 collections.defaultdict
</pre>
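(For reference, that partition comes from guppy's heapy, invoked
roughly like this:)<br>
<pre wrap="">from guppy import hpy

hp = hpy()
# ... load the ~300k-line subset here ...
print(hp.heap())   # prints the partition table shown above
</pre>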
Note that 3 of the dicts are empty. I assume that 0xa13330 is the
address of the class of my __slots__ objects (the count matches the
number of lines loaded). I'd actually expect to see 900k strings, but
the &lt;10 char string is always the same in this case, so perhaps
the runtime is reusing a single object...? At this point, top reports
python as using 1.1 GB of virtual memory and 1.0 GB resident.<br>
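<br>
One way to test that shared-object guess is to compare identities
directly; intern() would force the sharing if the runtime isn't
already doing it (a sketch; note intern() moved to sys.intern in
Python 3):<br>
<pre wrap="">a = 'subset'                   # stand-in for the repeated short string
b = ''.join(['subs', 'et'])    # equal value, but built at runtime
print(a == b)                  # True: same value
print(a is b)                  # typically False: two separate objects

# intern() guarantees one shared object per distinct value:
print(intern(a) is intern(b))  # True: a single shared object
</pre>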
<br>
<blockquote type="cite">
<pre wrap="">I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.</pre>
</blockquote>
That I don't know, but that would only explain, at most, a 2x
increase in memory over the heapy report, wouldn't it? Not the ~10x
I'm seeing.<br>
<br>
<blockquote type="cite">
<pre wrap="">Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.</pre>
</blockquote>
That's certainly the way the code is written, and heapy seems to
confirm that the strings aren't duplicated in memory.<br>
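<br>
For what it's worth, the pattern that guarantees this is roughly the
following (a simplified sketch; canon() and the map names are made
up, not my actual code):<br>
<pre wrap="">import hashlib

_canonical = {}   # digest -&gt; the one shared string instance

def canon(s):
    # Return a single shared object per distinct value, so both
    # maps hold the same string object rather than two copies.
    return _canonical.setdefault(s, s)

md5_to_lines = {}
file_to_md5s = {}

def record(line, filename):
    digest = canon(hashlib.md5(line.encode('utf-8')).hexdigest())
    md5_to_lines.setdefault(digest, []).append(line)
    file_to_md5s.setdefault(filename, []).append(digest)
</pre>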
<br>
Thanks for sticking with me on this,<br>
<br>
MrsE<br>
<br>
On 9/25/2012 4:06 AM, Dave Angel wrote:
<blockquote cite="mid:50619035.3080106@davea.name" type="cite">
<pre wrap="">On 09/25/2012 12:21 AM, Junkshops wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">Just curious; which is it, two million lines, or half a million bytes?
</pre>
</blockquote>
</blockquote>
<pre wrap=""><snip>
</pre>
<blockquote type="cite">
<pre wrap="">
Sorry, that should've been a 500MB, 2M line file.
</pre>
<blockquote type="cite">
<pre wrap="">which machine is 2gb, the Windows machine, or the VM?
</pre>
</blockquote>
<pre wrap="">VM. Winders is 4gb.
</pre>
<blockquote type="cite">
<pre wrap="">...but I would point out that just because
you free up the memory from the Python doesn't mean it gets released
back to the system. The C runtime manages its own heap, and is pretty
persistent about hanging onto memory once obtained. It's not normally a
problem, since most small blocks are reused. But it can get
fragmented. And i have no idea how well Virtual Box maps the Linux
memory map into the Windows one.
</pre>
</blockquote>
<pre wrap="">Right, I understand that - but what's confusing me is that, given the
memory use is (I assume) monotonically increasing, the code should never
use more than what's reported by heapy once all the data is loaded into
memory, given that memory released by the code to the Python runtime is
reused. To the best of my ability to tell I'm not storing anything I
shouldn't, so the only thing I can think of is that all the object
creation and destruction, for some reason, is preventing reuse of
memory. I'm at a bit of a loss regarding what to try next.
</pre>
</blockquote>
<pre wrap="">
I'm not familiar with heapy, but perhaps it's missing something there.
I'm a bit surprised you aren't beyond the 2gb limit, just with the
structures you describe for the file. You do realize that each object
has quite a few bytes of overhead, so it's not surprising to use several
times the size of a file, to store the file in an organized way. I also
wonder if heapy has been written to take into account the larger size of
pointers in a 64bit build.
Perhaps one way to save space would be to use a long to store those md5
values. You'd have to measure it, but I suspect it'd help (at the cost
of lots of extra hexlify-type calls). Another thing is to make sure
that the md5 object used in your two maps is the same object, and not
just one with the same value.
</pre>
</blockquote>
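<br>
P.S. If I do try the long idea, I'd expect it to look something like
this (a sketch; the saving comes from dropping the 32-char hex
string's payload and per-string overhead):<br>
<pre wrap="">import hashlib

hexdigest = hashlib.md5(b'some line').hexdigest()

# Store the 128-bit value as a long instead of a 32-char string:
as_long = int(hexdigest, 16)

# Hexlify back only when needed for display:
print('%032x' % as_long == hexdigest)   # True
</pre>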
</body>
</html>