Memory efficient tuple storage

Kurt Smith kwmsmith at gmail.com
Fri Mar 13 16:40:29 EDT 2009


On Fri, Mar 13, 2009 at 1:13 PM, psaffrey at googlemail.com
<psaffrey at googlemail.com> wrote:
> Thanks for all the replies.
>
[snip]
>
> The numpy solution does work, but it uses more than 1GB of memory for
> one of my 130MB files. I'm using
>
> np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6',
> 'i4', 'f8']})
>
> so shouldn't it use 18 bytes per line? The file has 5832443 lines,
> which by my arithmetic is around 100MB...?

I made a mock-up file with 5832443 lines, each line consisting of

abcdef 100 100.0

and ran the g2arr() function with 'S6' for the string.  While it ran
(which took quite a while), memory usage on my computer spiked to
around 800MB, but once g2arr() returned, usage settled at around
200MB.  The number of bytes consumed by the array itself is 105MB
(from arr.nbytes).  Looking at the loadtxt routine in numpy, it
appears that a zillion temporary objects are created (string objects
from splitting each line; temporary ints, floats, and strings for
type conversions; etc.) inside the routine, all of which are garbage
collected on return.  I'm not well versed in Python's internal memory
management, but as I understand it, practically all of that memory is
either returned to the OS or held onto by Python for reuse by other
objects once the routine returns.  The only memory actually in use by
the array is the ~100MB for the raw data.
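
For reference, here's a minimal sketch of that test.  The g2arr()
function came from earlier in the thread and isn't reproduced here,
so this just calls np.loadtxt directly with the dtype from the
original post; the file name is made up:

    import numpy as np

    # dtype from the original post: 6-byte string + 4-byte int +
    # 8-byte float = 18 bytes per row
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S6', 'i4', 'f8']})

    # write a mock file of 5832443 identical lines
    with open('mock.txt', 'w') as f:
        for _ in range(5832443):
            f.write('abcdef 100 100.0\n')

    arr = np.loadtxt('mock.txt', dtype=dt)
    print(arr.nbytes)    # 18 * 5832443 = 104983974 bytes, ~105MB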

Making 5 copies of the array (using numpy.copy(arr)) bumps total
memory usage (from top) up to 700MB, which works out to about 117MB
per array.  Summing arr.nbytes over all six arrays gives 630MB (105MB
per array), so there isn't much memory wasted.  Basically, the numpy
solution packs the data into an array of C structs with the fields
indicated by the dtype parameter.
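
That measurement, roughly (assuming arr is the array loaded above):

    # five independent copies plus the original: six arrays total
    copies = [np.copy(arr) for _ in range(5)]
    total = arr.nbytes + sum(c.nbytes for c in copies)
    print(total)    # 6 * 104983974 = 629903844 bytes, ~630MB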

Perhaps a database solution, as mentioned in other posts, would suit
you better; or, if the temporary spike in memory usage is
unacceptable, you could roll your own loadtxt replacement that is
leaner and meaner (a sketch follows).  I'd suggest the numpy solution
for its ease and efficient use of memory.
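
A leaner loader might preallocate the structured array and fill it
row by row, so the only per-line temporaries are the three split
fields.  This is just a sketch, assuming the row count is known up
front and the file is well formed:

    import numpy as np

    def lean_load(filename, nrows):
        # preallocate the full structured array; no per-line lists
        dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                       'formats': ['S6', 'i4', 'f8']})
        arr = np.empty(nrows, dtype=dt)
        with open(filename) as f:
            for i, line in enumerate(f):
                chromo, position, dpoint = line.split()
                arr[i] = (chromo, int(position), float(dpoint))
        return arr

    arr = lean_load('mock.txt', 5832443)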

Kurt


