Memory efficient tuple storage

psaffrey at googlemail.com psaffrey at googlemail.com
Fri Mar 13 14:13:29 EDT 2009


Thanks for all the replies.

First of all, can anybody recommend a good way to show memory usage? I
tried heapy, but couldn't make much sense of the output and it didn't
seem to change too much for different usages. Maybe I was just making
the h.heap() call in the wrong place. I also tried getrusage() in the
resource module. That seemed to give 0 for the shared and unshared
memory size no matter what I did. I was calling it after the function
call that filled up the lists. The memory figures I give in this
message come from top.
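
For reference, this is roughly how I was calling them - a sketch from
memory rather than the exact code:

from guppy import hpy    # heapy lives in the guppy package
import resource

h = hpy()
h.setrelheap()                # only count objects created after this point
# ... code that fills up the lists ...
print(h.heap())               # per-type object counts and total sizes

usage = resource.getrusage(resource.RUSAGE_SELF)
print(usage.ru_maxrss)        # peak resident size; Linux apparently leaves
                              # the shared/unshared ru_i*rss fields at 0,
                              # which would explain the zeros I was seeing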

The numpy solution does work, but it uses more than 1GB of memory for
one of my 130MB files. I'm using

np.dtype({'names': ['chromo', 'position', 'dpoint'],
          'formats': ['S6', 'i4', 'f8']})

so shouldn't it use 18 bytes per line? The file has 5832443 lines,
which by my arithmetic is around 100MB...?
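
The arithmetic does seem right for the raw array itself - a quick check,
assuming nothing about how the file actually gets parsed:

import numpy as np

dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
               'formats': ['S6', 'i4', 'f8']})

print(dt.itemsize)                  # 18 bytes per record
print(5832443 * dt.itemsize)        # ~105 MB of raw record data

arr = np.zeros(5832443, dtype=dt)   # allocating the array directly...
print(arr.nbytes)                   # ...gives the same ~105 MB

So the extra memory presumably comes from however the file gets parsed
into the array, not from the array itself.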

My previous solution - using a Python array for the numbers and a list
of tuples for the coordinates - uses about 900MB. The dictionary
solution suggested by Tim got this down to 650MB. If I just ignore the
coordinates, this comes down to less than 100MB. I feel sure the list
mechanics for storing the coordinates are what is killing me here.
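
Something like this is what I mean - the names are illustrative, not my
actual code, but sys.getsizeof makes the per-tuple overhead obvious:

import sys
from array import array

values = array('d')          # compact C doubles for the data points
coords = []                  # list of (chromo, position) tuples

values.append(0.5)
coords.append(('chr1', 12345))

t = ('chr1', 12345)
print(sys.getsizeof(t))      # the tuple object alone, several dozen bytes
print(sys.getsizeof(t[0]))   # plus the string object
print(sys.getsizeof(t[1]))   # plus the int object
# so each coordinate pair costs well over 100 bytes, before the list's
# own pointer array is counted, against ~10 bytes of actual data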

As to "work smarter", you could be right, but it's tricky. The 28
files are in 4 groups of 7, so given that each file is about 6 million
lines, each group of data points contains about 42 million points.
First, I need to divide every point by the median of its group. Then I
need to z-score the whole group of points.
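
In numpy terms the preparation step is just the following, assuming the
group's values end up in one float array (the in-place operations keep
it to a single 42-million-element array plus whatever np.median needs
internally):

import numpy as np

group = np.empty(42 * 10**6, dtype='f8')   # the ~42 million points of one group
# ... fill the array from the seven files ...

group /= np.median(group)    # divide every point by the group median
group -= group.mean()        # then z-score the whole group...
group /= group.std()         # ...in place, to avoid large temporaries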

After this preparation, I need to file each point, based on its
coordinates, into other data structures - the genome itself is divided
up into bins that cover a range of coordinates, and we file each point
into the appropriate bin for the coordinate region it overlaps. Then
there are operations that combine the values from various bins. The
relevant coordinates for these combinations come from more enormous
csv files. I've already done all this analysis on smaller datasets, so
I'm hoping I won't have to make huge changes just to fit the data into
memory. Yes, I'm also finding out how much it will cost to upgrade to
32GB of memory :)
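
For what it's worth, the binning itself is cheap - it's something along
these lines, with fixed-width bins here purely for illustration (the
bin boundaries below are my own made-up numbers):

from collections import defaultdict

BIN_SIZE = 10000                  # made-up bin width, just for illustration

def bin_for(chromo, position):
    """Map a point's coordinates to the bin covering them."""
    return (chromo, position // BIN_SIZE)

bins = defaultdict(list)
for chromo, position, dpoint in [('chr1', 12345, 0.7), ('chr1', 19999, 1.2)]:
    bins[bin_for(chromo, position)].append(dpoint)

# the combining operations then pull values out of whichever bins the
# coordinates from the CSV files point at, e.g. bins[('chr1', 1)]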

Sorry for the long message...

Peter


