[Tutor] memory consumption

Andre' Walker-Loud walksloud at gmail.com
Wed Jul 3 21:50:38 CEST 2013


Hi Steven, Dave, Allen and All,

OK - forgive my poor terminology.  That is also something I am learning.
The three of you asked very similar questions - so I will just reply this once.


>> I wrote some code that is running out of memory.
> 
> How do you know? What are the symptoms? Do you get an exception? Computer crashes? Something else?

I am using Mac OS with "Activity Monitor" open, and can watch the memory slowly get consumed.  Then, about two-thirds of the way through, it crashes when the memory runs out:

Python(50827,0xa051c540) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Bus error

I have 8GB of RAM, and the underlying data file I start with is small (see below).

>> It involves a set of three nested loops, manipulating a data file (array) of dimension ~ 300 x 256 x 1 x 2.
> 
> Is it a data file, or an array? They're different things.

OK.  I begin with a data file (hdf5 to be specific).  I load a piece of the data file into memory as a numpy array.  The size of the array is 

300 x 256 x 1 x 2

and each element is a double-precision float (8 bytes), so the total array is ~1.2 MB.
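For the record, the arithmetic works out to a bit under 1.2 MiB:

```python
import numpy as np

# An array with the stated shape, one 8-byte double per element
a = np.zeros((300, 256, 1, 2), dtype=np.float64)
print(a.nbytes)            # 300 * 256 * 1 * 2 * 8 = 1228800 bytes
print(a.nbytes / 2.0**20)  # ~1.17 MiB
```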

My code is not boiled down to the simplest failing state, which is why I initially tried to just describe it.  But I will get more specific below.

>> ################################################
>> # generic code skeleton
>> # import a class I wrote to utilize the 3rd party software
>> import my_class
> 
> Looking at the context here, "my_class" is a misleading name, since it's actually a module, not a class.

Yes.  In "my_class" I wrote a class.  I think the essentials are

##########################
# my_class  # perhaps a poor name, but I will stick with it for the re-post
''' 
The third_party software performs numerical minimization.  It has a function that constructs the function to be minimized, and then a routine to actually do the minimization.  It minimizes a chi^2 with respect to "other_vars"
'''
import third_party

class do_stuff:
    # I am aware this doesn't follow the class naming convention;
    # just sticking with the name from my previous post
    def __call__(self, data, other_vars):
        self.fit = third_party.function_set_up(data, other_vars)

    def minimize(self):
        try:
            self.fit.minimize()
            self.have_fit = True
        except third_party.Error:
            self.have_fit = False
##########################

I should say the above construction works as I expect when I do a single fit.

>> # instantiate the class do_stuff
>> my_func = my_class.do_stuff()
[snip]
>> # I am manipulating a data array of size ~ 300 x 256 x 1 x 2
>> data = my_data  # my_data is imported just once and has the size above
> 
> Where, and how, is my_data imported from? What is it? You say it is "a data array" (what sort of data array?) of size 300x256x1x2 -- that's a four-dimensional array, with 153600 entries. What sort of entries? Is that 153600 bytes (about 150K) or 153600 x 64-bit floats (about 1.3 MB)? Or 153600 data structures, each one holding 1MB of data (about 153 GB)?

See above - each element is a double-precision (64-bit) float, so the array is ~1.2 MB.

>> # instantiate a 3d array of size 20 x 10 x 10 and fill it with all zeros
>> my_array = numpy.zeros([20,10,10])
> 
> At last, we finally see something concrete! A numpy array. Is this the same sort of array used above?
> 
> 
>> # loop over parameters and fill array with desired output
>> for i in range(loop_1):
>>     for j in range(loop_2):
>>         for k in range(loop_3):
> 
> How big are loop_1, loop_2, loop_3?

The loops are not big:
loop_1 = 20
loop_2 = 10
loop_3 = 10

> You should consider using xrange() rather than range(). If the number is very large, xrange will be more memory efficient.
> 
> 
>>             # create tmp_data that has a shape which is the same as data except the first dimension can range from 1 - 1024 instead of being fixed at 300
>>             '''  Is the next line where I am causing memory problems? '''
>>             tmp_data = my_class.chop_data(data,i,j,k)
> 
> How can we possibly tell if chop_data is causing memory problems when you don't show us what chop_data does?

OK, below is chop data

###################
import numpy as np
def chop_data(data, N, M):
    # data is a numpy array; data.shape[0] is the size of its first dimension
    n_orig = data.shape[0]
    data_new = []
    for b in range(N):
        # draw M random samples (with replacement) from the original data and
        # take their mean; np.random.randint(0, n_orig, M) returns an array of
        # M indices in the range 0 to n_orig-1
        data_new.append(data[np.random.randint(0, n_orig, M)].mean(axis=0))
    return np.array(data_new)  # convert the list of means to a numpy array
###################
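As an aside (this may not be the source of the memory growth), the Python-level loop and list-append in chop_data can be replaced with a single fancy-indexing step, which avoids building the intermediate list.  A sketch, where chop_data_vec is my name for it, not from the post:

```python
import numpy as np

def chop_data_vec(data, N, M):
    # Draw all N*M indices at once; data[idx] has shape (N, M, ...) and
    # averaging over axis 1 gives the same (N, ...) result as the loop version.
    n_orig = data.shape[0]
    idx = np.random.randint(0, n_orig, size=(N, M))
    return data[idx].mean(axis=1)
```

For data of shape (300, 256, 1, 2), chop_data_vec(data, N, M) returns an array of shape (N, 256, 1, 2), matching the loop version.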

To be clear, here is a rewrite of the loop from my original post with the actual details:

###################
import my_class
import numpy as np

my_func = my_class.do_stuff()
# data is the ~1.2 MB numpy array loaded once from the hdf5 file (see above)
other_vars = np.array([''' list of guess initial values for numerical minimization '''])
loop_lst = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024])
n_sample = 20
my_array = np.zeros([n_sample, len(loop_lst), len(loop_lst)])
for s in range(n_sample):
    for n_i, n in enumerate(loop_lst):
        for m_i, m in enumerate(loop_lst):
            tmp_data = my_class.chop_data(data, n, m)
            my_func(tmp_data, other_vars)
            my_func.minimize()
            # results() is a built-in third_party function returning a 64-bit float
            my_array[s, n_i, m_i] = my_func.fit.results()


This all runs as expected, except that with each pass through the loops the code consumes more and more memory.  Two-thirds of the way through the outer loop, it crashes with the above-mentioned memory error.
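One way to see which line is accumulating memory (on Python 3.4+, via the standard-library tracemalloc module; note it only sees Python-level allocations, so a leak inside the third_party C code would not show up) is to compare snapshots between passes of the outer loop.  A minimal sketch with a simulated leak standing in for the fit loop:

```python
import tracemalloc

tracemalloc.start()
previous = tracemalloc.take_snapshot()

held = []  # stand-in for whatever the fit loop is retaining
for s in range(3):  # stand-in for the outer loop over n_sample
    held.append(bytearray(10**6))  # simulate one pass that retains ~1 MB
    current = tracemalloc.take_snapshot()
    # print the source lines whose allocations grew the most since last pass
    for stat in current.compare_to(previous, 'lineno')[:3]:
        print(stat)
    previous = current
```

Running the real loop under this kind of instrumentation should show whether the growth is in chop_data, in my_array, or (if nothing Python-side grows) inside the third-party minimizer.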


Hopefully, this is more clear.


Andre



