[Tutor] memory consumption

Oscar Benjamin oscar.j.benjamin at gmail.com
Wed Jul 3 22:25:10 CEST 2013


On 3 July 2013 20:50, Andre' Walker-Loud <walksloud at gmail.com> wrote:
> Hi Steven, Dave, Allen and All,
>
> OK - forgive my poor terminology.  That is also something I am learning.
> The three of you asked very similar questions - so I will just reply this once.
>
>
>>> I wrote some code that is running out of memory.
>>
>> How do you know? What are the symptoms? Do you get an exception? Computer crashes? Something else?
>
> I am using mac OS, and have the "Activity Monitor" open, and can watch the memory slowly get consumed.  Then, a crash when it runs out, about 2/3 of the way through:
>
> Python(50827,0xa051c540) malloc: *** mmap(size=16777216) failed (error code=12)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> Bus error

The error is for creating an mmap object. This is not something that
numpy does unless you tell it to.
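
Just to show the contrast, this is roughly what it looks like when you
do ask numpy for a memory-mapped array (the file names here are only
placeholders):

import numpy as np

# an array explicitly backed by a scratch file on disk
a = np.memmap('scratch.dat', dtype=np.float64, mode='w+',
              shape=(300, 256, 1, 2))

# or memory-map a previously saved .npy file instead of reading it in
b = np.load('saved.npy', mmap_mode='r')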

> I have 8GB of RAM, and the underlying data file I start with is small (see below).
>
>>> It involves a set of three nested loops, manipulating a data file (array) of dimension ~ 300 x 256 x 1 x 2.
>>
>> Is it a data file, or an array? They're different things.
>
> OK.  I begin with a data file (hdf5 to be specific).

So you're using pytables or h5py, or something else? It really would
help if you would specify this instead of trying to be generic. My
guess is that the hdf5 library loads the array as an mmap'ed memory
block and you're not actually working with an ordinary numpy array
(even if it has a similar interface).
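
If it is h5py, for example, you can check what you are actually
holding - the file and dataset names below are only guesses:

import h5py

with h5py.File('myfile.h5', 'r') as f:
    dset = f['data']   # an h5py Dataset object, not a numpy array
    arr = dset[...]    # slicing copies it into an ordinary in-memory ndarray
    print(type(arr))   # should be a plain numpy.ndarray
    print(arr.nbytes)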

> I load a piece of the data file into memory as a numpy array.  The size of the array is
>
> 300 x 256 x 1 x 2
>
> and each element is a double precision float, so the total array is ~1.3MB

Have you checked the actual memory size of the array? If it's a real
numpy array you can use the nbytes attribute:
>>> import numpy
>>> a = numpy.zeros([300, 256, 1, 2], float)
>>> a.nbytes
1228800

>
> My code is not in the boiled-down-to-simplest-failing-state, which is why I attempted to just describe.

I think you should boil it down to that simplest failing state. Your
attempt to summarise misses too much relevant information.

> But I will get more specific below.
>
>>> ################################################
>>> # generic code skeleton
>>> # import a class I wrote to utilize the 3rd party software
>>> import my_class
>>
>> Looking at the context here, "my_class" is a misleading name, since it's actually a module, not a class.
>
> Yes.  In "my_class" I wrote a class.  I think the essentials are
>
> ##########################
> # my_class  # perhaps a poor name, but I will stick with it for the re-post
> '''
> The third_party software performs numerical minimization.  It has a function that constructs the function to be minimized, and then a routine to actually do the minimization.  It minimizes a chi^2 with respect to "other_vars"
> '''
> import third_party
>
> class do_stuff:
> # I am aware this doesn't follow the class naming convention, just sticking with my previous post name
>     def __call__(self,data,other_vars):
>         self.fit = third_party.function_set_up(data,other_vars)
>
>     def minimize(self):
>         try:
>             self.fit.minimize()
>             self.have_fit = True
>         except third_party.Error:
>             self.have_fit = False
> ##########################

If you write code like the above then you really cannot expect other
people to understand what you mean unless you show them the actual
code. Specifically, the use of __call__ is confusing. Really, though,
this class is just a distraction from your problem and should have
been simplified away.
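
For what it's worth, here is the kind of simplification I mean,
reusing only the third_party names from your post (I still don't know
what those calls actually do):

import third_party

def fit_once(data, other_vars):
    # set up and run a single minimisation; return its result, or None on failure
    fit = third_party.function_set_up(data, other_vars)
    try:
        fit.minimize()
    except third_party.Error:
        return None
    return fit.results()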

>
> I should say the above construction works as I expect when I do a single fit.
>
>>> # instantiate the function do_stuff
>>> my_func = my_class.do_stuff()
> [snip]
>>> # I am manipulating a data array of size ~ 300 x 256 x 1 x 2
>>> data = my_data  # my_data is imported just once and has the size above
>>
>> Where, and how, is my_data imported from? What is it? You say it is "a data array" (what sort of data array?) of size 300x256x1x2 -- that's a four-dimensional array, with 153600 entries. What sort of entries? Is that 153600 bytes (about 150K) or 153600 x 64-bit floats (about 1.3 MB)? Or 153600 data structures, each one holding 1MB of data (about 153 GB)?
>
> see above - each element is a double precision (64-bit) float, so ~1.3 MB.
>
>>> # instantiate a 3d array of size 20 x 10 x 10 and fill it with all zeros
>>> my_array = numpy.zeros([20,10,10])
>>
>> At last, we finally see something concrete! A numpy array. Is this the same sort of array used above?
>>
>>
>>> # loop over parameters and fill array with desired output
>>> for i in range(loop_1):
>>>     for j in range(loop_2):
>>>         for k in range(loop_3):
>>
>> How big are loop_1, loop_2, loop_3?
>
> The sizes of the loops are not big
> len(loop_1) = 20
> len(loop_2) = 10
> len(loop_3) = 10
>
>> You should consider using xrange() rather than range(). If the number is very large, xrange will be more memory efficient.

Or numpy.nditer in this case.
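
For example, something like this - the arithmetic is just a stand-in
for whatever you compute at each (i, j, k):

import numpy as np

my_array = np.zeros([20, 10, 10])
it = np.nditer(my_array, flags=['multi_index'], op_flags=['writeonly'])
for cell in it:
    i, j, k = it.multi_index
    cell[...] = i * j * k   # replace with the real per-element result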

>>
>>
>>>             # create tmp_data that has a shape which is the same as data except the first dimension can range from 1 - 1024 instead of being fixed at 300
>>>             '''  Is the next line where I am causing memory problems? '''
>>>             tmp_data = my_class.chop_data(data,i,j,k)
>>
>> How can we possibly tell if chop_data is causing memory problems when you don't show us what chop_data does?
>
> OK, below is chop data
>
> ###################
> import numpy as np
> def chop_data(data,N,M):
>     n_orig = data.shape[0] # data is a numpy array, data.shape returns the size of each dimension
>     data_new = []
>     for b in range(N):
>         # draw M random samples with repeat from original data and take the mean
>         # np.random.randint(0,n_orig,M) returns an array of M random integers from 0 to n_orig-1
>         data_new.append(data[np.random.randint(0,n_orig,M)].mean(axis=0))
>     data_new = np.array(data_new) # convert to a numpy array

The above function can be vectorised to something like:

def chop_data(data, N, M):
    return data[np.random.randint(0, data.shape[0], (M, N))].mean(axis=0)

>     return data_new
> ###################
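
As a rough sanity check of that vectorised version, with random
stand-in data of the shape you describe:

import numpy as np

def chop_data(data, N, M):
    return data[np.random.randint(0, data.shape[0], (M, N))].mean(axis=0)

data = np.random.rand(300, 256, 1, 2)   # random stand-in for the real array
print(chop_data(data, 8, 16).shape)     # (8, 256, 1, 2): N means of M samples each
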
>
> To be clear, I will re-write the loop I had in the original post
>
> ###################
> import my_class
> import numpy
> my_func = my_class.do_stuff()
> other_vars = numpy.array([''' list of guess initial values for numerical minimization '''])
> loop_lst = numpy.array([1,2,4,8,16,32,64,128,256,512,1024])
> n_sample = 20
> my_array = numpy.zeros([n_sample,len(loop_lst),len(loop_lst)])
> for s in range(n_sample):
>     for n_i,n in enumerate(loop_lst):
>         for m_i,m in enumerate(loop_lst):
>             tmp_data = my_class.chop_data(data,n,m)

Where did data come from? Is that the mmap'ed array from the hdf5 library?
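
If it is, one quick experiment would be to force a plain in-memory
copy before the loops and see whether the memory growth changes:

data = numpy.array(data)  # copies whatever the hdf5 layer returned into an ordinary ndarray
print(data.nbytes)        # should be about 1.3 MB for 300 x 256 x 1 x 2 doubles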

>             my_func(tmp_data,other_vars)
>             my_func.minimize()

I now know that the above two lines call third_party.function_set_up()
and my_func.fit.minimize(). I still have no idea what they do, though.

>             my_array[s,n_i,m_i] = my_func.fit.results() # returns a 64-bit float; results() is a built-in third_party function


> This all runs as expected, except each pass through the loops, the code consumes more and more memory.  Two-thirds through the outer loop, it crashes with the above mentioned memory error.
>
>
> Hopefully, this is more clear.

Only slightly. The details that you choose to include are not the ones
that are needed to understand your problem. Instead of paraphrasing,
simplify this into a short but *complete* example and post that.
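
While you boil it down, it can also help to print the process's peak
memory use each time round the outer loop so you can see where the
growth happens. Something like this, using only the standard library
(Unix-only; the deliberate allocation is just to show the readout):

import resource

def peak_rss():
    # peak resident set size so far: kilobytes on Linux, bytes on OS X
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

junk = []
for i in range(5):
    junk.append(bytearray(10 * 1024 * 1024))   # deliberately hold on to another 10 MB
    print(peak_rss())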


Oscar

