numpy large arrays?
Hi all
I need to perform computations involving large arrays: many rows and no more than, e.g., 34 columns. My first choice is Python/NumPy because I'm already used to coding in MATLAB.
However, I'm experiencing memory problems even though there is still 500 MB available (2 GB total). I have boiled my code down to the following meaningless snippet. It shares some of the same structure and calls as my real program and shows the same behaviour.
********************************************************
import numpy as N
import scipy as S

def stress():
    x = S.randn(200000,80)
    for i in range(8):
        print "%(0)d" % {"0": i}
        s = N.dot(x.T, x)
        sd = N.array([s.diagonal()])
        r = N.dot(N.ones((N.size(x,0),1),'d'), sd)
        x = x + r
        x = x / 1.01
********************************************************
Two different symptoms appear, depending on how big x is:
1) The program becomes extremely slow after a few iterations.
2) If the size of x is increased a little, the program fails with the message "MemoryError", for example at the line 'x = x + r', but at different places in the code depending on the matrix size and which computer I'm testing on. This might also occur after several iterations, not just during the first pass.
I'm using Windows XP, ActivePython 2.5.1.1, NumPy 1.0.4, SciPy 0.6.0.
Is there an error under the hood in NumPy?
Am I balancing on the edge of the performance of Python/NumPy, and should I consider other environments (Fortran, C, BLAS, LAPACK, etc.)?
Am I misusing NumPy? Would changing my coding style be a good workaround and even work on larger datasets without errors?
Thanks in advance /Søren
On Wed, Dec 12, 2007 at 03:29:57PM +0100, Søren Dyrsting wrote:
> I need to perform computations involving large arrays: many rows and no more than, e.g., 34 columns. My first choice is Python/NumPy because I'm already used to coding in MATLAB.
> However, I'm experiencing memory problems even though there is still 500 MB available (2 GB total). I have boiled my code down to the following meaningless snippet. It shares some of the same structure and calls as my real program and shows the same behaviour.
I would guess that this is due to memory fragmentation. Have you tried the same experiment under Linux? This article details some of the problems you may encounter under Windows:
http://www.ittvis.com/services/techtip.asp?ttid=3346
Regards Stéfan
On Dec 12, 2007 7:29 AM, Søren Dyrsting sorendyrsting@gmail.com wrote:
> Hi all
> I need to perform computations involving large arrays: many rows and no more than, e.g., 34 columns. My first choice is Python/NumPy because I'm already used to coding in MATLAB.
> However, I'm experiencing memory problems even though there is still 500 MB available (2 GB total). I have boiled my code down to the following meaningless snippet. It shares some of the same structure and calls as my real program and shows the same behaviour.
>
> import numpy as N
> import scipy as S
>
> def stress():
>     x = S.randn(200000,80)
>     for i in range(8):
>         print "%(0)d" % {"0": i}
>         s = N.dot(x.T, x)
>         sd = N.array([s.diagonal()])
>         r = N.dot(N.ones((N.size(x,0),1),'d'), sd)
>         x = x + r
>         x = x / 1.01
>
> Two different symptoms appear, depending on how big x is:
> 1) The program becomes extremely slow after a few iterations.
This appears to be because you are overflowing your floating point variables. Once your data has INFs in it, it will tend to run much slower.
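To illustrate the overflow explanation, here is a minimal sketch rather than code from the thread. It assumes Python 3 with a current NumPy, uses a much smaller array than the original 200000 x 80, and the helper name first_nonfinite_pass is hypothetical. Each pass adds the column sums of squares to every element, so magnitudes are roughly squared on every pass and leave the float64 range within a handful of passes.

```python
import numpy as np

def first_nonfinite_pass(rows=1000, cols=4, passes=8, seed=0):
    """Return the 0-based pass at which x first contains inf/nan, or None."""
    x = np.random.RandomState(seed).randn(rows, cols)
    # Silence the overflow warnings; the overflow is what we want to observe.
    with np.errstate(over="ignore", invalid="ignore"):
        for i in range(passes):
            x += (x ** 2).sum(axis=0)   # same update as the loop body in the thread
            x /= 1.01
            if not np.isfinite(x).all():
                return i
    return None

hit = first_nonfinite_pass()
print("first non-finite pass:", hit)
```

If the real computation is not supposed to grow like this, checking np.isfinite(x).all() once per pass is a cheap way to find out where the values blow up.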
> 2) If the size of x is increased a little, the program fails with the message "MemoryError", for example at the line 'x = x + r', but at different places in the code depending on the matrix size and which computer I'm testing on. This might also occur after several iterations, not just during the first pass.
Why it would occur after several iterations I'm not sure. It's possible that there are some cycles that it takes a while for the garbage collector to get to, and in the meantime you are chewing through all of your memory. There are a couple of different things you could try to address that, but before you do, you need to clean up your algorithm and write it in idiomatic numpy. I realize that you said the above code is meaningless, but I'm going to assume that it's indicative of how your numpy code is written. It can be rewritten as:
def stress2(x):
    for i in range(8):
        print i
        x += (x**2).sum(axis=0)
        x /= 1.01
    return x.sum()
Not only is the above about sixty times faster, it's considerably clearer as well. FWIW, on my box, which has a very similar setup to yours, neither version throws a memory error.
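For anyone wondering why the two versions compute the same thing: the diagonal of dot(x.T, x) is just the column-wise sum of squares, and adding a length-n row to an (m, n) array broadcasts it over all rows, so neither the n x n product nor the ones() matrix is needed. A small check, assuming Python 3 and a current NumPy (the array size and variable names are only illustrative):

```python
import numpy as np

x = np.random.RandomState(0).randn(500, 6)

# Original formulation: full matrix product, then keep only the diagonal.
s = np.dot(x.T, x)
sd = np.array([s.diagonal()])                    # shape (1, 6)
r = np.dot(np.ones((x.shape[0], 1), 'd'), sd)    # shape (500, 6), every row equals sd
slow = (x + r) / 1.01

# Idiomatic formulation: column sums of squares, broadcast over the rows.
col_sq = (x ** 2).sum(axis=0)                    # shape (6,)
fast = (x + col_sq) / 1.01

print(np.allclose(s.diagonal(), col_sq))
print(np.allclose(slow, fast))
```

One difference to keep in mind: the rewritten function updates the array passed to it in place, so pass a copy if the original values are still needed afterwards.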
> I'm using Windows XP, ActivePython 2.5.1.1, NumPy 1.0.4, SciPy 0.6.0.
> Is there an error under the hood in NumPy?
Probably not in this case.
> Am I balancing on the edge of the performance of Python/NumPy, and should I consider other environments (Fortran, C, BLAS, LAPACK, etc.)?
Maybe, but try cleaning things up first.
> Am I misusing NumPy? Would changing my coding style be a good workaround and even work on larger datasets without errors?
Your code is doing a lot of extra work and creating a lot of temporaries. I'd clean it up before I did anything else.
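A rough accounting of those temporaries, as a sketch assuming float64 data and the 200000 x 80 shape from the original post: x, r and the result of x + r are each a full-size array of about 128 MB, so the statement 'x = x + r' needs three such buffers alive at once. The in-place version still creates one full-size temporary for x**2, but never more than two full-size buffers at a time. The second half of the sketch uses a small array to show that += reuses the existing buffer (Python 3, current NumPy; variable names are illustrative).

```python
import numpy as np

rows, cols = 200000, 80
full = rows * cols * np.dtype('d').itemsize   # bytes in one full-size float64 array
print("one full-size array: %.0f MB" % (full / 1e6))
print("x, r and x + r together: %.0f MB" % (3 * full / 1e6))

# Small demonstration that in-place operators do not allocate a new array.
x = np.ones((1000, cols))
row = np.arange(cols, dtype='d')

y = x + row                      # new array: a second buffer of the same size
address_before = x.ctypes.data
x += row                         # same buffer, updated in place
address_after = x.ctypes.data

print(np.shares_memory(x, y))
print(address_before == address_after)
```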
> Thanks in advance /Søren
participants (3)

Stefan van der Walt

Søren Dyrsting

Timothy Hochberg