[Numpy-discussion] numpy large arrays?

Wed Dec 12 14:40:24 EST 2007

On Dec 12, 2007 7:29 AM, Søren Dyrsting <sorendyrsting at gmail.com> wrote:

> Hi all
>
> I need to perform computations involving large arrays. A lot of rows and
> no more than e.g. 34 columns. My first choice is python/numpy because I'm
> already used to code in matlab.
>
> However I'm experiencing memory problems even though there is still 500 MB
> available (2 GB total). I have boiled my code down to the following
> meaningless snippet. This code shares some of the same structure and calls
> as my real program and shows the same behaviour.
>
> ********************************************************
> import numpy as N
> import scipy as S
>
> def stress():
>     x = S.randn(200000,80)
>     for i in range(8):
>         print "%(0)d" % {"0": i}
>         s = N.dot(x.T, x)
>         sd = N.array([s.diagonal()])
>         r = N.dot(N.ones((N.size(x,0),1),'d'), sd)
>         x = x + r
>         x = x / 1.01
>
> ********************************************************
>
>
> Two different symptoms depending on how big x is:
> 1) the program becomes extremely slow after a few iterations.

This appears to be because you are overflowing your floating point
variables. Once your data has INFs in it, it will tend to run much slower.
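
A minimal sketch of the overflow, assuming Python 3 and a current NumPy. The
function name and the small all-ones test array are made up for illustration
and stand in for the 200000 x 80 random data. Each pass adds the column sum
of squares to every element, so the magnitude is roughly squared (and scaled
by the row count) per pass, and the float64 maximum of about 1.8e308 is
exceeded after a handful of passes:

```python
import numpy as np

def first_nonfinite_iteration(x, n_iter=8):
    """Run the update from the thread; return the first iteration index
    at which x contains inf or nan, or None if it stays finite."""
    x = x.astype(float)  # astype copies, so the caller's array is untouched
    # Silence the overflow warnings; we detect the overflow ourselves.
    with np.errstate(over="ignore", invalid="ignore"):
        for i in range(n_iter):
            x += (x**2).sum(axis=0)  # add each column's sum of squares
            x /= 1.01
            if not np.isfinite(x).all():
                return i
    return None

# Hypothetical stand-in data, much smaller than the array in the thread.
x0 = np.ones((1000, 4))
overflow_at = first_nonfinite_iteration(x0)
print(overflow_at)
```

If the printed index is not None, the data went non-finite before the eight
passes finished, which is the situation described above.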

>
> 2) if the size of x is increased a little the program fails with the
> message "MemoryError", for example at line 'x = x + r', but at different
> places in the code depending on the matrix size and which computer I'm
> testing on. This might also occur after several iterations, not just during
> the first pass.

Why it would occur after several iterations I'm not sure. It's possible that
there are some reference cycles that the garbage collector takes a while to
get to, and in the meantime you are chewing through all of your memory. There
are a couple of different things you could try to address that, but before
you do, you need to clean up your algorithm and write it in idiomatic numpy.
I realize that you said the above code is meaningless, but I'm going to
assume that it's indicative of how your numpy code is written. That can be
rewritten as:

    def stress2(x):
        for i in range(8):
            print(i)
            # Column sums of squares, broadcast over every row, added in place.
            x += (x**2).sum(axis=0)
            x /= 1.01
        return x.sum()

Not only is the above about sixty times faster, it's considerably clearer as
well. FWIW, on my box, which has a very similar setup to yours, neither
version throws a memory error.
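
To see that the two formulations compute the same update, here is a small
check, written for Python 3 and a current NumPy. The seed, the array shape
and the variable names are illustrative assumptions, not part of the original
code. The diagonal of dot(x.T, x) is the per-column sum of squares, and
broadcasting a length-n_cols vector against x does the job of multiplying by
a column of ones:

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, illustrative data only
x = rng.standard_normal((500, 6))

# Original formulation: full Gram matrix, keep its diagonal, tile over rows.
s = np.dot(x.T, x)                                 # shape (6, 6)
sd = np.array([s.diagonal()])                      # shape (1, 6)
r = np.dot(np.ones((x.shape[0], 1), 'd'), sd)      # shape (500, 6)

# Broadcasting formulation: column sums of squares, shape (6,).
col_sq = (x**2).sum(axis=0)

same_diag = np.allclose(sd[0], col_sq)
same_update = np.allclose(x + r, x + col_sq)
print(same_diag, same_update)
```

One difference worth keeping in mind: stress2 updates x in place, so the
array passed in by the caller is modified, whereas the original rebinds x to
a new array on every pass.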

>
> I'm using Windows XP, ActivePython 2.5.1.1, NumPy 1.0.4, SciPy  0.6.0.
>
> - Is there an error under the hood in NumPy?

Probably not in this case.

>
> - Am I balancing on the edge of the performance of Python/NumPy, and should
> I consider other environments (Fortran, C, BLAS, LAPACK, etc.)?

Maybe, but try cleaning things up first.

>
> - Am I misusing NumPy? Would changing coding style be a good workaround, and
> even work on larger datasets without errors?

Your code is doing a lot of extra work and creating a lot of temporaries.
I'd clean it up before I did anything else.
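
A back-of-the-envelope sketch of where the temporaries come from, assuming
float64 data (8 bytes per element), Python 3 and a current NumPy. The helper
name is hypothetical. In the original loop, r, x + r and x / 1.01 are each a
freshly allocated array the same size as x, while the in-place operators
reuse x's buffer:

```python
import numpy as np

def array_megabytes(rows, cols, itemsize=8):
    """Size in MB (10**6 bytes) of a dense rows x cols array."""
    return rows * cols * itemsize / 1e6

full_mb = array_megabytes(200000, 80)   # size of x, and of each full-size temporary
gram_mb = array_megabytes(80, 80)       # dot(x.T, x) is tiny by comparison
print(full_mb, gram_mb)

# In-place operators keep working in the same array object ...
x = np.ones((5, 3))
same_buffer = x
x += (x**2).sum(axis=0)
x /= 1.01
in_place = same_buffer is x

# ... whereas the out-of-place form allocates a new array every time.
y = x + 1.0
is_new = not np.shares_memory(x, y)
```

Note that (x**2) still creates one full-size temporary per pass, so the
in-place version reduces the number of large allocations rather than
eliminating them.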

>
>
> Thanks in advance
> /Søren

-- 
.  __
.   |-\
.
.  tim.hochberg at ieee.org