[Numpy-discussion] Appending data to a big ndarray

Anne Archibald peridot.faceted at gmail.com
Sat Aug 9 17:28:07 EDT 2008


2008/8/8 oc-spam66 <oc-spam66 at laposte.net>:
> Hello,
>
> I would like to build a big ndarray by adding rows progressively.
>
> I considered the following functions : append, concatenate, vstack and the
> like.
> It appears to me that they all create a new array (which requires twice the
> memory).
>
> Is there a method for just adding a row to a ndarray without duplicating the
> data ?

Since ndarrays must be contiguous in memory, at the end of the day you
must have a contiguous block big enough to contain the whole array.
If you can allocate this right away, life is easy. All you have to do
is np.zeros((guess_rows,columns)) and you get a big empty array. But
it's probably safe to assume that if you could predict how big the
array would be, you wouldn't have asked this question.
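For completeness, the preallocate-and-trim route looks roughly like
this, assuming you can at least overestimate the row count (the shapes
and the loop below are invented placeholders):

    import numpy as np

    guess_rows, n_cols = 1000, 3         # deliberate overestimate of the rows
    buf = np.zeros((guess_rows, n_cols))

    n = 0
    for i in range(250):                 # stand-in for your real row source
        buf[n] = np.arange(n_cols) * i
        n += 1

    data = buf[:n]                       # view of the filled part, no copy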

If you *don't* know how big the array needs to be, then you have
basically two options:
* use a list of row arrays while growing and construct the final array
at the end
* keep enlarging the array as needed

Keeping a list is actually my preferred solution: each row is
contiguous, but the collection as a whole is not. Python lists expand
automatically, conveniently, and efficiently, and the row data is not
copied while you grow. At the end, though, you need to copy all those
rows into one big array. While easy - just use np.array() - this does
briefly require twice the virtual memory, and could be slow.
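A rough sketch of that list-then-convert approach (the loop is just a
stand-in for however you actually produce rows):

    import numpy as np

    rows = []
    for i in range(250):                 # stand-in for your real row source
        rows.append(np.arange(3) * i)    # each row is a small contiguous array
    data = np.array(rows)                # one big copy at the very end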

Enlarging the array means, potentially, a slow copy every time it
grows. If you use the resize() method Travis suggested, it may be
that these resizings happen without copies. For small arrays, they
almost certainly will require copies, since the array will tend to
outgrow memory arenas. But, on Linux at least (and probably on other
modern OSes), large arrays live in chunks of virtual memory requested
from the system directly. If you are lucky and no other chunk of
memory is allocated immediately afterward, it should be possible to
enlarge this memory by just adding more pages onto the end. If not,
though, you get a big copy. You also, of course, have a trade-off
between using more space than you need and having (potentially) lots
of copies.
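If you do go the resize() route, something along these lines (my own
sketch, again with invented shapes and a placeholder loop) keeps the
number of reallocations down by doubling the buffer whenever it fills:

    import numpy as np

    n_cols = 3
    a = np.zeros((16, n_cols))           # start with a small buffer
    n = 0
    for i in range(250):                 # stand-in for your real row source
        if n == a.shape[0]:
            # Double the allocation; this may copy, but only O(log n) times.
            # refcheck=False skips the reference-count check that makes
            # resize() raise in interactive sessions; only safe if nothing
            # else holds a reference to the array's data.
            a.resize((2 * a.shape[0], n_cols), refcheck=False)
        a[n] = np.arange(n_cols) * i
        n += 1
    data = a[:n]                         # view of the filled rows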

You can try this, but sadly, I think if your array is within a factor
of two of the size of available virtual memory, you're going to be
disappointed with numpy. Very many operations require a large
temporary array. I recommend just going with something simple and
living with a single big copy.

Anne


