[Numpy-discussion] finding close together points.

Christopher Barker Chris.Barker at noaa.gov
Thu Nov 12 12:32:50 EST 2009


David Cournapeau wrote:

> I would love having a core C library of containers - 

I'm all for that. However, I think that a very, very common need is 
simply for a growable numpy array.

It seems this would actually be pretty darn easy (again, for someone 
familiar with the code!).

IIUC, it would be exactly like a regular old ndarray, except:

  1) The data block could be larger than the current size of the array:
     - This would require an additional attribute to carry the 
allocated size of the block.

  2) There would be an "append" method (or some other name):
     - It would increase the size of the array, adjusting .dimensions 
and, if needed, re-allocating the data pointer.
     - This is not very complex code, really: you simply allocate more 
than required, by some (user-resettable?) growth factor: 25% more than 
before, or whatever. Over-allocating is what makes repeated appends 
amortized constant-time (it's why Python lists do the same). 
Re-allocating memory is pretty cheap these days anyway, so exactly how 
you do it isn't critical.

  3) We'd probably want a .fit() method that would resize the data 
block to fit the current size of the array.

In fact, as I write this, I wonder if these features could simply be 
added to the regular old ndarray -- if not used, they wouldn't be much 
overhead.

Issues:

  - Views on the same data block: since the data pointer may be 
re-allocated, you couldn't have views on the data block. Maybe this 
could be handled the same way ndarray.resize is handled now: you'd get 
an error if you try to resize an array that has a view into its data 
block (see the example after this list). This is a bit tricky, as it's 
really easy to create a view without thinking about it (by using 
ipython, for instance).

  - You could only grow the array in the first dimension (C-order, 
anyway). I think it could still be very useful, handling many of the 
common cases.

  - If you append to an array of more than one dimension, would you 
need to give an entire appended slice (a full row, say)? Again, 
restrictive, but still useful.

  - Others? -- I'm not thinking of any right now.
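For reference, ndarray.resize already behaves this way today -- it 
refuses to re-allocate in place while other references exist, which is 
roughly the check an appendable array would need:

import numpy as np

a = np.arange(10)
b = a[:5]          # b is a view into a's data block

try:
    a.resize(20)   # in-place resize would move the block out from under b
except ValueError as err:
    print(err)     # numpy refuses: the array is referenced by another array

(In ipython, even the implicit output-history references can trigger 
this check -- that's the "without thinking about it" problem above.)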


David Cournapeau wrote:
> If you need to manipulate
> core datatypes (float, etc...), I think python lists are a significant
> cost, because of pointer chasing in particular.

Exactly -- and add custom numpy dtypes as well. Putting Python types 
and/or tuples of Python types into a list can be far more 
memory-intensive than putting them in a numpy array.
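A rough illustration (exact numbers depend on the Python build, but on 
a typical 64-bit CPython it's on the order of 32+ bytes per element 
for the list vs. 8 for a float64 array):

import sys
import numpy as np

n = 100000
lst = [float(i) for i in range(n)]
arr = np.arange(n, dtype=np.float64)

# the list holds n pointers, each to a separate Python float object
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(list_bytes)    # pointer + float object per element
print(arr.nbytes)    # one contiguous block: 8 bytes per element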

That's why I built a Python implementation of a growable array. It 
does save memory, but it's a fair bit slower than a list: when 
appending from Python you've already got Python objects, so the 
overhead of the extra function call, the type conversion, etc. is 
substantial. That's why I'd like a C version.

Even more important is having one that is not only written in C (or 
Cython) but can also be accessed directly from C (or Cython), so you 
can fill an array with core data types efficiently. Going from C type 
to Python object just to put a value in a list is substantial overhead.
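For what it's worth, the stdlib's array.array is already a C-backed 
growable buffer for the builtin typecodes (1-D only, no numpy dtypes), 
and numpy can wrap it without a copy -- a partial workaround, not a 
substitute for a real growable ndarray:

import array
import numpy as np

buf = array.array('d')      # growable C buffer of C doubles
for i in range(1000):
    buf.append(i * 0.5)     # stored as raw doubles, not Python objects

a = np.frombuffer(buf, dtype=np.float64)   # zero-copy view of the buffer

# note: while `a` holds the buffer, appending to buf raises
# BufferError -- the same "no views while growing" problem as above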

> Stefan used some tricks to avoid using python list and use basic C
> structures in some custom code. 

Exactly -- and he's not the only one -- we're all re-inventing the 
wheel here.

By the way -- what does the fromstring/fromfile code do when you don't 
give it a size to start with?


-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov


