[Numpy-discussion] Dynamic array list implementation

Chris Barker chris.barker at noaa.gov
Thu Dec 24 13:19:32 EST 2015


On Wed, Dec 23, 2015 at 4:01 AM, Nicolas P. Rougier <
Nicolas.Rougier at inria.fr> wrote:
>
> Typed list in numpy would be a nice addition indeed and your cython
> implementation is nice (and small).
>

It seems we have a lot of duplicated effort here. Pernonally, I have two
needs:

1) ragged arrays
2) "growable" arrays.

I have semi-complete version of both of these, which are completely
independent -- not sure if it makes sense to combine them, I suppose not.

But we've talked a bit about  "typed list", and I'm not sure what that
means -- is it something that is entirely like a python list, except that
all the elements have the same type?

Anyway: I've been thinking about this fromt eh opposite direction: I want a
numpy array that you can append/extend. This comes from the fact that it's
not uncommon to need to build up an array where you don't know how large it
will be when you start. The common recommendation for doing that now is to
built it up in a list, and then, when you are done, turn it into an ndarray.

But that means you are limited to python types (or putting numpy scalars in
a list...), and it's not very memory efficient.

My version used a ndarray internally, and over allocates it a bit, using
ndarray.resize() to resize. this means that you can get the data pointer if
you want for Cython, etc... but also that it's getting re-allocated, so
that pointer is fragile, and you don't want other arrays to have views on
it.

Interestingly, if you are adding one float, for example, at a time to the
array, it's actually a bit faster to build it up in a list, and then make
an array out of it.

But it is more memory efficient and faster if you are using numpy dtypes
and especially if you are extend()ing it with chunks from other arrays.

I also have a not-quite finished version in Cython that statically handles
the core C data types -- that should be faster, but I haven't really
profiled it.

I'll try to get the code up on gitHub.

It would be nice to combine efforts.

-CHB
















> In my case I need to ensure a contiguous storage to allow easy upload onto
> the GPU.
> But my implementation is quite slow, especially when you add one item at a
> time:
>
> >>> python benchmark.py
> Python list, append 100000 items: 0.01161
> Array list, append 100000 items: 0.46854
> Array list, append 100000 items at once: 0.05801
> Python list, prepend 100000 items: 1.96168
> Array list, prepend 100000 items: 12.83371
> Array list, append 100000 items at once: 0.06002
>
>
>
> I realize I did not answer all Chris' questions:
>
> >>> L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
> >>> for item in L: print(item)
> [0]
> [1 2]
> [3 4 5]
> [6 7 8 9]
>
> >>> print (type(L.data))
> <class 'numpy.ndarray'>
> >>> print(L.data.dtype)
> int64
> >>> print(L.data.shape)
> (10,)
>
>
> I did not implement operations yet, but it would be a matter for
> transferring call to the underlying numpy data array.
> >>> L._data *= 2
> >>> print(L)
> [[0], [4 8], [12 16 20], [24 28 32 36]]
>
>
>
> > On 23 Dec 2015, at 09:34, Stephan Hoyer <shoyer at gmail.com> wrote:
> >
> > We have a type similar to this (a typed list) internally in pandas,
> although it is restricted to a single dimension and far from feature
> complete -- it only has .append and a .to_array() method for converting to
> a 1d numpy array. Our version is written in Cython, and we use it for
> performance reasons when we would otherwise need to create a list of
> unknown length:
> > https://github.com/pydata/pandas/blob/v0.17.1/pandas/hashtable.pyx#L99
> >
> > In my experience, it's several times faster than using a builtin list
> from Cython, which makes sense given that it needs to copy about 1/3 the
> data (no type or reference count for individual elements). Obviously, it
> uses 1/3 the space to store the data, too. We currently don't expose this
> object externally, but it could be an interesting project to adapt this
> code into a standalone project that could be more broadly useful.
> >
> > Cheers,
> > Stephan
> >
> >
> >
> > On Tue, Dec 22, 2015 at 8:20 PM, Chris Barker <chris.barker at noaa.gov>
> wrote:
> >
> > sorry for being so lazy as to not go look at the project pages, but....
> >
> > This sounds like it could be really useful, and maybe supercise a coupl
> eof half-baked projects of mine. But -- what does "dynamic" mean?
> >
> > - can you append to these arrays?
> > - can it support "ragged arrrays" -- it looks like it does.
> >
> > >>> L = ArrayList( [[0], [1,2], [3,4,5], [6,7,8,9]] )
> > >>> print(L)
> > [[0], [1 2], [3 4 5], [6 7 8 9]]
> >
> > so this looks like a ragged array -- but what do you get when you do:
> >
> > for row in L:
> >     print row
> >
> >
> > >>> print(L.data)
> > [0 1 2 3 4 5 6 7 8
> >
> > is .data a regular old 1-d numpy array?
> >
> > >>> L = ArrayList( np.arange(10), [3,3,4])
> > >>> print(L)
> > [[0 1 2], [3 4 5], [6 7 8 9]]
> > >>> print(L.data)
> > [0 1 2 3 4 5 6 7 8 9]
> >
> >
> > does an ArrayList act like a numpy array in other ways:
> >
> > L * 5
> >
> > L* some_array
> >
> > in which case, how does it do broadcasting???
> >
> > Thanks,
> >
> > -CHB
> >
> > >>> L = ArrayList(["Hello", "world", "!"])
> > >>> print(L[0])
> > 'Hello'
> > >>> L[1] = "brave new world"
> > >>> print(L)
> > ['Hello', 'brave new world', '!']
> >
> >
> >
> > Nicolas
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> >
> >
> >
> > --
> >
> > Christopher Barker, Ph.D.
> > Oceanographer
> >
> > Emergency Response Division
> > NOAA/NOS/OR&R            (206) 526-6959   voice
> > 7600 Sand Point Way NE   (206) 526-6329   fax
> > Seattle, WA  98115       (206) 526-6317   main reception
> >
> > Chris.Barker at noaa.gov
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20151224/1ffa417d/attachment.html>


More information about the NumPy-Discussion mailing list