[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

mark florisson markflorisson88 at gmail.com
Tue May 15 06:44:55 EDT 2012


On 14 May 2012 21:54, David Cournapeau <cournape at gmail.com> wrote:
>
>
> On Mon, May 14, 2012 at 5:31 PM, mark florisson <markflorisson88 at gmail.com>
> wrote:
>>
>> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no>
>> wrote:
>> > On 05/11/2012 03:37 PM, mark florisson wrote:
>> >>
>> >> On 11 May 2012 12:13, Dag Sverre Seljebotn
>> >> <d.s.seljebotn at astro.uio.no> wrote:
>> >>>
>> >>> (NumPy devs: I know, I get too many ideas. But this time I *really*
>> >>> believe in it, I think this is going to be *huge*. And if Mark F.
>> >>> likes it it's not going to be without manpower; and as his mentor
>> >>> I'd pitch in too here and there.)
>> >>>
>> >>> (Mark F.: I believe this is *very* relevant to your GSoC. I
>> >>> certainly don't want to micro-manage your GSoC, just have your
>> >>> take.)
>> >>>
>> >>> Travis, thank you very much for those good words in the "NA-mask
>> >>> interactions..." thread. It put most of my concerns to rest. If
>> >>> anybody is leaning towards opaqueness because of its OOP purity, I
>> >>> want to refer to C++ and its walled garden of ideological purity --
>> >>> it has, what, 3-4 different OOP array libraries, none of which is
>> >>> able to out-compete the others. Meanwhile the rest of the world
>> >>> happily cooperates using pointers, strides, CSR and CSC.
>> >>>
>> >>> Now, there are limits to what you can do with strides and pointers.
>> >>> No one's denying the need for more. In my mind that's an API where
>> >>> you can do fetch_block and put_block of cache-sized, N-dimensional
>> >>> blocks on an array; but it might be something slightly different.
>> >>>
>> >>> Here's what I'm asking: DO NOT simply keep extending ndarray and
>> >>> the NumPy C API to deal with this issue.
>> >>>
>> >>> What we need is duck-typing/polymorphism at the C level. If you keep
>> >>> extending ndarray and the NumPy C API, what we'll have is a
>> >>> one-to-many relationship: one provider of array technology, multiple
>> >>> consumers (with hooks, I'm sure, but all implementations of the hook
>> >>> concept in the NumPy world I've seen so far are a total disaster!).
>> >>>
>> >>> What I think we need instead is something like PEP 3118 for the
>> >>> "abstract" array that is only available block-wise with getters and
>> >>> setters. On the Cython list we've decided that what we want for CEP
>> >>> 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our
>> >>> own fields; we could create CEP 1001 to solve this issue and make
>> >>> any Python object an exporter of "block-getter/setter-arrays"
>> >>> (better name needed).
>> >>>
>> >>> What would be exported is (of course) a simple vtable:
>> >>>
>> >>> typedef struct {
>> >>>     int (*get_block)(void *ctx, ssize_t *upper_left,
>> >>>                      ssize_t *lower_right, ...);
>> >>>     ...
>> >>> } block_getter_setter_array_vtable;
>> >>>
>> >>> Let's please discuss the details *after* the fundamentals. But the
>> >>> reason I put void* there instead of PyObject* is that I hope this
>> >>> could be used beyond the Python world (say, Python<->Julia); the
>> >>> void* would be handed to you at the time you receive the vtable
>> >>> (however we handle that).
>> >>
>> >>
>> >> I suppose it would also be useful to have some way of predicting the
>> >> output format polymorphically for the caller. E.g. dense *
>> >> block_diagonal results in block diagonal, but dense + block_diagonal
>> >> results in dense, etc. It might be useful for the caller to know
>> >> whether it needs to allocate a sparse, dense or block-structured
>> >> array. Or maybe the polymorphic function could even do the allocation.
>> >> This needs to happen recursively of course, to avoid intermediate
>> >> temporaries. The compiler could easily handle that, and so could numpy
>> >> when it gets lazy evaluation.
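>> >>
>> >> (For instance, a toy sketch of such a format-promotion table in
>> >> Python -- the operation and format names are invented for
>> >> illustration:)
>> >>
>> >> PROMOTE = {
>> >>     ('mul', 'dense', 'block_diag'): 'block_diag',
>> >>     ('add', 'dense', 'block_diag'): 'dense',
>> >>     ('add', 'csr', 'csr'): 'csr',
>> >> }
>> >>
>> >> def result_format(op, fmt_a, fmt_b):
>> >>     # Fall back to dense when nothing sparser can be guaranteed.
>> >>     return PROMOTE.get((op, fmt_a, fmt_b), 'dense')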
>> >
>> >
>> > Ah. But that depends on the computation to be performed too: a)
>> > elementwise, b) axis-wise reductions, c) linear algebra...
>> >
>> > In my oomatrix code (please don't look at it, it's shameful) I do
>> > this using multiple dispatch.
>> >
>> > I'd rather ignore this for as long as we can, only implementing
>> > "a[:] = ..." -- I can't see how decisions here would trickle down to
>> > the API that's used in the kernel; it's more like a pre-phase, and
>> > better treated orthogonally.
>> >
>> >
>> >> I think if the heavy lifting of allocating output arrays and
>> >> exporting these arrays works in numpy, then support in Cython could
>> >> use that (I can already hear certain people object to more
>> >> complicated array stuff in Cython :). Even better here would be an
>> >> external project that each of our projects could use (I still think
>> >> the nditer sorting functionality of arrays should be numpy-agnostic
>> >> and externally available).
>> >
>> >
>> > I agree with the separate project idea. It's trivial for NumPy to
>> > incorporate that as one of its methods for exporting arrays, and I
>> > don't think it makes sense to either build it into Cython or
>> > outright depend on NumPy.
>> >
>> > Here's what I'd like (working title: NumBridge?).
>> >
>> >  - Mission: Be the "double* + shape + strides" in a world where
>> > that is no longer enough, by providing tight, focused APIs/ABIs
>> > that are usable across C/Fortran/Python.
>> >
>> > I basically want something I can quickly acquire from a NumPy
>> > array, then pass it into my C code without dragging along all the
>> > cruft that I don't need.
>> >
>> >  - Written in pure C + specs, usable without Python
>> >
>> >  - PEP 3118 "done right", basically semi-standardize the internal Cython
>> > memoryview ABI and get something that's passable on stack
>> >
>> >  - Block get/put API
>> >
>> >  - Iterator APIs
>> >
>> >  - Utility code for exporters and clients (iteration code, axis
>> > reordering, etc.)
>> >
>> > Is the scope of that insane, or is it at least worth a shot to see
>> > how bad it is? Beyond figuring out a small subset that can be done
>> > first, and whether performance considerations must be taken into
>> > account or not, there are two complicating factors: pluggable
>> > dtypes and memory management. Perhaps you could come to Oslo for a
>> > couple of days to brainstorm...
>> >
>> > Dag
>>
>> The blocks are a good idea, but I think they get fairly complicated
>> for more involved matrix layouts. It would be nice to have something
>> that is reasonably efficient for at least most of the array storage
>> mechanisms. I'm going to do a little brain dump below, let's see if
>> anything is useful :)
>>
>> What if we basically take the CSR format and make it a little simpler,
>> easier to handle, and better suited for other layouts. Basically, keep
>> shape/strides for dense arrays, and for blocked arrays just "extend"
>> your number of dimensions, i.e. a 2D blocked array becomes a 4D array,
>> something like this:
>>
>> >>> a = np.arange(4).repeat(4).reshape(4, 4)
>> >>> a
>> array([[0, 0, 0, 0],
>>        [1, 1, 1, 1],
>>        [2, 2, 2, 2],
>>        [3, 3, 3, 3]])
>>
>> >>> a.shape = (2, 2, 2, 2)
>> >>> itemsize = a.dtype.itemsize
>> >>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>> >>> a
>> array([[[[0, 0],
>>          [1, 1]],
>>
>>         [[0, 0],
>>          [1, 1]]],
>>
>>
>>        [[[2, 2],
>>          [3, 3]],
>>
>>         [[2, 2],
>>          [3, 3]]]])
>>
>> >>> print a.flatten()
>> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>>
>> Now, given some buffer flag (PyBUF_Sparse or something), use basically
>> suboffsets with indirect dimensions, where only the last dimension is
>> ever a row of contiguous memory (the entire thing may be contiguous,
>> but the point is that you don't care). This row may
>>
>>    - be variable sized
>>    - have either a single "column start" (e.g. diagonal,
>> band/tri-diagonal, block diagonal, etc.), OR
>>    - have a list of column indices, of the same length as the row
>> (like in the CSR format)
>>
>> The length of each innermost row is then determined by looking at, in
>> order (see the sketch below):
>>    - the extent as specified in the shape list
>>    - if that is -1, and some flag is set, the extent is determined as
>> in CSR, i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>>        -> maybe indices are better than pointers here, for
>> serialization, GPUs, etc.
>>    - otherwise, use either a function pointer or perhaps a list of
>> extents
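>>
>> A rough Python sketch of those three rules, in order (argument names
>> invented):
>>
>> def row_extent(shape_extent, i, indptr=None, extents=None):
>>     if shape_extent != -1:
>>         return shape_extent               # fixed extent from the shape
>>     if indptr is not None:                # the CSR-style flag is set
>>         return indptr[i + 1] - indptr[i]  # difference of row starts
>>     return extents[i]                     # explicit per-row extents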
>>
>> All these details will obviously be abstracted, allowing for easy
>> iteration, but it can also be used by ufuncs operating on contiguous
>> rows (often, since the actual storage is contiguous and stored in a
>> 1D array, some flag could indicate contiguity as an optimization for
>> unary ufuncs and flat iteration). Tiled nditer-ation could also work
>> here without too big a hassle.
>> When you slice, you add to the suboffset and manipulate the single
>> extent or the entire list of extents in that dimension, and the flag
>> that determines the length via pointer subtraction should be cleared
>> (this should probably all happen through vtable functions).
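>>
>> Roughly, for the innermost dimension (a Python sketch with invented
>> field names):
>>
>> LENGTH_FROM_PTR_DIFF = 0x1  # hypothetical flag from above
>>
>> def slice_last_dim(view, start, stop):
>>     # Shift the suboffset into every row, shrink the per-row extents,
>>     # and clear the pointer-subtraction flag: lengths are explicit now.
>>     view.suboffset += start * view.itemsize
>>     view.extents = [max(0, min(e, stop) - start) for e in view.extents]
>>     view.flags &= ~LENGTH_FROM_PTR_DIFF
>>     return view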
>>
>> An exporter would also be free to use different malloced pointers,
>> allowing variable sized array support with append/pop functionality in
>> multiple dimensions (if there are no active buffer views).
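>>
>> (A toy Python picture of that freedom -- one allocation per innermost
>> row:)
>>
>> rows = [[0.0, 1.0], [2.0]]   # each row separately allocated
>> rows[1].append(3.0)          # grow one row without copying the rest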
>>
>> Random access in the case where a column start is provided is still
>> constant time, and done through a[i][j][k - colstart], unless you have
>> discontiguous rows, in which case you are allowed a logarithmic search
>> (if the size exceeds some threshold). I see scipy.sparse does a linear
>> search, which is pretty slow in pathological cases:
>>
>> import numpy as np
>> from scipy import sparse
>> a = np.random.random((1, 4000000))
>> b = sparse.csr_matrix(a)
>> %timeit a[0, 1000]
>> %timeit b[0, 1000]
>>
>> 10000000 loops, best of 3: 190 ns per loop
>> 10 loops, best of 3: 29.3 ms per loop
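>>
>> With sorted column indices, the same access only needs a binary
>> search. A sketch using the real scipy CSR attributes
>> indptr/indices/data:
>>
>> import numpy as np
>>
>> def csr_getitem(m, i, j):
>>     # Random access in O(log nnz_row) instead of a linear scan.
>>     start, end = m.indptr[i], m.indptr[i + 1]
>>     k = start + np.searchsorted(m.indices[start:end], j)
>>     if k < end and m.indices[k] == j:
>>         return m.data[k]
>>     return 0.0
>>
>> so csr_getitem(b, 0, 1000) finds a[0, 1000] in time logarithmic in
>> the number of nonzeros of the row.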
>>
>> Now, depending on the density and operation, the caller may have some
>> idea of how to allocate an output array. I'm not sure how to handle
>> "insertions" of new elements, but I presume through vtable put/insert
>> functions. I'm also not sure how to fit this in with linear algebra
>> functionality, other than providing conversions of the view.
>>
>> I think a disadvantage of this scheme is that you can't reorder your
>> axes anymore, and many operations that are easy in the dense case are
>> suddenly harder, e.g. this scheme does not allow you to go from a
>> csr-like format into csc. But I think what this gives is reasonable
>> generality to allow easy use in C/Fortran/Cython compiled code/numpy
>> ufunc invocation, as well as allowing efficient-ish storage for
>> various kinds of arrays.
>>
>> Does this make any sense?
>
>
> I think it does, although it is not clear to me how it would
> generalize to more than 2 dimensions (i.e. how would you handle a 3d
> sparse array). Would you add more levels of indirection?

Yes exactly, it extends to an arbitrary number of dimensions. The
column start or column indices in my example were only in the last
dimension, but the same mechanism applies to each respective dimension.
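
To make that concrete, a hypothetical Python sketch (invented names) of
random access into a 3d layout where the innermost column start is
stored per (i, j) row:

def getitem3d(rows, start, i, j, k):
    # rows[i][j] is the contiguous innermost row at (i, j); start[i][j]
    # is its first stored column, so access subtracts the per-row
    # column start, just like a[i][j][k - colstart] in 2d.
    row = rows[i][j]
    off = k - start[i][j]
    if 0 <= off < len(row):
        return row[off]
    return 0.0  # outside the stored band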

> I have been thinking a bit about related concepts in the context of
> generalized sparse arrays (which could be a first step toward numpy
> arrays on top of multiple malloc blocks instead of just one), and one
> of the solutions I have seen is generalized b-trees/b*-trees. The
> main appeal to me is that the underlying storage is still
> one-dimensional, and the "magic" happens "only" in the indexing. This
> has several advantages:
>
>   - getting the low-level block algorithms right is a solved problem
>   - it allows for growable arrays in O(log(N)) instead of O(N)
>   - one can hope to get near-optimal performance in the common cases
> (1 and 2d) with degenerate indexing functions.
>
> Two of the references I have been looking at:
>   - UB-tree: http://www.scholarpedia.org/article/B-tree_and_UB-tree
>   - "Storing Matrices on Disk: Theory and Practice Revisited" (Zhang,
> Yi, Munagala, Kamesh and Yang, Jun)
>
> I have been meaning to implement some of those ideas and do some
> basic benchmarks (especially to compare against CSR and CSC in 2
> dimensions, in which case one could imagine building the index
> function in such a way that range queries map to contiguous blocks of
> memory).
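>
> (For illustration: the UB-tree's index function is the Z-order/Morton
> interleaving of the coordinates -- a toy 2d version in Python:)
>
> def morton_key(i, j, bits=16):
>     # Interleave the bits of i and j; nearby (i, j) pairs get nearby
>     # keys, so rectangular range queries touch few key ranges.
>     key = 0
>     for b in range(bits):
>         key |= ((i >> b) & 1) << (2 * b + 1)
>         key |= ((j >> b) & 1) << (2 * b)
>     return key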

Interesting, thanks for the links.

> cheers,
>
> David
>