[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

mark florisson markflorisson88 at gmail.com
Mon May 14 12:31:47 EDT 2012

On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
> On 05/11/2012 03:37 PM, mark florisson wrote:
>> On 11 May 2012 12:13, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>
>>  wrote:
>>> (NumPy devs: I know, I get too many ideas. But this time I *really*
>>> believe
>>> in it, I think this is going to be *huge*. And if Mark F. likes it it's
>>> not
>>> going to be without manpower; and as his mentor I'd pitch in too here and
>>> there.)
>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly
>>> don't
>>> want to micro-manage your GSoC, just have your take.)
>>> Travis, thank you very much for those good words in the "NA-mask
>>> interactions..." thread. It put most of my concerns away. If anybody is
>>> leaning towards for opaqueness because of its OOP purity, I want to refer
>>> to
>>> C++ and its walled-garden of ideological purity -- it has, what, 3-4
>>> different OOP array libraries, neither of which is able to out-compete
>>> the
>>> other. Meanwhile the rest of the world happily cooperates using pointers,
>>> strides, CSR and CSC.
>>> Now, there are limits to what you can do with strides and pointers.
>>> Noone's
>>> denying the need for more. In my mind that's an API where you can do
>>> fetch_block and put_block of cache-sized, N-dimensional blocks on an
>>> array;
>>> but it might be something slightly different.
>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the
>>> NumPy C
>>> API to deal with this issue.
>>> What we need is duck-typing/polymorphism at the C level. If you keep
>>> extending ndarray and the NumPy C API, what we'll have is a one-to-many
>>> relationship: One provider of array technology, multiple consumers (with
>>> hooks, I'm sure, but all implementations of the hook concept in the NumPy
>>> world I've seen so far are a total disaster!).
>>> What I think we need instead is something like PEP 3118 for the
>>> "abstract"
>>> array that is only available block-wise with getters and setters. On the
>>> Cython list we've decided that what we want for CEP 1000 (for boxing
>>> callbacks etc.) is to extend PyTypeObject with our own fields; we could
>>> create CEP 1001 to solve this issue and make any Python object an
>>> exporter
>>> of "block-getter/setter-arrays" (better name needed).
>>> What would be exported is (of course) a simple vtable:
>>> typedef struct {
>>>    int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right,
>>> ...);
>>>    ...
>>> } block_getter_setter_array_vtable;
>>> Let's please discuss the details *after* the fundamentals. But the reason
>>> I
>>> put void* there instead of PyObject* is that I hope this could be used
>>> beyond the Python world (say, Python<->Julia); the void* would be handed
>>> to
>>> you at the time you receive the vtable (however we handle that).
>> I suppose it would also be useful to have some way of predicting the
>> output format polymorphically for the caller. E.g. dense *
>> block_diagonal results in block diagonal, but dense + block_diagonal
>> results in dense, etc. It might be useful for the caller to know
>> whether it needs to allocate a sparse, dense or block-structured
>> array. Or maybe the polymorphic function could even do the allocation.
>> This needs to happen recursively of course, to avoid intermediate
>> temporaries. The compiler could easily handle that, and so could numpy
>> when it gets lazy evaluation.
> Ah. But that depends too on the computation to be performed too; a)
> elementwise, b) axis-wise reductions, c) linear algebra...
> In my oomatrix code (please don't look at it, it's shameful) I do this using
> multiple dispatch.
> I'd rather ignore this for as long as we can, only implementing "a[:] = ..."
> -- I can't see how decisions here would trickle down to the API that's used
> in the kernel, it's more like a pre-phase, and better treated orthogonally.
>> I think if the heavy lifting of allocating output arrays and exporting
>> these arrays work in numpy, then support in Cython could use that (I
>> can already hear certain people object to more complicated array stuff
>> in Cython :). Even better here would be an external project that each
>> our projects could use (I still think the nditer sorting functionality
>> of arrays should be numpy-agnostic and externally available).
> I agree with the separate project idea. It's trivial for NumPy to
> incorporate that as one of its methods for exporting arrays, and I don't
> think it makes sense to either build it into Cython, or outright depend on
> NumPy.
> Here's what I'd like (working title: NumBridge?).
>  - Mission: Be the "double* + shape + strides" in a world where that is no
> longer enough, by providing tight, focused APIs/ABIs that are usable across
> C/Fortran/Python.
> I basically want something I can quickly acquire from a NumPy array, then
> pass it into my C code without dragging along all the cruft that I don't
> need.
>  - Written in pure C + specs, usable without Python
>  - PEP 3118 "done right", basically semi-standardize the internal Cython
> memoryview ABI and get something that's passable on stack
>  - Get block get/put API
>  - Iterator APIs
>  - Utility code for exporters and clients (iteration code, axis reordering,
> etc.)
> Is the scope of that insane, or is it at least worth a shot to see how bad
> it is? Beyond figuring out a small subset that can be done first, and
> whether performance considerations must be taken or not, there's two
> complicating factors: Pluggable dtypes, memory management. Perhaps you could
> come to Oslo for a couple of days to brainstorm...
> Dag

The blocks are a good idea, but I think fairly complicated for
complicated matrix layouts. It would be nice to have something that is
reasonably efficient for at least most of the array storage
I'm going to do a little brain dump below, let's see if anything is useful :)

What if we basically take the CSR format and make it a little simpler,
easier to handle, and better suited for other layouts. Basically, keep
shape/strides for dense arrays, and for blocked arrays just "extend"
your number of dimensions, i.e. a 2D blocked array becomes a 4D array,
something like this:

>>> a = np.arange(4).repeat(4).reshape(4, 4);
>>> a
array([[0, 0, 0, 0],
           [1, 1, 1, 1],
           [2, 2, 2, 2],
           [3, 3, 3, 3]])

>>> a.shape = (2, 2, 2, 2)
>>> itemsize = a.dtype.itemsize
>>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>>> a
array([[[[0, 0],
         [1, 1]],

        [[0, 0],
         [1, 1]]],

       [[[2, 2],
         [3, 3]],

        [[2, 2],
         [3, 3]]]])

>>> print a.flatten()
[0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]

Now, given some buffer flag (PyBUF_Sparse or something), use basically
suboffsets with indirect dimensions, where only ever the last
dimension is a row of contiguous memory (the entire thing may be
contiguous, but the point is that you don't care). This row may

    - be variable sized
    - have either a single "column start" (e.g. diagonal, band/tri-
diagonal, block diagonal, etc), OR
    - a list of column indices, the length of the row (like in the CSR format)

The length of each innermost row is then determined by looking at, in order:
    - the extent as specified in the shape list
    - if -1, and some flag is set, the extent is determined like CSR,
i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
        -> maybe instead of pointers indices are better, for
serialization, GPUs, etc
    - otherwise, use either a function pointer or perhaps a list of extents

All these details will obviously be abstracted, allowing for easy
iteration, but it can also be used by ufuncs operating on contiguous
rows (often, since the actual storage is contiguous and stored in a 1D
array, some flag could indicate contiguity as an optimization for
unary ufuncs and flat iteration). Tiled nditer-ation could also work
here without too big a hassle.
When you slice, you add to the suboffset and manipulate the single
extent or entire list of extents in that dimension, and the flag to
determine the length using the pointer subtraction should be cleared
(this should probably all happen through vtable functions).

An exporter would also be free to use different malloced pointers,
allowing variable sized array support with append/pop functionality in
multiple dimensions (if there are no active buffer views).

Random access in the case where a column start is provided is still
contant time, and done though a[i][j][k - colstart], unless you have
discontiguous rows, in which case you are allowed a logarithmic search
(if the size exceeds some threshold). I see scipy.sparse does a linear
search, which is pretty slow in pathological cases:

from scipy import sparse
a = np.random.random((1, 4000000))
b = sparse.csr_matrix(a)
%timeit a[0, 1000]
%timeit b[0, 1000]

10000000 loops, best of 3: 190 ns per loop
10 loops, best of 3: 29.3 ms per loop

Now, depending on the density and operation, the caller may have some
idea of how to allocate an output array. I'm not sure how to handle
"insertions" of new elements, but I presume through vtable put/insert
functions. I'm also not sure how to fit this in with linear algebra
functionality, other than providing conversions of the view.

I think a disadvantage of this scheme is that you can't reorder your
axes anymore, and many operations that are easy in the dense case are
suddenly harder, e.g. this scheme does not allow you to go from a
csr-like format into csc. But I think what this gives is reasonable
generality to allow easy use in C/Fortran/Cython compiled code/numpy
ufunc invocation, as well as allowing efficient-ish storage for
various kinds of arrays.

Does this make any sense?

More information about the NumPy-Discussion mailing list