[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Mon May 14 16:36:09 EDT 2012


On 05/14/2012 06:31 PM, mark florisson wrote:
> On 12 May 2012 22:55, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>  wrote:
>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>>
>>> On 11 May 2012 12:13, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>
>>>   wrote:
>>>>
>>>> (NumPy devs: I know, I get too many ideas. But this time I *really*
>>>> believe in it; I think this is going to be *huge*. And if Mark F.
>>>> likes it, it's not going to be without manpower; as his mentor I'd
>>>> pitch in here and there too.)
>>>>
>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly
>>>> don't want to micro-manage your GSoC, just to have your take.)
>>>>
>>>> Travis, thank you very much for those good words in the "NA-mask
>>>> interactions..." thread. It put most of my concerns to rest. If anybody
>>>> is leaning towards opaqueness because of its OOP purity, I want to refer
>>>> to C++ and its walled garden of ideological purity -- it has, what, 3-4
>>>> different OOP array libraries, none of which is able to out-compete the
>>>> others. Meanwhile the rest of the world happily cooperates using
>>>> pointers, strides, CSR and CSC.
>>>>
>>>> Now, there are limits to what you can do with strides and pointers.
>>>> No one's denying the need for more. In my mind that's an API where you
>>>> can do fetch_block and put_block of cache-sized, N-dimensional blocks
>>>> on an array; but it might be something slightly different.
>>>>
>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the
>>>> NumPy C
>>>> API to deal with this issue.
>>>>
>>>> What we need is duck-typing/polymorphism at the C level. If you keep
>>>> extending ndarray and the NumPy C API, what we'll have is a one-to-many
>>>> relationship: One provider of array technology, multiple consumers (with
>>>> hooks, I'm sure, but all implementations of the hook concept in the NumPy
>>>> world I've seen so far are a total disaster!).
>>>>
>>>> What I think we need instead is something like PEP 3118 for the
>>>> "abstract"
>>>> array that is only available block-wise with getters and setters. On the
>>>> Cython list we've decided that what we want for CEP 1000 (for boxing
>>>> callbacks etc.) is to extend PyTypeObject with our own fields; we could
>>>> create CEP 1001 to solve this issue and make any Python object an
>>>> exporter
>>>> of "block-getter/setter-arrays" (better name needed).
>>>>
>>>> What would be exported is (of course) a simple vtable:
>>>>
>>>> typedef struct {
>>>>     /* fetch the block spanning [upper_left, lower_right);
>>>>        further arguments and members TBD */
>>>>     int (*get_block)(void *ctx, ssize_t *upper_left,
>>>>                      ssize_t *lower_right, ...);
>>>>     ...
>>>> } block_getter_setter_array_vtable;
>>>>
>>>> Let's please discuss the details *after* the fundamentals. But the
>>>> reason I put void* there instead of PyObject* is that I hope this
>>>> could be used beyond the Python world (say, Python<->Julia); the
>>>> void* would be handed to you at the time you receive the vtable
>>>> (however we handle that).
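
To pin down the semantics, here is a rough Python-level analogue of what
the vtable is meant to let you do. All names are invented, and the real
thing would of course be plain C:

import numpy as np

class BlockExporter:
    # illustrative only: expose an array exclusively through block-wise
    # getters/setters rather than through a raw data pointer
    def __init__(self, data):
        self.data = data

    def get_block(self, upper_left, lower_right, out):
        # copy the region [upper_left, lower_right) into a caller buffer
        region = tuple(slice(a, b) for a, b in zip(upper_left, lower_right))
        out[...] = self.data[region]
        return 0

    def put_block(self, upper_left, lower_right, block):
        region = tuple(slice(a, b) for a, b in zip(upper_left, lower_right))
        self.data[region] = block
        return 0

exporter = BlockExporter(np.arange(16.0).reshape(4, 4))
buf = np.empty((2, 2))
exporter.get_block((0, 0), (2, 2), buf)   # fetch the top-left 2x2 block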
>>>
>>>
>>> I suppose it would also be useful to have some way of predicting the
>>> output format polymorphically for the caller. E.g. dense *
>>> block_diagonal results in block diagonal, but dense + block_diagonal
>>> results in dense, etc. It might be useful for the caller to know
>>> whether it needs to allocate a sparse, dense or block-structured
>>> array. Or maybe the polymorphic function could even do the allocation.
>>> This needs to happen recursively of course, to avoid intermediate
>>> temporaries. The compiler could easily handle that, and so could numpy
>>> when it gets lazy evaluation.
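
For concreteness, I imagine the caller-side prediction being driven by
a promotion table; toy Python, all names invented:

PROMOTION = {
    # elementwise multiply by a block-diagonal keeps the zeros zero...
    ('mul', 'block_diagonal', 'dense'): 'block_diagonal',
    # ...while addition fills them in
    ('add', 'block_diagonal', 'dense'): 'dense',
    ('mul', 'dense', 'dense'): 'dense',
    ('add', 'dense', 'dense'): 'dense',
}

def result_format(op, fmt_a, fmt_b):
    # order-insensitive lookup; unknown combinations fall back to dense
    key = (op,) + tuple(sorted((fmt_a, fmt_b)))
    return PROMOTION.get(key, 'dense')

result_format('mul', 'dense', 'block_diagonal')   # -> 'block_diagonal'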
>>
>>
>> Ah. But that depends on the computation to be performed too: a)
>> elementwise, b) axis-wise reductions, c) linear algebra...
>>
>> In my oomatrix code (please don't look at it, it's shameful) I do this using
>> multiple dispatch.
>>
>> I'd rather ignore this for as long as we can, only implementing "a[:] = ..."
>> -- I can't see how decisions here would trickle down to the API that's used
>> in the kernel, it's more like a pre-phase, and better treated orthogonally.
>>
>>
>>> I think if the heavy lifting of allocating output arrays and exporting
>>> these arrays works in numpy, then support in Cython could use that (I
>>> can already hear certain people object to more complicated array stuff
>>> in Cython :). Even better here would be an external project that each
>>> of our projects could use (I still think the nditer sorting
>>> functionality of arrays should be numpy-agnostic and externally
>>> available).
>>
>>
>> I agree with the separate project idea. It's trivial for NumPy to
>> incorporate that as one of its methods for exporting arrays, and I don't
>> think it makes sense to either build it into Cython, or outright depend on
>> NumPy.
>>
>> Here's what I'd like (working title: NumBridge?).
>>
>>   - Mission: Be the "double* + shape + strides" in a world where that is no
>> longer enough, by providing tight, focused APIs/ABIs that are usable across
>> C/Fortran/Python.
>>
>> I basically want something I can quickly acquire from a NumPy array, then
>> pass it into my C code without dragging along all the cruft that I don't
>> need.
>>
>>   - Written in pure C + specs, usable without Python
>>
>>   - PEP 3118 "done right": basically, semi-standardize the internal
>> Cython memoryview ABI and get something that's passable on the stack
>>
>>   - A block get/put API
>>
>>   - Iterator APIs
>>
>>   - Utility code for exporters and clients (iteration code, axis reordering,
>> etc.)
>>
>> Is the scope of that insane, or is it at least worth a shot to see how
>> bad it is? Beyond figuring out a small subset that can be done first,
>> and whether performance considerations come into play or not, there are
>> two complicating factors: pluggable dtypes and memory management.
>> Perhaps you could come to Oslo for a couple of days to brainstorm...
>>
>> Dag
>
> The blocks are a good idea, but I think they get fairly complicated for
> involved matrix layouts. It would be nice to have something that is
> reasonably efficient for at least most of the array storage
> mechanisms.
> I'm going to do a little brain dump below, let's see if anything is useful :)
>
> What if we take the CSR format and make it a little simpler, easier to
> handle, and better suited for other layouts? Basically, keep
> shape/strides for dense arrays, and for blocked arrays just "extend"
> the number of dimensions, i.e. a 2D blocked array becomes a 4D array,
> something like this:
>
>>>> a = np.arange(4).repeat(4).reshape(4, 4)
>>>> a
> array([[0, 0, 0, 0],
>        [1, 1, 1, 1],
>        [2, 2, 2, 2],
>        [3, 3, 3, 3]])
>
>>>> a.shape = (2, 2, 2, 2)
>>>> itemsize = a.dtype.itemsize
>>>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>>>> a
> array([[[[0, 0],
>          [1, 1]],
>
>         [[0, 0],
>          [1, 1]]],
>
>
>        [[[2, 2],
>          [3, 3]],
>
>         [[2, 2],
>          [3, 3]]]])
>
>>>> print a.flatten()
> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>
> Now, given some buffer flag (PyBUF_Sparse or something), basically use
> suboffsets with indirect dimensions, where only the last dimension is
> ever a row of contiguous memory (the entire thing may be contiguous,
> but the point is that you don't care). This row may
>
>      - be variable sized
>      - have either a single "column start" (e.g. diagonal, band/tri-
> diagonal, block diagonal, etc.), OR
>      - have a list of column indices, as long as the row itself (like
> in the CSR format)
>
> The length of each innermost row is then determined by looking at, in order:
>      - the extent as specified in the shape list
>      - if -1, and some flag is set, the extent is determined like CSR,
> i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>          ->  maybe instead of pointers indices are better, for
> serialization, GPUs, etc
>      - otherwise, use either a function pointer or perhaps a list of extents
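
For concreteness, I read the CSR-like variant as roughly this, in
Python (indices rather than pointers, names invented):

import numpy as np

data    = np.array([10., 20., 30., 40., 50., 60.])  # stored values, row by row
indptr  = np.array([0, 2, 2, 5, 6])     # row i occupies data[indptr[i]:indptr[i+1]]
indices = np.array([0, 3, 1, 2, 4, 0])  # column index of each stored value

def row(i):
    # the extent of row i is indptr[i + 1] - indptr[i], as in CSR
    return data[indptr[i]:indptr[i + 1]], indices[indptr[i]:indptr[i + 1]]

values, cols = row(2)   # -> values [30. 40. 50.], columns [1 2 4]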
>
> All these details will obviously be abstracted, allowing for easy
> iteration, but this can also be used by ufuncs operating on contiguous
> rows (often the actual storage is contiguous and kept in a single 1D
> array, so some flag could indicate contiguity as an optimization for
> unary ufuncs and flat iteration). Tiled nditer-ation could also work
> here without too big a hassle.
> When you slice, you add to the suboffset and manipulate the single
> extent or entire list of extents in that dimension, and the flag to
> determine the length using the pointer subtraction should be cleared
> (this should probably all happen through vtable functions).
>
> An exporter would also be free to use different malloced pointers,
> allowing variable-sized array support with append/pop functionality in
> multiple dimensions (if there are no active buffer views).
>
> Random access in the case where a column start is provided is still
> constant time, and done through a[i][j][k - colstart], unless you have
> discontiguous rows, in which case you are allowed a logarithmic search
> (if the size exceeds some threshold). I see scipy.sparse does a linear
> search, which is pretty slow in pathological cases:
>
> import numpy as np
> from scipy import sparse
> a = np.random.random((1, 4000000))
> b = sparse.csr_matrix(a)
> %timeit a[0, 1000]
> %timeit b[0, 1000]
>
> 10000000 loops, best of 3: 190 ns per loop
> 10 loops, best of 3: 29.3 ms per loop

Heh. That is *extremely* pathological though; nobody does that in real 
code :-)
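
That said, the logarithmic lookup you describe is cheap to provide;
something like this (toy Python, a sorted column-index array per row):

import numpy as np

indices = np.arange(0, 4000000, 40)      # stored column indices, kept sorted
data = np.random.random(len(indices))    # the corresponding values

def lookup(col):
    pos = np.searchsorted(indices, col)  # binary search, O(log n)
    if pos < len(indices) and indices[pos] == col:
        return data[pos]
    return 0.0                           # structural zero

lookup(1000)   # hits a stored column (1000 = 25 * 40)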

Here's an idea I had yesterday: To get an ND sparse array with good 
spatial locality (for local operations etc.), you could map the elements 
to a volume-filling fractal and then use a hash-map with linear probing. 
I bet it either doesn't work or has been done already :-)
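
By "volume-filling fractal" I mean something along the lines of
Morton/Z-order indexing. A toy 2D version of the element-to-key mapping,
in case it helps:

def morton_key(i, j, bits=16):
    # interleave the bits of (i, j), so elements that are close in 2D
    # space tend to get nearby keys
    key = 0
    for b in range(bits):
        key |= ((i >> b) & 1) << (2 * b)
        key |= ((j >> b) & 1) << (2 * b + 1)
    return key

store = {}                      # stand-in for the hash map
store[morton_key(3, 5)] = 1.0   # i.e. a[3, 5] = 1.0
store.get(morton_key(3, 5), 0.0)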

>
> Now, depending on the density and operation, the caller may have some
> idea of how to allocate an output array. I'm not sure how to handle
> "insertions" of new elements, but I presume through vtable put/insert
> functions. I'm also not sure how to fit this in with linear algebra
> functionality, other than providing conversions of the view.
>
> I think a disadvantage of this scheme is that you can't reorder your
> axes anymore, and many operations that are easy in the dense case are
> suddenly harder, e.g. this scheme does not allow you to go from a
> csr-like format into csc. But I think what this gives is reasonable
> generality to allow easy use in C/Fortran/Cython compiled code/numpy
> ufunc invocation, as well as allowing efficient-ish storage for
> various kinds of arrays.
>
> Does this make any sense?

I'll admit I didn't go through the finer details of your idea; let's 
deal with the below first and then I can re-read your post later.

What I'm thinking is that this seems interesting, but perhaps it lowers 
the versatility so much that it's not really worth considering, for the 
GSoC at least. If the impact isn't high enough, my hunch is that one may 
as well not bother and just do strided arrays.

I actually believe that the *likely* outcome of this discussion is to 
stick to the original plan and focus on expressions with strided arrays. 
But let's see.

I'm not sure if what brought me to the blocked case was really 
block-sparse arrays or diagonal arrays. Let's see...:

1) Memory conservation. The array is not element-sparse; it's just that 
you don't want to hold it all in memory at once.

Examples:

  - "a_array *= np.random.normal(size=a_array.shape)". The right hand 
side can be generated on the fly (making it return the same data for 
each block every time is non-trivial but can be done). If a takes up a 
good chunk of RAM, there might not even be enough memory for the 
right-hand-side except for block-by-block. (That saves bandwidth too.)

  - You want to stream from one file to another file, neither of which 
will fit in RAM. Here we really don't care about speed, just code 
reuse...it's irritating to have to manually block in such a situation.

  - You want to play with a fancy new array format (a blocked format 
that's faster for some linear algebra, say). But then you need to call 
another C library that takes data+shape+strides+dtype, which means 
making a copy you'd rather avoid -- so an alternative is to base that C 
library on the block API so that it can interface with your fancy new 
format (and anything else).
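
As for the per-block reproducible randomness in the first example: one
way is to derive the seed from the block origin. A sketch (names
invented):

import numpy as np

def normal_block(origin, shape, base_seed=42):
    # same block origin -> same seed -> same data on every pass
    seed = (base_seed * 1000003 + hash(tuple(origin))) % (2**32)
    rng = np.random.RandomState(seed)
    return rng.normal(size=shape)

b1 = normal_block((0, 4096), (1024, 1024))
b2 = normal_block((0, 4096), (1024, 1024))
assert (b1 == b2).all()   # regenerating a block gives identical data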

2) Bandwidth conservation. Numexpr, Theano, Numba and 
Cython-vectorized-expressions all already deal with this problem on one 
level.

However, I also believe there's a higher level in many programs where 
blocks come into play. The organization of many scientific codes 
essentially reads "do A on 8 GB of data, then do B on 8 GB of data"; and 
it's going to be a *long* time before you have full-program analysis to 
block that up in every case; the tools above will only be able to deal 
with some almost-trivial cases.

A block API could be used to write "pipeline programs". For instance, 
consider

a_array[:, None] * f(x_array)

where f is some rather complicated routine in Fortran that is NOT a 
ufunc -- it takes all of x_array as input and provides all the output 
"at once". But at some point it needs to write that output, and if it 
writes through an API it could do the multiplication with a_array while 
the block is in cache anyway and save a memory-bus round-trip. (Provided 
the output isn't needed as scratch space.)
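
To illustrate the push idea: f hands each output block to a consumer
callback instead of materializing the whole result. A sketch with
invented names, a trig expression standing in for the Fortran routine:

import numpy as np

def f_blockwise(x, push_block, blocksize=4096):
    for i in range(0, x.shape[0], blocksize):
        out = np.cos(x[i:i + blocksize]) ** 2  # stand-in for f's real work
        push_block(i, out)                     # block is still in cache here

def make_consumer(a, result):
    def push(i, block):
        # fuse the multiplication with the write-out of f's output
        result[:, i:i + block.shape[0]] = a[:, None] * block
    return push

x = np.linspace(0.0, 1.0, 10000)
a = np.arange(3.0)
result = np.empty((3, x.shape[0]))
f_blockwise(x, make_consumer(a, result))
assert np.allclose(result, a[:, None] * np.cos(x) ** 2)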

Summing up:

Vectorized expressions on contiguous (or strided) memory in Cython is 
pretty nice by itself; it would bring us up to the current state of the 
art of static compilation (in Fortran compilers).

But my hunch is your sparse-block proposal doesn't add enough 
flexibility to be worth the pain. Many of the cases above can't be 
covered with it.

If it's possible to identify a little nugget API that is 
forward-compatible with the above use cases (it wouldn't solve them 
perhaps, but would allow them to be solved with some extra supporting 
code), then it might be worthwhile to a) implement it in NumPy, b) 
support it for Cython vectorized expressions; and use that to support 
block transposition.

But I'm starting to feel that we can't really know how that little 
nugget API should look until the above has been explored in a little 
more depth; and then we are easily talking about too large a scope 
without tying it to a larger project (which can't really happen before 
the GSoC...). For instance, 2) suggests push/pull rather than put/get.

Dag


