[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Mon May 14 16:39:50 EDT 2012


On 05/14/2012 10:36 PM, Dag Sverre Seljebotn wrote:
> On 05/14/2012 06:31 PM, mark florisson wrote:
>>> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>>>
>>>> On 11 May 2012 12:13, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>>>>>
>>>>> (NumPy devs: I know, I get too many ideas. But this time I *really*
>>>>> believe
>>>>> in it, I think this is going to be *huge*. And if Mark F. likes it it's
>>>>> not
>>>>> going to be without manpower; and as his mentor I'd pitch in too here and
>>>>> there.)
>>>>>
>>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly
>>>>> don't
>>>>> want to micro-manage your GSoC, just have your take.)
>>>>>
>>>>> Travis, thank you very much for those good words in the "NA-mask
>>>>> interactions..." thread. It put most of my concerns away. If anybody is
>>>>> leaning towards opaqueness because of its OOP purity, I want to refer
>>>>> to
>>>>> C++ and its walled-garden of ideological purity -- it has, what, 3-4
>>>>> different OOP array libraries, none of which is able to out-compete the
>>>>> others. Meanwhile the rest of the world happily cooperates using pointers,
>>>>> strides, CSR and CSC.
>>>>>
>>>>> Now, there are limits to what you can do with strides and pointers.
>>>>> No one's
>>>>> denying the need for more. In my mind that's an API where you can do
>>>>> fetch_block and put_block of cache-sized, N-dimensional blocks on an
>>>>> array;
>>>>> but it might be something slightly different.
>>>>>
>>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the
>>>>> NumPy C
>>>>> API to deal with this issue.
>>>>>
>>>>> What we need is duck-typing/polymorphism at the C level. If you keep
>>>>> extending ndarray and the NumPy C API, what we'll have is a one-to-many
>>>>> relationship: One provider of array technology, multiple consumers (with
>>>>> hooks, I'm sure, but all implementations of the hook concept in the NumPy
>>>>> world I've seen so far are a total disaster!).
>>>>>
>>>>> What I think we need instead is something like PEP 3118 for the
>>>>> "abstract"
>>>>> array that is only available block-wise with getters and setters. On the
>>>>> Cython list we've decided that what we want for CEP 1000 (for boxing
>>>>> callbacks etc.) is to extend PyTypeObject with our own fields; we could
>>>>> create CEP 1001 to solve this issue and make any Python object an
>>>>> exporter
>>>>> of "block-getter/setter-arrays" (better name needed).
>>>>>
>>>>> What would be exported is (of course) a simple vtable:
>>>>>
>>>>> typedef struct {
>>>>>      int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right,
>>>>> ...);
>>>>>      ...
>>>>> } block_getter_setter_array_vtable;
>>>>>
>>>>> Let's please discuss the details *after* the fundamentals. But the reason
>>>>> I
>>>>> put void* there instead of PyObject* is that I hope this could be used
>>>>> beyond the Python world (say, Python<->Julia); the void* would be handed
>>>>> to
>>>>> you at the time you receive the vtable (however we handle that).
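>>>>>
>>>>> To make the fundamentals concrete, here's a minimal pure-Python mockup of
>>>>> the block get/put idea (the names DenseBlockExporter, get_block and
>>>>> put_block are mine, purely illustrative -- the real thing would be the C
>>>>> vtable above):
>>>>>
>>>>> import numpy as np
>>>>>
>>>>> class DenseBlockExporter(object):
>>>>>     # Illustrative exporter: serves cache-sized blocks of a dense array.
>>>>>     def __init__(self, arr):
>>>>>         self.arr = arr
>>>>>
>>>>>     def get_block(self, upper_left, lower_right):
>>>>>         # Copy out arr[upper_left:lower_right] (exclusive bounds).
>>>>>         ix = tuple(slice(a, b) for a, b in zip(upper_left, lower_right))
>>>>>         return self.arr[ix].copy()
>>>>>
>>>>>     def put_block(self, upper_left, lower_right, block):
>>>>>         ix = tuple(slice(a, b) for a, b in zip(upper_left, lower_right))
>>>>>         self.arr[ix] = block
>>>>>
>>>>> def scale_2d(exporter, shape, blockshape, factor):
>>>>>     # A consumer sees only get_block/put_block, never pointers+strides,
>>>>>     # so the same loop works for any exporter of the interface.
>>>>>     for i in range(0, shape[0], blockshape[0]):
>>>>>         for j in range(0, shape[1], blockshape[1]):
>>>>>             ul = (i, j)
>>>>>             lr = (min(i + blockshape[0], shape[0]),
>>>>>                   min(j + blockshape[1], shape[1]))
>>>>>             exporter.put_block(ul, lr, factor * exporter.get_block(ul, lr))
>>>>>
>>>>> A file- or database-backed exporter could implement the same two methods
>>>>> and the consumer loop would not change.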
>>>>
>>>>
>>>> I suppose it would also be useful to have some way of predicting the
>>>> output format polymorphically for the caller. E.g. dense *
>>>> block_diagonal results in block diagonal, but dense + block_diagonal
>>>> results in dense, etc. It might be useful for the caller to know
>>>> whether it needs to allocate a sparse, dense or block-structured
>>>> array. Or maybe the polymorphic function could even do the allocation.
>>>> This needs to happen recursively of course, to avoid intermediate
>>>> temporaries. The compiler could easily handle that, and so could numpy
>>>> when it gets lazy evaluation.
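>>>>
>>>> A sketch of what such output-format prediction could look like (the format
>>>> names and promotion rules below are invented, purely for illustration):
>>>>
>>>> ADD_RULES = {   # a + b: zeros fill in, so mixing with dense gives dense
>>>>     ('dense', 'dense'): 'dense',
>>>>     ('dense', 'block_diagonal'): 'dense',
>>>>     ('block_diagonal', 'block_diagonal'): 'block_diagonal',
>>>> }
>>>> MUL_RULES = {   # elementwise a * b: zeros win, sparsity is preserved
>>>>     ('dense', 'dense'): 'dense',
>>>>     ('dense', 'block_diagonal'): 'block_diagonal',
>>>>     ('block_diagonal', 'block_diagonal'): 'block_diagonal',
>>>> }
>>>>
>>>> def result_format(rules, fmt_a, fmt_b):
>>>>     # Rules are symmetric, so look the pair up in either order.
>>>>     key = (fmt_a, fmt_b) if (fmt_a, fmt_b) in rules else (fmt_b, fmt_a)
>>>>     return rules[key]
>>>>
>>>> result_format(MUL_RULES, 'dense', 'block_diagonal')  # -> 'block_diagonal'
>>>> result_format(ADD_RULES, 'dense', 'block_diagonal')  # -> 'dense'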
>>>
>>>
>>> Ah. But that also depends on the computation to be performed: a)
>>> elementwise, b) axis-wise reductions, c) linear algebra...
>>>
>>> In my oomatrix code (please don't look at it, it's shameful) I do this using
>>> multiple dispatch.
>>>
>>> I'd rather ignore this for as long as we can, only implementing "a[:] = ..."
>>> -- I can't see how decisions here would trickle down to the API that's used
>>> in the kernel, it's more like a pre-phase, and better treated orthogonally.
>>>
>>>
>>>> I think if the heavy lifting of allocating output arrays and exporting
>>>> these arrays works in numpy, then support in Cython could use that (I
>>>> can already hear certain people object to more complicated array stuff
>>>> in Cython :). Even better here would be an external project that each
>>>> our projects could use (I still think the nditer sorting functionality
>>>> of arrays should be numpy-agnostic and externally available).
>>>
>>>
>>> I agree with the separate project idea. It's trivial for NumPy to
>>> incorporate that as one of its methods for exporting arrays, and I don't
>>> think it makes sense to either build it into Cython, or outright depend on
>>> NumPy.
>>>
>>> Here's what I'd like (working title: NumBridge?).
>>>
>>>    - Mission: Be the "double* + shape + strides" in a world where that is no
>>> longer enough, by providing tight, focused APIs/ABIs that are usable across
>>> C/Fortran/Python.
>>>
>>> I basically want something I can quickly acquire from a NumPy array, then
>>> pass it into my C code without dragging along all the cruft that I don't
>>> need.
>>>
>>>    - Written in pure C + specs, usable without Python
>>>
>>>    - PEP 3118 "done right", basically semi-standardize the internal Cython
>>> memoryview ABI and get something that's passable on stack
>>>
>>>    - Get block get/put API
>>>
>>>    - Iterator APIs
>>>
>>>    - Utility code for exporters and clients (iteration code, axis reordering,
>>> etc.)
>>>
>>> Is the scope of that insane, or is it at least worth a shot to see how bad
>>> it is? Beyond figuring out a small subset that can be done first, and
>>> whether performance considerations must be taken into account or not, there
>>> are two complicating factors: pluggable dtypes and memory management.
>>> Perhaps you could
>>> come to Oslo for a couple of days to brainstorm...
>>>
>>> Dag
>>
>> The blocks are a good idea, but I think fairly complicated for
>> complicated matrix layouts. It would be nice to have something that is
>> reasonably efficient for at least most of the array storage
>> mechanisms.
>> I'm going to do a little brain dump below, let's see if anything is useful :)
>>
>> What if we basically take the CSR format and make it a little simpler,
>> easier to handle, and better suited for other layouts. Basically, keep
>> shape/strides for dense arrays, and for blocked arrays just "extend"
>> your number of dimensions, i.e. a 2D blocked array becomes a 4D array,
>> something like this:
>>
>>>>> a = np.arange(4).repeat(4).reshape(4, 4);
>>>>> a
>> array([[0, 0, 0, 0],
>>        [1, 1, 1, 1],
>>        [2, 2, 2, 2],
>>        [3, 3, 3, 3]])
>>
>>>>> a.shape = (2, 2, 2, 2)
>>>>> itemsize = a.dtype.itemsize
>>>>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>>>>> a
>> array([[[[0, 0],
>>          [1, 1]],
>>
>>         [[0, 0],
>>          [1, 1]]],
>>
>>
>>        [[[2, 2],
>>          [3, 3]],
>>
>>         [[2, 2],
>>          [3, 3]]]])
>>
>>>>> print a.flatten()
>> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
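>>
>> (For reference, the same blocked view can be built without assigning to
>> .shape/.strides directly, using numpy's stride tricks:)
>>
>> import numpy as np
>> from numpy.lib.stride_tricks import as_strided
>>
>> a = np.arange(4).repeat(4).reshape(4, 4)
>> n = a.dtype.itemsize
>> blocked = as_strided(a, shape=(2, 2, 2, 2),
>>                      strides=(8 * n, 2 * n, 4 * n, 1 * n))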
>>
>> Now, given some buffer flag (PyBUF_Sparse or something), basically use
>> suboffsets with indirect dimensions, where only the last dimension is
>> ever a row of contiguous memory (the entire thing may be
>> contiguous, but the point is that you don't care). This row may
>>
>>       - be variable sized
>>       - have either a single "column start" (e.g. diagonal, band/tri-diagonal,
>>         block diagonal, etc), OR
>>       - a list of column indices, the length of the row (like in the CSR format)
>>
>> The length of each innermost row is then determined by looking at, in order
>> (a small sketch follows this list):
>>       - the extent as specified in the shape list
>>       - if -1, and some flag is set, the extent is determined like CSR,
>>         i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>>         -> maybe indices are better than pointers here, for serialization, GPUs, etc
>>       - otherwise, use either a function pointer or perhaps a list of extents
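>>
>> Here's the promised sketch of that lookup order; every field name below is
>> hypothetical, it just mirrors the description:
>>
>> class InnerDim(object):
>>     # Hypothetical descriptor of the innermost dimension of a view.
>>     def __init__(self, extent, csr_like=False, indptr=None,
>>                  extents=None, extent_func=None, suboffset=0, nrows=0):
>>         self.extent = extent            # from the shape list; -1 = look further
>>         self.csr_like = csr_like        # flag: derive extents CSR-style
>>         self.indptr = indptr            # CSR-style row-start indices
>>         self.extents = extents          # explicit per-row extent list
>>         self.extent_func = extent_func  # fallback callback
>>         self.suboffset = suboffset      # where each row begins
>>         self.nrows = nrows
>>
>> def row_extent(d, i):
>>     if d.extent != -1:
>>         return d.extent                       # 1) fixed extent from the shape
>>     if d.csr_like:
>>         return d.indptr[i + 1] - d.indptr[i]  # 2) CSR-style subtraction
>>     if d.extents is not None:
>>         return d.extents[i]                   # 3a) explicit extent list
>>     return d.extent_func(i)                   # 3b) function pointer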
>>
>> All these details will obviously be abstracted, allowing for easy
>> iteration, but it can also be used by ufuncs operating on contiguous
>> rows (often, since the actual storage is contiguous and stored in a 1D
>> array, some flag could indicate contiguity as an optimization for
>> unary ufuncs and flat iteration). Tiled nditer-ation could also work
>> here without too big a hassle.
>> When you slice, you add to the suboffset and manipulate the single
>> extent or entire list of extents in that dimension, and the flag to
>> determine the length using the pointer subtraction should be cleared
>> (this should probably all happen through vtable functions).
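>>
>> Continuing the InnerDim sketch from above, column-slicing could go roughly
>> like this:
>>
>> def slice_columns(d, start, stop):
>>     d.suboffset += start   # every row now begins 'start' items later
>>     # Clamp each extent to the slice, materializing an explicit list:
>>     d.extents = [min(max(row_extent(d, i) - start, 0), stop - start)
>>                  for i in range(d.nrows)]
>>     # Extents are explicit now; clear the CSR pointer-subtraction flag:
>>     d.csr_like = False
>>     d.extent = -1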
>>
>> An exporter would also be free to use different malloced pointers,
>> allowing variable sized array support with append/pop functionality in
>> multiple dimensions (if there are no active buffer views).
>>
>> Random access in the case where a column start is provided is still
>> constant time, and done through a[i][j][k - colstart], unless you have
>> discontiguous rows, in which case you are allowed a logarithmic search
>> (if the size exceeds some threshold). I see scipy.sparse does a linear
>> search, which is pretty slow in pathological cases:
>>
>> from scipy import sparse
>> a = np.random.random((1, 4000000))
>> b = sparse.csr_matrix(a)
>> %timeit a[0, 1000]
>> %timeit b[0, 1000]
>>
>> 10000000 loops, best of 3: 190 ns per loop
>> 10 loops, best of 3: 29.3 ms per loop
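>>
>> (Incidentally, a logarithmic lookup is easy to bolt onto scipy's CSR arrays
>> with np.searchsorted, since column indices within a row are sorted -- call
>> b.sort_indices() first if unsure:)
>>
>> import numpy as np
>>
>> def csr_get(b, i, j):
>>     # O(log nnz-in-row) random access into a scipy.sparse.csr_matrix.
>>     lo, hi = b.indptr[i], b.indptr[i + 1]
>>     k = lo + np.searchsorted(b.indices[lo:hi], j)
>>     if k < hi and b.indices[k] == j:
>>         return b.data[k]
>>     return 0.0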
>
> Heh. That is *extremely* pathological though, nobody does that in real
> code :-)
>
> Here's an idea I had yesterday: To get an ND sparse array with good
> spatial locality (for local operations etc.), you could map the elements
> to a volume-filling fractal and then use a hash-map with linear probing.
> I bet it either doesn't work or has been done already :-)
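>
> (For anyone who wants to try: the Z-order/Morton curve is one concrete such
> curve, and the key is cheap to compute. A toy sketch, with a plain dict
> standing in for the linear-probing hash map:)
>
> def morton2d(i, j, bits=16):
>     # Interleave the bits of i and j; nearby (i, j) pairs mostly get
>     # nearby keys, which is what gives the spatial locality.
>     key = 0
>     for b in range(bits):
>         key |= ((i >> b) & 1) << (2 * b)
>         key |= ((j >> b) & 1) << (2 * b + 1)
>     return key
>
> elements = {}                              # sparse 2D array: {key: value}
> elements[morton2d(3, 5)] = 1.0
> print(elements.get(morton2d(3, 5), 0.0))   # 1.0
> print(elements.get(morton2d(3, 6), 0.0))   # 0.0 (absent -> implicit zero)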
>
>>
>> Now, depending on the density and operation, the caller may have some
>> idea of how to allocate an output array. I'm not sure how to handle
>> "insertions" of new elements, but I presume through vtable put/insert
>> functions. I'm also not sure how to fit this in with linear algebra
>> functionality, other than providing conversions of the view.
>>
>> I think a disadvantage of this scheme is that you can't reorder your
>> axes anymore, and many operations that are easy in the dense case are
>> suddenly harder, e.g. this scheme does not allow you to go from a
>> csr-like format into csc. But I think what this gives is reasonable
>> generality to allow easy use in C/Fortran/Cython compiled code/numpy
>> ufunc invocation, as well as allowing efficient-ish storage for
>> various kinds of arrays.
>>
>> Does this make any sense?
>
> I'll admit I didn't go through the finer details of your idea; let's
> deal with the below first and then I can re-read your post later.
>
> What I'm thinking is that this seems interesting, but perhaps it lowers
> the versatility so much that it's not really worth considering, for the
> GSoC at least. If the impact isn't high enough, my hunch is that one may
> as well not bother and just do strided arrays.
>
> I actually believe that the *likely* outcome of this discussion is to
> stick to the original plan and focus on expressions with strided arrays.
> But let's see.
>
> I'm not sure if what brought me to the blocked case was really
> block-sparse arrays or diagonal arrays. Let's see...:
>
> 1) Memory conservation. The array would NOT be element-sparse, it's
> just that you don't want to hold it all in memory at once.
>
> Examples:
>
>    - "a_array *= np.random.normal(size=a_array.shape)". The right hand
> side can be generated on the fly (making it return the same data for
> each block every time is non-trivial but can be done). If a takes up a
> good chunk of RAM, there might not even be enough memory for the
> right-hand-side except for block-by-block. (That saves bandwidth too.)
>
>    - You want to stream from one file to another file, neither of which
> will fit in RAM. Here we really don't care about speed, just code
> reuse...it's irritating to have to manually block in such a situation.

Actually, s/file/database/g, just to avoid comments about np.memmap.

Dag

>
>    - You want to play with a new fancy array format (a blocked format
> that's faster for some linear algebra, say). But then you need to call
> another C library that takes data+shape+stride+dtype. Then you need to
> make a copy, which you'd rather avoid -- so an alternative is to make
> that C library be based on the block API so that it can interface with
> your fancy new format (and anything else).
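>
> Here's the sketch promised for the first example: seed an RNG from the block
> number, so re-generating any given block always yields the same data (the
> block size and seeding scheme below are arbitrary):
>
> import numpy as np
>
> BLOCK = 1 << 20   # elements per block
>
> def rhs_block(base_seed, blockno, n):
>     # Same (base_seed, blockno) -> same values every time, so the full
>     # right-hand side never has to exist in memory at once.
>     rng = np.random.RandomState(base_seed + blockno)
>     return rng.normal(size=n)
>
> def scale_by_normal_inplace(a_flat, base_seed=1234):
>     for blockno, start in enumerate(range(0, a_flat.shape[0], BLOCK)):
>         stop = min(start + BLOCK, a_flat.shape[0])
>         a_flat[start:stop] *= rhs_block(base_seed, blockno, stop - start)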
>
> 2) Bandwidth conservation. Numexpr, Theano, Numba and
> Cython-vectorized-expressions all already deal with this problem on one
> level.
>
> However, I also believe there's a higher level in many programs where
> blocks come into play. The organization of many scientific codes
> essentially reads "do A on 8 GB of data, then do B on 8 GB of data"; and
> it's going to be a *long* time before you have full-program analysis to
> block that up in every case; the tools above will be able to deal with
> some almost-trivial cases only.
>
> A block API could be used to write "pipeline programs". For instance,
> consider
>
> a_array[:, None] * f(x_array)
>
> and f is some rather complicated routine in Fortran that is NOT a ufunc
> -- it takes all of x_array as input and provides all the output "at
> once"; but at some point it needs to do the writing of the output, and
> if writing to an API it could do the multiplication with a_array while
> the block is in cache anyway and save a memory bus round-trip. (Provided
> the output isn't needed as scratch space.)
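>
> A toy push-style version of that pipeline (f_writes_blocks stands in for the
> Fortran routine, and np.exp stands in for its actual math -- both invented
> here for illustration):
>
> import numpy as np
>
> def f_writes_blocks(x, sink, block=4096):
>     # The "Fortran routine": computes its output chunk by chunk and pushes
>     # each chunk to 'sink' instead of returning one big array.
>     for start in range(0, x.shape[0], block):
>         stop = min(start + block, x.shape[0])
>         sink(start, stop, np.exp(x[start:stop]))
>
> def fused_outer_product(a, x):
>     # a[:, None] * f(x), block by block: the multiply happens while each
>     # freshly written block of f(x) is still in cache.
>     out = np.empty((a.shape[0], x.shape[0]))
>     def sink(start, stop, fx_block):
>         out[:, start:stop] = a[:, None] * fx_block
>     f_writes_blocks(x, sink)
>     return out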
>
> Summing up:
>
> Vectorized expressions on contiguous (or strided) memory in Cython is
> pretty nice by itself; it would bring us up to the current state of the
> art of static compilation (in Fortran compilers).
>
> But my hunch is your sparse-block proposal doesn't add enough
> flexibility to be worth the pain. Many of the cases above can't be
> covered with it.
>
> If it's possible to identify a little nugget API that is
> forward-compatible with the above usecases (it wouldn't solve them
> perhaps, but allow them to be solved with some extra supporting code),
> then it might be worth it to a) implement it in NumPy, b) support it for
> Cython vectorized expressions; and use that to support block-transposition.
>
> But I'm starting to feel that we can't really know how that little
> nugget API should really look until the above has been explored in a
> little more depth; and by then we are easily talking about too large a scope
> without tying it to a larger project (which can't really happen before the
> GSoC..). For instance, 2) suggests push/pull rather than put/get.
>
> Dag



