[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

mark florisson markflorisson88 at gmail.com
Tue May 15 06:42:17 EDT 2012


On 14 May 2012 21:36, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
> On 05/14/2012 06:31 PM, mark florisson wrote:
>>
>> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>>>
>>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>>>
>>>> On 11 May 2012 12:13, Dag Sverre Seljebotn <d.s.seljebotn at astro.uio.no> wrote:
>>>>>
>>>>> (NumPy devs: I know, I get too many ideas. But this time I *really*
>>>>> believe in it; I think this is going to be *huge*. And if Mark F.
>>>>> likes it, it's not going to be without manpower; as his mentor I'd
>>>>> pitch in too here and there.)
>>>>>
>>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I
>>>>> certainly don't want to micro-manage your GSoC, just have your take.)
>>>>>
>>>>> Travis, thank you very much for those good words in the "NA-mask
>>>>> interactions..." thread. It put most of my concerns away. If anybody
>>>>> is leaning towards opaqueness because of its OOP purity, I want to
>>>>> refer to C++ and its walled garden of ideological purity -- it has,
>>>>> what, 3-4 different OOP array libraries, none of which is able to
>>>>> out-compete the others. Meanwhile the rest of the world happily
>>>>> cooperates using pointers, strides, CSR and CSC.
>>>>>
>>>>> Now, there are limits to what you can do with strides and pointers.
>>>>> No one's denying the need for more. In my mind that's an API where
>>>>> you can do fetch_block and put_block of cache-sized, N-dimensional
>>>>> blocks on an array; but it might be something slightly different.
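>>>>>
>>>>> To make that concrete, here is a rough Python sketch of the kind of
>>>>> API I have in mind (every name here is made up, purely to illustrate
>>>>> the shape of the thing):
>>>>>
>>>>> import numpy as np
>>>>>
>>>>> class BlockExporter:
>>>>>     """Wraps a dense array but exposes it only block-wise."""
>>>>>     def __init__(self, data):
>>>>>         self.data = data
>>>>>
>>>>>     def fetch_block(self, upper_left, lower_right):
>>>>>         # Hand out a contiguous, cache-sized copy of the region.
>>>>>         (i0, j0), (i1, j1) = upper_left, lower_right
>>>>>         return np.ascontiguousarray(self.data[i0:i1, j0:j1])
>>>>>
>>>>>     def put_block(self, upper_left, block):
>>>>>         # Write a block back at the given position.
>>>>>         i0, j0 = upper_left
>>>>>         self.data[i0:i0 + block.shape[0],
>>>>>                   j0:j0 + block.shape[1]] = block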
>>>>>
>>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the
>>>>> NumPy C API to deal with this issue.
>>>>>
>>>>> What we need is duck-typing/polymorphism at the C level. If you keep
>>>>> extending ndarray and the NumPy C API, what we'll have is a
>>>>> one-to-many relationship: one provider of array technology, multiple
>>>>> consumers (with hooks, I'm sure, but all implementations of the hook
>>>>> concept in the NumPy world I've seen so far are a total disaster!).
>>>>>
>>>>> What I think we need instead is something like PEP 3118 for the
>>>>> "abstract" array that is only available block-wise with getters and
>>>>> setters. On the Cython list we've decided that what we want for CEP
>>>>> 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our
>>>>> own fields; we could create CEP 1001 to solve this issue and make any
>>>>> Python object an exporter of "block-getter/setter-arrays" (better
>>>>> name needed).
>>>>>
>>>>> What would be exported is (of course) a simple vtable:
>>>>>
>>>>> typedef struct {
>>>>>     int (*get_block)(void *ctx, ssize_t *upper_left,
>>>>>                      ssize_t *lower_right, ...);
>>>>>     ...
>>>>> } block_getter_setter_array_vtable;
>>>>>
>>>>> Let's please discuss the details *after* the fundamentals. But the
>>>>> reason I put void* there instead of PyObject* is that I hope this
>>>>> could be used beyond the Python world (say, Python<->Julia); the
>>>>> void* would be handed to you at the time you receive the vtable
>>>>> (however we handle that).
>>>>
>>>> I suppose it would also be useful to have some way for the caller to
>>>> predict the output format polymorphically. E.g. dense * block_diagonal
>>>> results in block_diagonal, but dense + block_diagonal results in
>>>> dense, etc. It might be useful for the caller to know whether it needs
>>>> to allocate a sparse, dense or block-structured array. Or maybe the
>>>> polymorphic function could even do the allocation. This needs to
>>>> happen recursively of course, to avoid intermediate temporaries. The
>>>> compiler could easily handle that, and so could numpy when it gets
>>>> lazy evaluation.
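>>>>
>>>> Something like this table-driven sketch is what I mean (the format
>>>> names and promotion rules here are just illustrative):
>>>>
>>>> # Predict the output format of a binary op without computing it.
>>>> PROMOTION = {
>>>>     ('dense', 'block_diagonal', '*'): 'block_diagonal',
>>>>     ('dense', 'block_diagonal', '+'): 'dense',
>>>>     ('block_diagonal', 'block_diagonal', '+'): 'block_diagonal',
>>>> }
>>>>
>>>> def result_format(fmt_a, fmt_b, op):
>>>>     # Look up both argument orders; fall back to dense when
>>>>     # nothing better is known.
>>>>     return PROMOTION.get((fmt_a, fmt_b, op),
>>>>                          PROMOTION.get((fmt_b, fmt_a, op), 'dense'))
>>>>
>>>> assert result_format('block_diagonal', 'dense', '*') == 'block_diagonal'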
>>>
>>> Ah. But that also depends on the computation to be performed: a)
>>> elementwise, b) axis-wise reductions, c) linear algebra...
>>>
>>> In my oomatrix code (please don't look at it, it's shameful) I do this
>>> using multiple dispatch.
>>>
>>> I'd rather ignore this for as long as we can, only implementing "a[:] =
>>> ..." -- I can't see how decisions here would trickle down to the API
>>> that's used in the kernel; it's more like a pre-phase, and better
>>> treated orthogonally.
>>>
>>>> I think if the heavy lifting of allocating output arrays and exporting
>>>> these arrays works in numpy, then support in Cython could use that (I
>>>> can already hear certain people object to more complicated array stuff
>>>> in Cython :). Even better would be an external project that each of
>>>> our projects could use (I still think the nditer sorting functionality
>>>> of arrays should be numpy-agnostic and externally available).
>>>
>>> I agree with the separate project idea. It's trivial for NumPy to
>>> incorporate that as one of its methods for exporting arrays, and I
>>> don't think it makes sense to either build it into Cython or outright
>>> depend on NumPy.
>>>
>>> Here's what I'd like (working title: NumBridge?).
>>>
>>>  - Mission: Be the "double* + shape + strides" in a world where that
>>> is no longer enough, by providing tight, focused APIs/ABIs that are
>>> usable across C/Fortran/Python.
>>>
>>> I basically want something I can quickly acquire from a NumPy array, then
>>> pass it into my C code without dragging along all the cruft that I don't
>>> need.
>>>
>>>  - Written in pure C + specs, usable without Python
>>>
>>>  - PEP 3118 "done right": basically semi-standardize the internal
>>> Cython memoryview ABI and get something that's passable on the stack
>>>
>>>  - A block get/put API
>>>
>>>  - Iterator APIs
>>>
>>>  - Utility code for exporters and clients (iteration code, axis
>>> reordering, etc.)
>>>
>>> Is the scope of that insane, or is it at least worth a shot to see how
>>> bad it is? Beyond figuring out a small subset that can be done first,
>>> and whether performance considerations must be taken into account or
>>> not, there are two complicating factors: pluggable dtypes and memory
>>> management. Perhaps you could come to Oslo for a couple of days to
>>> brainstorm...
>>>
>>> Dag
>>
>> The blocks are a good idea, but I think they get fairly complicated for
>> complicated matrix layouts. It would be nice to have something that is
>> reasonably efficient for at least most of the array storage mechanisms.
>> I'm going to do a little brain dump below; let's see if anything is
>> useful :)
>>
>> What if we basically take the CSR format and make it a little simpler,
>> easier to handle, and better suited for other layouts? Basically, keep
>> shape/strides for dense arrays, and for blocked arrays just "extend"
>> the number of dimensions, i.e. a 2D blocked array becomes a 4D array,
>> something like this:
>>
>>>>> a = np.arange(4).repeat(4).reshape(4, 4);
>>>>> a
>>
>> array([[0, 0, 0, 0],
>>        [1, 1, 1, 1],
>>        [2, 2, 2, 2],
>>        [3, 3, 3, 3]])
>>
>>>>> a.shape = (2, 2, 2, 2)
>>>>> itemsize = a.dtype.itemsize
>>>>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>>>>> a
>>
>> array([[[[0, 0],
>>          [1, 1]],
>>
>>         [[0, 0],
>>          [1, 1]]],
>>
>>
>>        [[[2, 2],
>>          [3, 3]],
>>
>>         [[2, 2],
>>          [3, 3]]]])
>>
>>>>> print a.flatten()
>>
>> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>>
>> Now, given some buffer flag (PyBUF_Sparse or something), basically use
>> suboffsets with indirect dimensions, where only the last dimension is
>> ever a row of contiguous memory (the entire thing may be contiguous,
>> but the point is that you don't care). This row may
>>
>>     - be variable sized
>>     - have a single "column start" (e.g. diagonal, band/tri-diagonal,
>>       block diagonal, etc.), OR
>>     - have a list of column indices, the length of the row (like in
>>       the CSR format)
>>
>> The length of each innermost row is then determined by looking at, in
>> order:
>>     - the extent as specified in the shape list
>>     - if that is -1, and some flag is set, the extent is determined as
>>       in CSR, i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>>         -> maybe indices are better than pointers here, for
>>            serialization, GPUs, etc.
>>     - otherwise, use either a function pointer or perhaps a list of
>>       extents
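>>
>> In Python terms (indices rather than raw pointers; the names are
>> invented just for illustration), the rule would look something like:
>>
>> import numpy as np
>>
>> def row_extent(shape_extent, indptr, extents, i):
>>     if shape_extent != -1:
>>         return shape_extent               # fixed extent from the shape
>>     if indptr is not None:
>>         return indptr[i + 1] - indptr[i]  # CSR-style difference
>>     return extents[i]                     # explicit per-row extent list
>>
>> indptr = np.array([0, 3, 3, 7])           # rows of length 3, 0 and 4
>> assert row_extent(-1, indptr, None, 2) == 4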
>>
>> All these details will obviously be abstracted away, allowing for easy
>> iteration, but the layout can also be used directly by ufuncs operating
>> on contiguous rows (and since the actual storage is often a contiguous
>> 1D array, some flag could indicate contiguity as an optimization for
>> unary ufuncs and flat iteration). Tiled nditer-ation could also work
>> here without too big a hassle.
>> When you slice, you add to the suboffset and manipulate the single
>> extent or the entire list of extents in that dimension, and the flag
>> that says to determine the length by pointer subtraction should be
>> cleared (this should probably all happen through vtable functions).
>>
>> An exporter would also be free to use different malloced pointers,
>> allowing variable-sized array support with append/pop functionality in
>> multiple dimensions (if there are no active buffer views).
>>
>> Random access in the case where a column start is provided is still
>> constant time, done through a[i][j][k - colstart]; if you have
>> discontiguous rows you are allowed a logarithmic search instead (if the
>> size exceeds some threshold). I see scipy.sparse does a linear search,
>> which is pretty slow in pathological cases:
>>
>> from scipy import sparse
>> a = np.random.random((1, 4000000))
>> b = sparse.csr_matrix(a)
>> %timeit a[0, 1000]
>> %timeit b[0, 1000]
>>
>> 10000000 loops, best of 3: 190 ns per loop
>> 10 loops, best of 3: 29.3 ms per loop
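>>
>> The logarithmic version is straightforward with a binary search over
>> the column indices of the row (a sketch, assuming the indices within
>> each row are sorted, as in scipy.sparse.csr_matrix):
>>
>> import numpy as np
>>
>> def csr_get(indptr, indices, data, i, j):
>>     # Binary-search row i's (sorted) column indices for column j.
>>     start, stop = indptr[i], indptr[i + 1]
>>     k = start + np.searchsorted(indices[start:stop], j)
>>     if k < stop and indices[k] == j:
>>         return data[k]
>>     return 0.0
>>
>> # e.g. csr_get(b.indptr, b.indices, b.data, 0, 1000)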
>
> Heh. That is *extremely* pathological though; nobody does that in real
> code :-)
>
> Here's an idea I had yesterday: to get an ND sparse array with good
> spatial locality (for local operations etc.), you could map the elements
> to a volume-filling fractal and then use a hash map with linear probing.
> I bet it either doesn't work or has been done already :-)
>
>
>>
>> Now, depending on the density and operation, the caller may have some
>> idea of how to allocate an output array. I'm not sure how to handle
>> "insertions" of new elements, but I presume through vtable put/insert
>> functions. I'm also not sure how to fit this in with linear algebra
>> functionality, other than providing conversions of the view.
>>
>> I think a disadvantage of this scheme is that you can't reorder your
>> axes anymore, and many operations that are easy in the dense case are
>> suddenly harder, e.g. this scheme does not allow you to go from a
>> csr-like format into csc. But I think what this gives is reasonable
>> generality to allow easy use in C/Fortran/Cython compiled code/numpy
>> ufunc invocation, as well as allowing efficient-ish storage for
>> various kinds of arrays.
>>
>> Does this make any sense?
>
>
> I'll admit I didn't go through the finer details of your idea; let's deal
> with the below first and then I can re-read your post later.
>
> What I'm thinking is that this seems interesting, but perhaps it lowers
> the versatility so much that it's not really worth considering, at least
> for the GSoC. If the impact isn't high enough, my hunch is that one may
> as well not bother and just do strided arrays.
>
> I actually believe that the *likely* outcome of this discussion is to stick
> to the original plan and focus on expressions with strided arrays. But let's
> see.
>
> I'm not sure if what brought me to the blocked case was really block-sparse
> arrays or diagonal arrays. Let's see...:
>
> 1) Memory conservation. The array is not element-sparse; it's just that
> you don't want to hold it all in memory at once.
>
> Examples:
>
>  - "a_array *= np.random.normal(size=a_array.shape)". The right hand side
> can be generated on the fly (making it return the same data for each block
> every time is non-trivial but can be done). If a takes up a good chunk of
> RAM, there might not even be enough memory for the right-hand-side except
> for block-by-block. (That saves bandwidth too.)
>
>  - You want to stream from one file to another file, neither of which will
> fit in RAM. Here we really don't care about speed, just code reuse...it's
> irritating to have to manually block in such a situation.
>
>  - You want to play with a fancy new array format (a blocked format
> that's faster for some linear algebra, say). But then you need to call
> another C library that takes data+shape+stride+dtype. Then you need to
> make a copy, which you'd rather avoid -- so an alternative is to base
> that C library on the block API so that it can interface with your
> fancy new format (and anything else).
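>
> A sketch of that first example in Python (the blocking scheme and the
> per-block seeding are chosen arbitrarily, just to show the idea):
>
> import numpy as np
>
> def mul_normal_blockwise(a, blocklen=1024, seed=1234):
>     # Walk the first axis in chunks, so only one block of the
>     # right-hand side ever exists in memory at a time.
>     for start in range(0, a.shape[0], blocklen):
>         stop = min(start + blocklen, a.shape[0])
>         # Seeding per block makes the "random" right-hand side
>         # reproducible: the same block returns the same data.
>         rng = np.random.RandomState(seed + start)
>         a[start:stop] *= rng.normal(size=a[start:stop].shape)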
>
> 2) Bandwidth conservation. Numexpr, Theano, Numba and
> Cython-vectorized-expressions all already deal with this problem on one
> level.
>
> However, I also believe there's a higher level in many programs where
> blocks come into play. The organization of many scientific codes
> essentially reads "do A on 8 GB of data, then do B on 8 GB of data", and
> it's going to be a *long* time before we have full-program analysis that
> can block that up in every case; the tools above will only be able to
> deal with some almost-trivial cases.
>
> A block API could be used to write "pipeline programs". For instance,
> consider
>
> a_array[:, None] * f(x_array)
>
> where f is some rather complicated routine in Fortran that is NOT a
> ufunc -- it takes all of x_array as input and produces all of its output
> "at once". But at some point it needs to write that output, and if it
> writes to an API, it could do the multiplication with a_array while each
> block is still in cache and save a memory-bus round-trip. (Provided the
> output isn't needed as scratch space.)
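>
> A toy version of such a pipeline (1D instead of the broadcast above, and
> f_blocks is a made-up stand-in for the Fortran routine, pushing its
> output one block at a time):
>
> import numpy as np
>
> def f_blocks(x, blocklen=1024):
>     # Stand-in for f: push the result out block by block.
>     for start in range(0, x.shape[0], blocklen):
>         yield start, np.sin(x[start:start + blocklen])
>
> def pipeline(a, x, out, blocklen=1024):
>     # Multiply each block while it is (hopefully) still in cache,
>     # instead of materializing f(x) in full first.
>     for start, block in f_blocks(x, blocklen):
>         out[start:start + len(block)] = a[start:start + len(block)] * block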
>
> Summing up:
>
> Vectorized expressions on contiguous (or strided) memory in Cython are
> pretty nice by themselves; they would bring us up to the current state
> of the art of static compilation (in Fortran compilers).
>
> But my hunch is that your sparse-block proposal doesn't add enough
> flexibility to be worth the pain. Many of the cases above can't be
> covered by it.
>
> If it's possible to identify a little nugget API that is
> forward-compatible with the above use cases (it wouldn't solve them,
> perhaps, but would allow them to be solved with some extra supporting
> code), then it might be worth it to a) implement it in NumPy, and b)
> support it for Cython vectorized expressions; and use that to support
> block transposition.
>
> But I'm starting to feel that we can't really know how that little
> nugget API should look until the above has been explored in a little
> more depth; and then we're easily talking about too large a scope
> without tying it to a larger project (which can't really happen before
> the GSoC...). For instance, 2) suggests push/pull rather than put/get.
>
> Dag

I assumed as much; in any case I was going to start with dense arrays,
simply because they are in common use and well-defined at this point.
Maybe what we really want is just lazy evaluation that works for
different array layouts and storage mechanisms, and a smart JIT that can
evaluate our linear algebra and array expressions in an optimal fashion
(memory/cache-wise, but also out-of-core-wise). That probably means
writing everything in a high-level language, because even if your
Fortran routines themselves used your lazy evaluation library, your
linear algebra library certainly wouldn't :) Anyway, I'm going to be
pragmatic and investigate how much of Theano we can reuse in Cython now.


