[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat May 12 16:38:18 EDT 2012


On 05/11/2012 10:10 PM, Mark Wiebe wrote:
> On Fri, May 11, 2012 at 6:13 AM, Dag Sverre Seljebotn
> <d.s.seljebotn at astro.uio.no <mailto:d.s.seljebotn at astro.uio.no>> wrote:
>
>     (NumPy devs: I know, I get too many ideas. But this time I *really*
>     believe in it, I think this is going to be *huge*. And if Mark F. likes
>     it it's not going to be without manpower; and as his mentor I'd pitch in
>     too here and there.)
>
>     (Mark F.: I believe this is *very* relevant to your GSoC. I certainly
>     don't want to micro-manage your GSoC, just have your take.)
>
>     Travis, thank you very much for those good words in the "NA-mask
>     interactions..." thread. It put most of my concerns away. If anybody is
>     leaning towards for opaqueness because of its OOP purity, I want to
>     refer to C++ and its walled-garden of ideological purity -- it has,
>     what, 3-4 different OOP array libraries, neither of which is able to
>     out-compete the other. Meanwhile the rest of the world happily
>     cooperates using pointers, strides, CSR and CSC.
>
>     Now, there are limits to what you can do with strides and pointers.
>     Noone's denying the need for more. In my mind that's an API where you
>     can do fetch_block and put_block of cache-sized, N-dimensional blocks on
>     an array; but it might be something slightly different.
>
>     Here's what I'm asking: DO NOT simply keep extending ndarray and the
>     NumPy C API to deal with this issue.
>
>     What we need is duck-typing/polymorphism at the C level. If you keep
>     extending ndarray and the NumPy C API, what we'll have is a one-to-many
>     relationship: One provider of array technology, multiple consumers (with
>     hooks, I'm sure, but all implementations of the hook concept in the
>     NumPy world I've seen so far are a total disaster!).
>
>
> There is similar intent behind an idea I raised last summer here:
>
> http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056945.html
>
> I stopped doing anything on it when considering the scope of it with the
> linear algebra functions included like Pauli suggested. Nathaniel did an
> experimental  implementation of some parts of the idea in Python here:

Hmm. Ufuncs. I'll think more about that, and read the thread again, to 
see if I can discover the connection to my proposal above. (Re your 
thread, I always wished for something based on multiple dispatch rather 
than a priority battle to resolve NumPy operators; you really only have 
a partial ordering between libraries wanting to interact with ndarrays, 
not a total ordering)

Re linear algebra, I have, since I use it in my PhD, been working on and 
off the past 4 months on a library for polymorphic linear algebra. 
There's something analogue to the array and ufunc distinction of NumPy, 
but the "ufunc" part is designed very differently (and based on multiple 
dispatch).

Anyway, what I was talking about was basically a new C interface to 
arrays that should be more tolerant for implementations that need to be 
opaque (like compressed arrays, database connections, my own custom 
sparse format...). If that somehow relates with ufuncs, fine with me :-)



>
> https://github.com/njsmith/numpyNEP/blob/master/numpyNEP.py#L107
>
>     What I think we need instead is something like PEP 3118 for the
>     "abstract" array that is only available block-wise with getters and
>     setters. On the Cython list we've decided that what we want for CEP 1000
>     (for boxing callbacks etc.) is to extend PyTypeObject with our own
>     fields; we could create CEP 1001 to solve this issue and make any Python
>     object an exporter of "block-getter/setter-arrays" (better name needed).
>
>     What would be exported is (of course) a simple vtable:
>
>     typedef struct {
>          int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t
>     *lower_right, ...);
>          ...
>     } block_getter_setter_array_vtable;
>
>     Let's please discuss the details *after* the fundamentals. But the
>     reason I put void* there instead of PyObject* is that I hope this could
>     be used beyond the Python world (say, Python<->Julia); the void* would
>     be handed to you at the time you receive the vtable (however we handle
>     that).
>
>     I think this would fit neatly in Mark F.'s GSoC (Mark F.?), because you
>     could embed the block-transposition that's needed for efficient "arr +
>     arr.T" at this level.
>
>     Imagine being able to do this in Cython:
>
>     a[...] = b + c * d
>
>     and have that essentially compile to the numexpr blocked approach, *but*
>     where b, c, and d can have whatever type that exports CEP 1001? So c
>     could be a "diagonal" array which uses O(n) storage to export O(n^2)
>     elements, for instance, and the unrolled Cython code never needs to
>     know.
>
>     As far as NumPy goes, something along these lines should hopefully mean
>     that new C code being written doesn't rely so much on what exactly goes
>     into "ndarray" and what goes into other classes; so that we don't get
>     the same problem again that we do now with code that doesn't use PEP
>     3118.
>
>
> This general idea is very good. I think PEP 3118 captures a lot of the
> essence of the ndarray, but there's a lot of potential generality that
> it doesn't handle, such as the "diagonal" array or pluggable dtypes.

I think the long-term generality is a lot bigger than that:

  - Compressed arrays
  - Interfaces to HDF files
  - Distributed-memory arrays
  - Blocked arrays
  - Semi-sparse and sparse (diagonal, but also triangular, symmetric, 
repeating, ...)
  - Lazy evaluation: "generating_multiply(mydata, zero_mask)"

While what me and Mark F. cares about is computational efficiency for 
current arrays, this generality is almost unavoidable.

In fact -- from ideas Travis have posted to this list earlier + 
continuum.io, I assume this wider scope is something you and Travis must 
necessarily have thought a lot about.

Anyway, I agree with Mark F. that right design is probably a new, 
low-level, (very small!) C library with no Python dependencies that just 
provides some APIs to try to standardize this "how to communicate array 
data" at a more basic level than NumPy (and much smaller and different 
scope than the various "distill NumPy to a C core" things that's been 
talked about the past years, something I have zero interest in).

If NumPy devs are interested in this discussion on a detailed level, 
please say so; me and Mark F might go to Skype (or even meet in person) 
to get higher bandwidth than ML, and if more people should be invited 
then it's good to know.

Dag



More information about the NumPy-Discussion mailing list