[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat May 12 17:43:29 EDT 2012

On 05/12/2012 11:35 PM, Dag Sverre Seljebotn wrote:
> On 05/11/2012 03:25 PM, mark florisson wrote:
>> On 11 May 2012 12:13, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>   wrote:
>>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe
>>> in it, I think this is going to be *huge*. And if Mark F. likes it it's not
>>> going to be without manpower; and as his mentor I'd pitch in too here and
>>> there.)
>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't
>>> want to micro-manage your GSoC, just have your take.)
>>> Travis, thank you very much for those good words in the "NA-mask
>>> interactions..." thread. It put most of my concerns away. If anybody is
>>> leaning towards for opaqueness because of its OOP purity, I want to refer to
>>> C++ and its walled-garden of ideological purity -- it has, what, 3-4
>>> different OOP array libraries, neither of which is able to out-compete the
>>> other. Meanwhile the rest of the world happily cooperates using pointers,
>>> strides, CSR and CSC.
> BTW, this seems like pure hyperbole out of context...the backstory is
> that I've talked to people who do sparse linear algebra in C++ using
> Boost, iterators, etc. rather than something that could conceivably be
> exported with a binary interface; primarily out of a wish to be more
> elegant than 'legacy' programming languages that have to resort to
> things like CSR. Sure, that's elegant modern C++, but at what cost?
>>> Now, there are limits to what you can do with strides and pointers. Noone's
>>> denying the need for more. In my mind that's an API where you can do
>>> fetch_block and put_block of cache-sized, N-dimensional blocks on an array;
>>> but it might be something slightly different.
>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C
>>> API to deal with this issue.
>>> What we need is duck-typing/polymorphism at the C level. If you keep
>>> extending ndarray and the NumPy C API, what we'll have is a one-to-many
>>> relationship: One provider of array technology, multiple consumers (with
>>> hooks, I'm sure, but all implementations of the hook concept in the NumPy
>>> world I've seen so far are a total disaster!).
>>> What I think we need instead is something like PEP 3118 for the "abstract"
>>> array that is only available block-wise with getters and setters. On the
>>> Cython list we've decided that what we want for CEP 1000 (for boxing
>>> callbacks etc.) is to extend PyTypeObject with our own fields; we could
>>> create CEP 1001 to solve this issue and make any Python object an exporter
>>> of "block-getter/setter-arrays" (better name needed).
>>> What would be exported is (of course) a simple vtable:
>>> typedef struct {
>>>      int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right,
>>> ...);
>>>      ...
>>> } block_getter_setter_array_vtable;
>> Interesting idea, I think I like it. So I suppose these blocks could
>> even be used in a sparse context, where it returns one-sized blocks
>> for un-surrounded elements. This means returned blocks can always be
>> variably sized, correct? So will each block be associated with an
>> index space? Otherwise, how do we know the element-wise
>> correspondence? Will there be a default element for missing
>> (unreturned) values?
>> Am I correct in assuming elements in returned blocks always return
>> elements in element-wise order? In case you manually have to block
>> other parts of an array (lets say it's stored in Fortran order), then
>> the variable sized block nature may be complicating the tiling
>> pattern.
>    - My hunch was that you'd always get something C-contiguous and the
> exporter would copy/transpose. But perhaps one wouldn't loose much by
> making the block strided and have the consumer do a copy; isn't that
> mostly about which side of the boundary to call the copying utility?
>    - I think iterators over blocks are necesarry as well
>    - Not sure about variable-sized blocks. The array could already be
> stored in cache-sized blocks, and the numerical loops could be optimized
> for some block sizes, so there must certainly be some handshaking
> mechanism; hopefully it can be made rather elegant. See next point for
> one ingredient.
>    - What must be established is whether a) one should always copy, b)
> one can copy if one wants to at a negligible cost, or c) computational
> core reaching into the buffer of an exporter is better. This would
> trickle down into lots of other decisions I feel (like if you always go
> for a rectangular block, or whether getting entire rows like with the
> NumPy iteration API is sometimes more valuable).
> In addition to the Fortran compilers, one thing to study is
> GotoBLAS2/OpenBLAS, which has code for getting a "high-quality memory
> region", since that apparently made an impact. Not sure how much Intel
> platforms though; other platforms have slower TLBs, in fact I think on
> non-Intel platforms, TLB is what affected block size, not the cache size.
>>> Let's please discuss the details *after* the fundamentals. But the reason I
>>> put void* there instead of PyObject* is that I hope this could be used
>>> beyond the Python world (say, Python<->Julia); the void* would be handed to
>>> you at the time you receive the vtable (however we handle that).
>> Yes, we should definitely not stick to objects here.
> In fact, if we can get this done right, I'd have uses for such an API in
> my pure C/C++ libraries that I want to be usable from Fortran without a
> Python dependency...
>>> I think this would fit neatly in Mark F.'s GSoC (Mark F.?), because you
>>> could embed the block-transposition that's needed for efficient "arr +
>>> arr.T" at this level.
>>> Imagine being able to do this in Cython:
>>> a[...] = b + c * d
>>> and have that essentially compile to the numexpr blocked approach, *but*
>>> where b, c, and d can have whatever type that exports CEP 1001? So c could
>>> be a "diagonal" array which uses O(n) storage to export O(n^2) elements, for
>>> instance, and the unrolled Cython code never needs to know.
>> I assume random accesses will happen through some kind of
>> B-tree/sparse index like mechanism?
> Here I don't follow. Accesses happens polymorphically, in whatever way
> the array in question wants to. I already wrote some examples in the
> email to Mark W.; on the computational side, there's certainly a lot of
> different sparse formats, regular C/Fortan, blocked formats (more
> efficient linear algebra), volume-filling fractals (for spatial locality
> in N dimensions, probably not that helpful in single-node contexts
> though)...

An example that's actually non-contrived and interesting is H-matrices 
and H^2-matrices ("hierarchical matrices"). Basically it's a way of 
compressing matrices so that you can get approximate inverses, 
matrix-multiply etc. in O(log(n)^p n), with p <=2.

Briefly, you first construct a quad-tree of your matrix; then you 
approximate the blocks that can be approximated using low-rank 
approximations (SVDs or similar). And for certain matrices that appears 
to do wonders. (This is mostly used for matrices that produce O(n^2) 
elements non-trivially from O(n) parameters).

(http://hlib.org is a starting point, but it's just theorem upon 
theorem, mostly in a PDE setting).


More information about the NumPy-Discussion mailing list