[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer

Dag Sverre Seljebotn d.s.seljebotn at astro.uio.no
Sat May 12 17:35:30 EDT 2012

On 05/11/2012 03:25 PM, mark florisson wrote:
> On 11 May 2012 12:13, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no>  wrote:
>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe
>> in it, I think this is going to be *huge*. And if Mark F. likes it it's not
>> going to be without manpower; and as his mentor I'd pitch in too here and
>> there.)
>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't
>> want to micro-manage your GSoC, just have your take.)
>> Travis, thank you very much for those good words in the "NA-mask
>> interactions..." thread. It put most of my concerns away. If anybody is
>> leaning towards for opaqueness because of its OOP purity, I want to refer to
>> C++ and its walled-garden of ideological purity -- it has, what, 3-4
>> different OOP array libraries, neither of which is able to out-compete the
>> other. Meanwhile the rest of the world happily cooperates using pointers,
>> strides, CSR and CSC.

BTW, this seems like pure hyperbole out of context...the backstory is 
that I've talked to people who do sparse linear algebra in C++ using 
Boost, iterators, etc. rather than something that could conceivably be 
exported with a binary interface; primarily out of a wish to be more 
elegant than 'legacy' programming languages that have to resort to 
things like CSR. Sure, that's elegant modern C++, but at what cost?

>> Now, there are limits to what you can do with strides and pointers. Noone's
>> denying the need for more. In my mind that's an API where you can do
>> fetch_block and put_block of cache-sized, N-dimensional blocks on an array;
>> but it might be something slightly different.
>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C
>> API to deal with this issue.
>> What we need is duck-typing/polymorphism at the C level. If you keep
>> extending ndarray and the NumPy C API, what we'll have is a one-to-many
>> relationship: One provider of array technology, multiple consumers (with
>> hooks, I'm sure, but all implementations of the hook concept in the NumPy
>> world I've seen so far are a total disaster!).
>> What I think we need instead is something like PEP 3118 for the "abstract"
>> array that is only available block-wise with getters and setters. On the
>> Cython list we've decided that what we want for CEP 1000 (for boxing
>> callbacks etc.) is to extend PyTypeObject with our own fields; we could
>> create CEP 1001 to solve this issue and make any Python object an exporter
>> of "block-getter/setter-arrays" (better name needed).
>> What would be exported is (of course) a simple vtable:
>> typedef struct {
>>     int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right,
>> ...);
>>     ...
>> } block_getter_setter_array_vtable;
> Interesting idea, I think I like it. So I suppose these blocks could
> even be used in a sparse context, where it returns one-sized blocks
> for un-surrounded elements. This means returned blocks can always be
> variably sized, correct? So will each block be associated with an
> index space? Otherwise, how do we know the element-wise
> correspondence? Will there be a default element for missing
> (unreturned) values?
> Am I correct in assuming elements in returned blocks always return
> elements in element-wise order? In case you manually have to block
> other parts of an array (lets say it's stored in Fortran order), then
> the variable sized block nature may be complicating the tiling
> pattern.

  - My hunch was that you'd always get something C-contiguous and the 
exporter would copy/transpose. But perhaps one wouldn't loose much by 
making the block strided and have the consumer do a copy; isn't that 
mostly about which side of the boundary to call the copying utility?

  - I think iterators over blocks are necesarry as well

  - Not sure about variable-sized blocks. The array could already be 
stored in cache-sized blocks, and the numerical loops could be optimized 
for some block sizes, so there must certainly be some handshaking 
mechanism; hopefully it can be made rather elegant. See next point for 
one ingredient.

  - What must be established is whether a) one should always copy, b) 
one can copy if one wants to at a negligible cost, or c) computational 
core reaching into the buffer of an exporter is better. This would 
trickle down into lots of other decisions I feel (like if you always go 
for a rectangular block, or whether getting entire rows like with the 
NumPy iteration API is sometimes more valuable).

In addition to the Fortran compilers, one thing to study is 
GotoBLAS2/OpenBLAS, which has code for getting a "high-quality memory 
region", since that apparently made an impact. Not sure how much Intel 
platforms though; other platforms have slower TLBs, in fact I think on 
non-Intel platforms, TLB is what affected block size, not the cache size.

>> Let's please discuss the details *after* the fundamentals. But the reason I
>> put void* there instead of PyObject* is that I hope this could be used
>> beyond the Python world (say, Python<->Julia); the void* would be handed to
>> you at the time you receive the vtable (however we handle that).
> Yes, we should definitely not stick to objects here.

In fact, if we can get this done right, I'd have uses for such an API in 
my pure C/C++ libraries that I want to be usable from Fortran without a 
Python dependency...

>> I think this would fit neatly in Mark F.'s GSoC (Mark F.?), because you
>> could embed the block-transposition that's needed for efficient "arr +
>> arr.T" at this level.
>> Imagine being able to do this in Cython:
>> a[...] = b + c * d
>> and have that essentially compile to the numexpr blocked approach, *but*
>> where b, c, and d can have whatever type that exports CEP 1001? So c could
>> be a "diagonal" array which uses O(n) storage to export O(n^2) elements, for
>> instance, and the unrolled Cython code never needs to know.
> I assume random accesses will happen through some kind of
> B-tree/sparse index like mechanism?

Here I don't follow. Accesses happens polymorphically, in whatever way 
the array in question wants to. I already wrote some examples in the 
email to Mark W.; on the computational side, there's certainly a lot of 
different sparse formats, regular C/Fortan, blocked formats (more 
efficient linear algebra), volume-filling fractals (for spatial locality 
in N dimensions, probably not that helpful in single-node contexts 


More information about the NumPy-Discussion mailing list