[Numpy-discussion] Fixing issue of future opaqueness of ndarray this summer
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Sat May 12 17:35:30 EDT 2012
On 05/11/2012 03:25 PM, mark florisson wrote:
> On 11 May 2012 12:13, Dag Sverre Seljebotn<d.s.seljebotn at astro.uio.no> wrote:
>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe
>> in it, I think this is going to be *huge*. And if Mark F. likes it it's not
>> going to be without manpower; and as his mentor I'd pitch in too here and
>> there.)
>>
>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't
>> want to micro-manage your GSoC, just have your take.)
>>
>> Travis, thank you very much for those good words in the "NA-mask
>> interactions..." thread. It put most of my concerns away. If anybody is
>> leaning towards for opaqueness because of its OOP purity, I want to refer to
>> C++ and its walled-garden of ideological purity -- it has, what, 3-4
>> different OOP array libraries, neither of which is able to out-compete the
>> other. Meanwhile the rest of the world happily cooperates using pointers,
>> strides, CSR and CSC.
BTW, this seems like pure hyperbole out of context...the backstory is
that I've talked to people who do sparse linear algebra in C++ using
Boost, iterators, etc. rather than something that could conceivably be
exported with a binary interface; primarily out of a wish to be more
elegant than 'legacy' programming languages that have to resort to
things like CSR. Sure, that's elegant modern C++, but at what cost?
>>
>> Now, there are limits to what you can do with strides and pointers. Noone's
>> denying the need for more. In my mind that's an API where you can do
>> fetch_block and put_block of cache-sized, N-dimensional blocks on an array;
>> but it might be something slightly different.
>>
>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C
>> API to deal with this issue.
>>
>> What we need is duck-typing/polymorphism at the C level. If you keep
>> extending ndarray and the NumPy C API, what we'll have is a one-to-many
>> relationship: One provider of array technology, multiple consumers (with
>> hooks, I'm sure, but all implementations of the hook concept in the NumPy
>> world I've seen so far are a total disaster!).
>>
>> What I think we need instead is something like PEP 3118 for the "abstract"
>> array that is only available block-wise with getters and setters. On the
>> Cython list we've decided that what we want for CEP 1000 (for boxing
>> callbacks etc.) is to extend PyTypeObject with our own fields; we could
>> create CEP 1001 to solve this issue and make any Python object an exporter
>> of "block-getter/setter-arrays" (better name needed).
>>
>> What would be exported is (of course) a simple vtable:
>>
>> typedef struct {
>> int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right,
>> ...);
>> ...
>> } block_getter_setter_array_vtable;
>
> Interesting idea, I think I like it. So I suppose these blocks could
> even be used in a sparse context, where it returns one-sized blocks
> for un-surrounded elements. This means returned blocks can always be
> variably sized, correct? So will each block be associated with an
> index space? Otherwise, how do we know the element-wise
> correspondence? Will there be a default element for missing
> (unreturned) values?
>
> Am I correct in assuming elements in returned blocks always return
> elements in element-wise order? In case you manually have to block
> other parts of an array (lets say it's stored in Fortran order), then
> the variable sized block nature may be complicating the tiling
> pattern.
- My hunch was that you'd always get something C-contiguous and the
exporter would copy/transpose. But perhaps one wouldn't loose much by
making the block strided and have the consumer do a copy; isn't that
mostly about which side of the boundary to call the copying utility?
- I think iterators over blocks are necesarry as well
- Not sure about variable-sized blocks. The array could already be
stored in cache-sized blocks, and the numerical loops could be optimized
for some block sizes, so there must certainly be some handshaking
mechanism; hopefully it can be made rather elegant. See next point for
one ingredient.
- What must be established is whether a) one should always copy, b)
one can copy if one wants to at a negligible cost, or c) computational
core reaching into the buffer of an exporter is better. This would
trickle down into lots of other decisions I feel (like if you always go
for a rectangular block, or whether getting entire rows like with the
NumPy iteration API is sometimes more valuable).
In addition to the Fortran compilers, one thing to study is
GotoBLAS2/OpenBLAS, which has code for getting a "high-quality memory
region", since that apparently made an impact. Not sure how much Intel
platforms though; other platforms have slower TLBs, in fact I think on
non-Intel platforms, TLB is what affected block size, not the cache size.
>
>> Let's please discuss the details *after* the fundamentals. But the reason I
>> put void* there instead of PyObject* is that I hope this could be used
>> beyond the Python world (say, Python<->Julia); the void* would be handed to
>> you at the time you receive the vtable (however we handle that).
>
> Yes, we should definitely not stick to objects here.
In fact, if we can get this done right, I'd have uses for such an API in
my pure C/C++ libraries that I want to be usable from Fortran without a
Python dependency...
>
>> I think this would fit neatly in Mark F.'s GSoC (Mark F.?), because you
>> could embed the block-transposition that's needed for efficient "arr +
>> arr.T" at this level.
>>
>> Imagine being able to do this in Cython:
>>
>> a[...] = b + c * d
>>
>> and have that essentially compile to the numexpr blocked approach, *but*
>> where b, c, and d can have whatever type that exports CEP 1001? So c could
>> be a "diagonal" array which uses O(n) storage to export O(n^2) elements, for
>> instance, and the unrolled Cython code never needs to know.
>
> I assume random accesses will happen through some kind of
> B-tree/sparse index like mechanism?
Here I don't follow. Accesses happens polymorphically, in whatever way
the array in question wants to. I already wrote some examples in the
email to Mark W.; on the computational side, there's certainly a lot of
different sparse formats, regular C/Fortan, blocked formats (more
efficient linear algebra), volume-filling fractals (for spatial locality
in N dimensions, probably not that helpful in single-node contexts
though)...
Dag
More information about the NumPy-Discussion
mailing list