[Python-Dev] Allocation of shape and strides fields in Py_buffer

Wed Dec 10 12:49:47 CET 2008

Antoine Pitrou wrote:
> In all honesty, I admit I am annoyed by all the problems with the buffer API /
> memoryview object, many of which are caused by its utterly bizarre design (and
> the fact that the design team went missing in action after imposing such a
> bizarre and complex design on us), and I'm reluctant to add yet another level of
> byzantine complexity in order to solve those problems. That explains why I may
> sound a bit angry at times :-)
> 
> If we really need to change things a lot to make them work, we should re-work
> the buffer API from the ground up, make the Py_buffer struct a true PyObject
> (that is, a true variable-length object so as to solve the shape and strides
> allocation issue) and merge it with the current memoryview implementation. It
> would make things both simpler and more flexible.

I don't see anything wrong with the PEP 3118 protocol. It does exactly
what it is designed to do: allow the number crunching crowd to share
large datasets between different libraries without copying things around
in memory. Yes, the protocol is complicated, but that is because it is
trying to handle a complicated problem.

The memoryview implementation on the other hand is pretty broken. I do
have a theory on how it ended up in such an unusable state, but I'm not
particularly inclined to share it - this kind of thing can happen
sometimes, and the important question now is how we fix it.

As I see it, memoryview is actually trying to do two things, but the
design for supporting the second of them doesn't appear to have been
adequately thought through in the current implementation.

The first use of a memoryview object is merely to allow access to the
Py_buffer of a data store. This is pretty simple, and aside from
currently getting len() wrong when itemsize > 1, memoryview isn't
terrible at it.
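
To make the len() point concrete, here is a small sketch of the relationship a
1-dimensional view ought to satisfy: len() counts items, not bytes. It assumes
a later CPython (3.3 or newer, where memoryview exposes shape, itemsize and
nbytes as attributes); the array contents are made up for illustration.

```python
from array import array

# Data store with 4 items; the item size depends on the platform's C int.
store = array('i', [10, 20, 30, 40])
view = memoryview(store)

item_count = len(view)        # number of items, i.e. shape[0]
byte_count = view.nbytes      # total bytes = items * itemsize

assert view.shape == (item_count,)
assert byte_count == item_count * view.itemsize
print(view.format, view.itemsize, item_count, byte_count)
```

The two counts only coincide when itemsize is 1, which is why the bug is easy
to miss with bytes and bytearray.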

If we left memoryview at that, it *would* just be a simple wrapper around
a Py_buffer struct, and its implementation wouldn't be difficult at all.

Where it gets a bit more complicated is if we want to support slices
(rather than just indexing) on memoryview objects. When you do that, the
memoryview is no longer a simple wrapper around the Py_buffer of the
underlying data store, because it isn't exposing the whole data store
any more - it is only exposing part of it.

Requesting access to only part of a data buffer is NOT part of the PEP
3118 API, and it doesn't need to be: it can be part of a separate object
that adapts from the underlying data store to the desired subview.
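
As a rough illustration of what "exposing only part of the data store" means,
the sketch below slices a memoryview over a bytearray and writes through the
slice. It assumes a CPython version where memoryview slicing and item
assignment behave as intended; the buffer contents are arbitrary.

```python
store = bytearray(b"abcdefgh")
whole = memoryview(store)     # describes the whole data store
part = whole[2:6]             # a subview of the same memory, not a copy

part[0] = ord("X")            # writing through the subview...
assert store == bytearray(b"abXdefgh")   # ...changes the underlying store

# The subview must report its own length, not that of the store.
assert len(whole) == 8
assert len(part) == 4
```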

The object that is meant to be performing at least simple 1-dimensional
cases of that adaptation is memoryview (or more to the point, memoryview
slices), but it currently *sucks* at this because it relies too heavily
on the info in the Py_buffer that it got from the underlying object.
That Py_buffer describes the *whole* data store, but a memoryview slice
may only be exposing part of it - so while the info in the Py_buffer is
accurate for the underlying object, it is *not* accurate for the
memoryview itself.

Fixing that for the 1-dimensional case shouldn't actually be all that
difficult - the memoryview just needs to maintain its own shape[0] entry
that reflects the number of items in the view rather than the number in
the underlying object.
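
A minimal sketch of that idea, written in Python rather than C: a hypothetical
SubView adapter (the class and its attribute names are invented here, not part
of any existing API) that keeps its own start offset and its own shape[0]
instead of trusting the values describing the whole store.

```python
from array import array


class SubView:
    """Hypothetical 1-D adapter that tracks its own start and shape[0]."""

    def __init__(self, base, start, stop):
        self.base = memoryview(base)          # describes the *whole* store
        total = self.base.shape[0]
        start, stop, _ = slice(start, stop).indices(total)
        self.start = start
        self.shape = (max(stop - start, 0),)  # items in *this* view only

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, i):
        if not 0 <= i < self.shape[0]:
            raise IndexError("index out of range for subview")
        return self.base[self.start + i]


store = array('d', [0.0, 1.5, 3.0, 4.5, 6.0])
sub = SubView(store, 1, 4)
assert len(sub) == 3                  # not 5, the length of the store
assert sub[0] == 1.5 and sub[2] == 4.5
```

Bounds checks are made against the view's own shape, so an index that is valid
for the store but outside the slice is rejected.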

The multi-dimensional cases get pretty tricky though, since they will
almost always end up dealing with non-contiguous data. The PEP 3118
protocol is up to handling the task, but the implementation of the index
mapping to handle these multi-dimensional cases is highly non-trivial,
and probably best left to third party libraries like numpy.
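
To give a feel for the index mapping involved, here is a simplified sketch of
strided addressing for a 2-D block of bytes. The helper name item_offset is
made up for this example, and the sketch ignores suboffsets and negative
strides, both of which a full implementation would have to handle.

```python
def item_offset(index, strides, start=0):
    """Byte offset of a multi-dimensional index, given per-axis strides."""
    return start + sum(i * s for i, s in zip(index, strides))


# 3 rows x 4 columns of single bytes, stored row by row (C order).
data = bytes(range(12))
strides = (4, 1)

# Element [1][2] of the full array.
assert data[item_offset((1, 2), strides)] == 6

# Subview of columns 1..2 in every row: same strides, new start and shape.
sub_start, sub_shape = 1, (3, 2)
rows = [
    [data[item_offset((r, c), strides, sub_start)] for c in range(sub_shape[1])]
    for r in range(sub_shape[0])
]
assert rows == [[1, 2], [5, 6], [9, 10]]
```

The subview touches bytes 1, 2, 5, 6, 9 and 10 of the store, so it is not
contiguous even though the underlying array is.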

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------