[Python-Dev] PEP 3118: Extended buffer protocol (new version)

Thu Apr 19 06:40:28 CEST 2007

Carl Banks wrote:
> Ok, I've thought quite a bit about this, and I have an idea that I 
> think will be ok with you, and I'll be able to drop my main 
> objection.  It's not a big change, either.  The key is to explicitly 
> say whether the flag allows or requires.  But I made a few other 
> changes as well.
I'm good with using an identifier to differentiate between an "allowed" 
flag and a "require" flag.   I'm not a big fan of 
VERY_LONG_IDENTIFIER_NAMES though.  Just enough to understand what it 
means but not so much that it takes forever to type and uses up 
horizontal real-estate.

We use flags in NumPy quite a bit, and I'm obviously trying to adapt 
some of this to the general case here, but I'm biased by my 10 years of 
experience with the way I think about NumPy arrays.

Thanks for helping out and offering your fresh approach.   I like a lot 
of what you've come up with.  There are a few modifications I would 
make, though.

>
> First of all, let me define how I'm using the word "contiguous": it's 
> a single buffer with no gaps.  So, if you were to do this: 
> "memset(bufinfo->buf,0,bufinfo->len)", you would not touch any data 
> that isn't being exported.

Sure, we call this NPY_ONESEGMENT in NumPy-speak, though, because 
contiguous could be NPY_C_CONTIGUOUS or NPY_F_CONTIGUOUS.   We also 
don't use the terms ROW_MAJOR and COLUMN_MAJOR and so I'm not a big fan 
of bringing them up in the Python space because the NumPy community has 
already learned the C_ and F_ terminology which also generalizes to 
multiple-dimensions more clearly without using 2-d concepts.
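
To illustrate how the C_ and F_ terms extend past two dimensions, here is a
minimal sketch of the two checks for an N-dimensional array described by
shape, strides (in bytes), and itemsize. The helper names and signatures are
made up for this example; they are not part of NumPy or of the proposal, and
empty arrays are not special-cased.

```c
#include <stddef.h>

/* Hypothetical helper: 1 if the LAST index varies fastest (C order). */
static int is_c_contiguous(int ndim, const ptrdiff_t *shape,
                           const ptrdiff_t *strides, ptrdiff_t itemsize)
{
    ptrdiff_t expected = itemsize;
    for (int i = ndim - 1; i >= 0; i--) {
        /* A dimension of length 1 is never stepped over, so any stride fits. */
        if (shape[i] != 1 && strides[i] != expected)
            return 0;
        expected *= shape[i];
    }
    return 1;
}

/* Hypothetical helper: 1 if the FIRST index varies fastest (Fortran order). */
static int is_f_contiguous(int ndim, const ptrdiff_t *shape,
                           const ptrdiff_t *strides, ptrdiff_t itemsize)
{
    ptrdiff_t expected = itemsize;
    for (int i = 0; i < ndim; i++) {
        if (shape[i] != 1 && strides[i] != expected)
            return 0;
        expected *= shape[i];
    }
    return 1;
}
```

The only difference between the two is the direction of the loop, which is
why the terms need no 2-d notion of "row" or "column". A one-dimensional
array with stride equal to itemsize satisfies both.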
>
> Without further ado, here is my proposal:
>
>
> ------
>
> With no flags, the PyObject_GetBuffer will raise an exception if the 
> buffer is not direct, contiguous, and one-dimensional.  Here are the 
> flags and how they affect that:

I'm not sure what you mean by "direct" here.  But, this looks like the 
Py_BUF_SIMPLE case (which was a named-constant for 0) in my proposal.    
The exporter receiving no flags would need to return a simple buffer 
(and it wouldn't need to fill in the format character either --- 
valuable information for the exporter to know).
>
> Py_BUF_REQUIRE_WRITABLE - Raise exception if the buffer isn't writable.
WRITEABLE is an alternative spelling and the one that NumPy uses.   So, 
either include both of these as alternatives or just use WRITEABLE.
>
> Py_BUF_REQUIRE_READONLY - Raise exception if the buffer is writable.
Or if the memory is writeable and the object can't make it read-only.
>
> Py_BUF_ALLOW_NONCONTIGUOUS - Allow noncontiguous buffers.  (This turns 
> on "shape" and "strides".)
>
Fine.
> Py_BUF_ALLOW_MULTIDIMENSIONAL - Allow multidimensional buffers.  (Also 
> turns on "shape" and "strides".)
Just use ND instead of MULTIDIMENSIONAL   and only turn on shape if it 
is present.
>
> (Neither of the above two flags implies the other.)
>

> Py_BUF_ALLOW_INDIRECT - Allow indirect buffers.  Implies 
> Py_BUF_ALLOW_NONCONTIGUOUS and Py_BUF_ALLOW_MULTIDIMENSIONAL. (Turns 
> on "shape", "strides", and "suboffsets".)
If we go with this consumer-oriented naming scheme, I like indirect also.
>
> Py_BUF_REQUIRE_CONTIGUOUS_C_ARRAY or Py_BUF_REQUIRE_ROW_MAJOR - Raise 
> an exception if the array isn't a contiguous array in C 
> (row-major) format.
>
> Py_BUF_REQUIRE_CONTIGUOUS_FORTRAN_ARRAY or Py_BUF_REQUIRE_COLUMN_MAJOR 
> - Raise an exception if the array isn't a contiguous array in 
> Fortran (column-major) format.
Just name them C_CONTIGUOUS and F_CONTIGUOUS like in NumPy.
>
> Py_BUF_ALLOW_NONCONTIGUOUS, Py_BUF_REQUIRE_CONTIGUOUS_C_ARRAY, and 
> Py_BUF_REQUIRE_CONTIGUOUS_FORTRAN_ARRAY all conflict with each other, 
> and an exception should be raised if more than one are set.
>
> (I would go with ROW_MAJOR and COLUMN_MAJOR: even though the terms 
> only make sense for 2D arrays, I believe the terms are commonly 
> generalized to other dimensions.)
As I mentioned there is already a well-established history with NumPy.  
We've dealt with this issue already.
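
As a rough sketch of how the flag set and the conflict rule quoted above could
look as bit masks: the names below combine the quoted proposal with the
shorter spellings suggested in this reply (WRITEABLE, ND, C_CONTIGUOUS,
F_CONTIGUOUS), and the bit values and the flags_conflict() helper are
assumptions made only for this example. Nothing here is settled API.

```c
/* Hypothetical names and bit values, for illustration only. */
enum {
    Py_BUF_SIMPLE               = 0,
    Py_BUF_REQUIRE_WRITEABLE    = 1 << 0,
    Py_BUF_REQUIRE_READONLY     = 1 << 1,
    Py_BUF_ALLOW_NONCONTIGUOUS  = 1 << 2,
    Py_BUF_ALLOW_ND             = 1 << 3,
    /* INDIRECT implies NONCONTIGUOUS and ND, as in the quoted proposal. */
    Py_BUF_ALLOW_INDIRECT       = (1 << 4) | (1 << 2) | (1 << 3),
    Py_BUF_REQUIRE_C_CONTIGUOUS = 1 << 5,
    Py_BUF_REQUIRE_F_CONTIGUOUS = 1 << 6,
    /* Pseudo-flag from the quoted proposal. */
    Py_BUF_ALLOW_STRIDED        = (1 << 2) | (1 << 3)
};

/* Returns 1 if more than one of the three mutually exclusive
   contiguity flags is set, else 0. */
static int flags_conflict(int flags)
{
    int n = 0;
    n += (flags & Py_BUF_ALLOW_NONCONTIGUOUS) != 0;
    n += (flags & Py_BUF_REQUIRE_C_CONTIGUOUS) != 0;
    n += (flags & Py_BUF_REQUIRE_F_CONTIGUOUS) != 0;
    return n > 1;
}
```

Note that under this rule ALLOW_INDIRECT combined with either REQUIRE flag
also conflicts, because INDIRECT carries the NONCONTIGUOUS bit.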
>
> Possible pseudo-flags:
>
> Py_BUF_SIMPLE = 0;
> Py_BUF_ALLOW_STRIDED = Py_BUF_ALLOW_NONCONTIGUOUS
>                        | Py_BUF_ALLOW_MULTIDIMENSIONAL;
>
> ------
>
> Now, for each flag, there should be an associated function to test the 
> condition, given a bufferinfo struct.  (Though I suppose they don't 
> necessarily have to map one-to-one, I'll do that here.)
>
> int PyBufferInfo_IsReadonly(struct bufferinfo*);
> int PyBufferInfo_IsWritable(struct bufferinfo*);
> int PyBufferInfo_IsContiguous(struct bufferinfo*);
> int PyBufferInfo_IsMultidimensional(struct bufferinfo*);
> int PyBufferInfo_IsIndirect(struct bufferinfo*);
> int PyBufferInfo_IsRowMajor(struct bufferinfo*);
> int PyBufferInfo_IsColumnMajor(struct bufferinfo*);
>
> The function PyObject_GetBuffer then has a pretty obvious 
> implementation.  Here is an excerpt:
>
>     if ((flags & Py_BUF_REQUIRE_READONLY) &&
>             !PyBufferInfo_IsReadonly(&bufinfo)) {
>         PyErr_SetString(PyExc_BufferError,"buffer not read-only");
>         return 0;
>     }
>
> Pretty straightforward, no?
>
> Now, here is a key point: for these functions to work (indeed, for 
> PyObject_GetBuffer to work at all), you need enough information in 
> bufinfo to figure it out.  The bufferinfo struct should be 
> self-contained; you should not need to know what flags were passed to 
> PyObject_GetBuffer in order to know exactly what data you're looking at.
Naturally.
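
A minimal sketch of what "self-contained" buys us: every check below is
answered from the struct alone, and the flags only decide which answers are
acceptable. The struct layout, the flag names and values, and check_request()
are all assumptions for this example, not the PEP's definitions, and the
contiguity test is left out to keep it short.

```c
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
enum {
    Py_BUF_SIMPLE            = 0,
    Py_BUF_REQUIRE_WRITEABLE = 1 << 0,
    Py_BUF_REQUIRE_READONLY  = 1 << 1,
    Py_BUF_ALLOW_ND          = 1 << 2,
    Py_BUF_ALLOW_INDIRECT    = (1 << 3) | (1 << 2)  /* implies ND */
};

/* Hypothetical layout; field names are assumptions. */
struct bufferinfo {
    void *buf;
    ptrdiff_t len;
    int readonly;
    int ndim;
    ptrdiff_t *shape;
    ptrdiff_t *strides;
    ptrdiff_t *suboffsets;   /* NULL when the buffer is direct */
};

static int is_readonly(const struct bufferinfo *b) { return b->readonly != 0; }
static int is_multidimensional(const struct bufferinfo *b) { return b->ndim > 1; }
static int is_indirect(const struct bufferinfo *b) { return b->suboffsets != NULL; }

/* Returns NULL if the buffer satisfies the request, else an error message. */
static const char *check_request(const struct bufferinfo *b, int flags)
{
    if ((flags & Py_BUF_REQUIRE_READONLY) && !is_readonly(b))
        return "buffer not read-only";
    if ((flags & Py_BUF_REQUIRE_WRITEABLE) && is_readonly(b))
        return "buffer not writeable";
    if (!(flags & Py_BUF_ALLOW_ND) && is_multidimensional(b))
        return "buffer is multidimensional";
    if ((flags & Py_BUF_ALLOW_INDIRECT) != Py_BUF_ALLOW_INDIRECT && is_indirect(b))
        return "buffer is indirect";
    return NULL;
}
```

A real implementation would presumably set a Python exception instead of
returning a message; the string return just keeps the sketch free of Python.h.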

>
>
> Therefore, format must always be supplied by getbuffer.  You cannot 
> tell if an array is contiguous without the format string.  (But see 
> below.)

No, I don't think this is quite true.   You don't need to know what 
"kind" of data you are looking at if you don't get strides.  If you use 
the SIMPLE interface, then both consumer and exporter know the object is 
looking at "bytes" which always has an itemsize of 1.
>
> And even if the consumer isn't asking for a contiguous buffer, it has 
> to know the item size so it knows what data not to step on.
>
> (This is true even in your own proposal, BTW.  If a consumer asks for 
> a non-strided array in your proposal, PyObject_GetBuffer would have to 
> know the item size to determine if the array is contiguous.)
Yes, it is true that getting strides requires that the format be 
specified as well.  That was an oversight of the original proposal.   
But, if strides are not needed, then format is also not needed.
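
To make the dependency concrete, here is a small sketch that computes which
bytes a strided buffer covers, relative to the buf pointer. The function name
and signature are invented for this example. The point is that the upper bound
cannot be computed from shape and strides alone; the item size (directly, or
derived from the format) is required, whereas in the simple case len already
says everything.

```c
#include <stddef.h>

/* Hypothetical helper: computes the half-open byte range [*lo, *hi),
   relative to the buffer pointer, that the described array can touch. */
static void strided_extent(int ndim, const ptrdiff_t *shape,
                           const ptrdiff_t *strides, ptrdiff_t itemsize,
                           ptrdiff_t *lo, ptrdiff_t *hi)
{
    ptrdiff_t min_off = 0, max_off = 0;
    for (int i = 0; i < ndim; i++) {
        if (shape[i] == 0) {        /* empty array touches nothing */
            *lo = *hi = 0;
            return;
        }
        /* Offset of the last element along this dimension. */
        ptrdiff_t reach = (shape[i] - 1) * strides[i];
        if (reach < 0)
            min_off += reach;       /* negative strides extend downward */
        else
            max_off += reach;
    }
    *lo = min_off;
    *hi = max_off + itemsize;       /* last element occupies itemsize bytes */
}
```

For a 2x3 array of 8-byte items with strides (24, 8) this gives the range
[0, 48), exactly 6 items, so there are no gaps. A 4-element array with stride
16 and 8-byte items gives [0, 56) while holding only 32 bytes of data, so a
consumer has to know the item size to avoid stepping on the gaps.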
>
>
> ------
>
> FAQ:
>
> Q. Why ALLOW_NONCONTIGUOUS and ALLOW_MULTIDIMENSIONAL instead of 
> ALLOW_STRIDED and ALLOW_SHAPED?
>
> A. It's more useful to the consumer that way.  With ALLOW_STRIDED and 
> ALLOW_SHAPED, there's no way for a consumer to request a general 
> one-dimensional array (it can only request a non-strided 
> one-dimensional array), and requesting a SHAPED array but not a 
> STRIDED one can only return a C-like (row-major) array, although a 
> consumer might reasonably want a Fortran-like (column-major) array.  
> This approach maps more directly to the consumer's needs, is more 
> flexible, and still maintains the same functionality of ALLOW_SHAPED 
> and ALLOW_STRIDED.
>
>
> Q. Why call it ALLOW_INDIRECT instead of ALLOW_OFFSETS?
>
> A. It's just a name, and not too important to me, but I wanted to 
> emphasize the consumer's usage, rather than the benefit to the 
> exporter.  The consumers, after all, are the ones setting the flags.
>
>
> Q. Why ALLOW_NONCONTIGUOUS instead of REQUIRE_CONTIGUOUS?
>
> A. Two reasons: 1. Contiguous arrays are "simpler", so it's better to 
> make the people who want more complex arrays work harder, and 2. 
> ALLOW_NONCONTIGUOUS is closely tied to ALLOW_MULTIDIMENSIONAL.  If the 
> negative is a problem, perhaps a name like ALLOW_DISCONTINUOUS or 
> ALLOW_GAPS would be better?
>
>
> Q. What about Py_BUF_FORMAT?
>
> A. Ok, fine, if it's that important to you.  I think it's totally 
> superfluous, but it's not evil.  But consider these things:
>
> 1. Require that it does not throw an exception.  It's not the 
> exporter's business to tell the consumer how to use its data.
Look, consumers that want to be "in-charge" can just ask for format data 
and ignore it.   If an exporter wants to be persnickety about how its 
data is viewed, then it should be allowed to be.  Perhaps it has good 
reason.  It's just a matter of how much "work" it is to get the "wrong" 
view of the data.
>
> 2. Even if you don't supply the format string, you need to supply an 
> itemsize in struct bufferinfo, otherwise there is no way for a 
> consumer to determine if the array is contiguous, or to know (in 
> general) what data is being exported.  The itemsize must ALWAYS be 
> available.
Only if strides is provided and format isn't is itemsize actually 
needed.  But, we've added the itemsize field anyway.
>
> 3. Invert Py_BUF_FORMAT.  Use Py_BUF_DONT_NEED_FORMAT instead.  Make 
> the consumer that cares about performance ask for the optimization.  
> (You admit yourself that Py_BUF_FORMAT is part of the least common 
> denominator, so invert it.)
Either way.  I think the Py_BUF_FORMAT is easier because then 
Py_BUF_SIMPLE is just a numerical value of 0.

I'll update the PEP with my adaptation of your suggestions in a little 
while.

-Travis