[Python-Dev] Extended Buffer Interface/Protocol

Tue Mar 27 01:52:03 CEST 2007

Carl Banks wrote:

> ISTM that we are using the word "view" very differently.  Consider 
> this example:
>
> A = zeros((100,100))
> B = A.transpose()

You are thinking of NumPy's particular use case.  I'm thinking of a 
generic use case.  So, yes I'm using the word view in two different 
contexts.

In this scenario, NumPy does not even use the buffer interface.  It 
knows how to transpose its own objects and does so by creating a new 
NumPy object (with its own shape and strides space) that shares the 
data buffer pointed to by "A".
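
To make the NumPy case concrete, here is a minimal pure-Python sketch of 
a transpose that builds a new object with its own shape and strides 
while sharing the data.  This is not NumPy's actual implementation; the 
StridedArray class and its element-based strides are made up for 
illustration.

```python
class StridedArray:
    """Toy 2-D array: a flat data list plus its own shape and strides."""

    def __init__(self, data, shape, strides):
        self.data = data                # shared, never copied by transpose()
        self.shape = tuple(shape)
        self.strides = tuple(strides)   # counted in elements, not bytes

    def __getitem__(self, index):
        i, j = index
        return self.data[i * self.strides[0] + j * self.strides[1]]

    def transpose(self):
        # New object with reversed shape/strides over the same data.
        return StridedArray(self.data, self.shape[::-1], self.strides[::-1])


A = StridedArray(list(range(6)), shape=(2, 3), strides=(3, 1))
B = A.transpose()
```

Note that nothing resembling getbuffer is called here: B is built 
directly from A's known structure.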

Yes, I use the word "view" for this NumPy usage, but only in the context 
of NumPy.   In the PEP, I've been using the word "view" quite a bit more 
generically.

So, I don't think this is a good example because A.transpose() will 
never call getbuffer of the A object (it will instead use the known 
structure of NumPy directly).  So, let's talk about the generic 
situation instead of the NumPy specific one.

>
> I'd suggest the object returned by A.getbuffer should be called the 
> "buffer provider" or something like that.

I don't care what we call it.  I've been using the word "view" because 
of the obvious analogy to my use of view in NumPy.  When I had 
envisioned returning an actual object very similar to a NumPy array from 
the buffer interface it made a lot of sense to call it a view.  Now, I'm 
fine with calling it the "buffer provider".

>
> For the sake of discussion, I'm going to avoid the word "view" 
> altogether.  I'll call A the exporter, as before.  B I'll refer to as 
> the requestor.  The object returned by A.getbuffer is the provider.

Fine.  Let's use that terminology since it is new and not cluttered by 
other uses in other contexts.

> Having thought quite a bit about it, and having written several 
> abortive replies, I now understand it and see the importance of it.  
> getbuffer returns the object that you are to call releasebuffer on.  
> It may or may not be the same object as exporter.  Makes sense, is 
> easy to explain.

Yes, that's exactly all I had considered it to be.   Only now, I'm 
wondering if we need to explicitly release a lock on the shape, strides, 
and format information as well as the buffer location information.
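
A minimal sketch of that contract in Python, assuming a hypothetical 
exporter whose getbuffer hands back a separate provider object.  The 
class names, the lock counter, and the resize method are illustrative 
only and are not part of the PEP.

```python
class Provider:
    """Hypothetical object returned by getbuffer(); release goes through it."""

    def __init__(self, owner, data):
        self.owner = owner
        self.data = data
        self.released = False

    def releasebuffer(self):
        # Releasing twice is harmless in this sketch.
        if not self.released:
            self.released = True
            self.owner.locks -= 1


class Exporter:
    def __init__(self, nbytes):
        self._raw = bytearray(nbytes)
        self.locks = 0

    def getbuffer(self):
        self.locks += 1
        return Provider(self, self._raw)

    def resize(self, nbytes):
        # The data area may not move while any requestor holds a lock.
        if self.locks:
            raise BufferError("buffer is locked by a requestor")
        self._raw = bytearray(nbytes)


exporter = Exporter(16)
provider = exporter.getbuffer()
try:
    exporter.resize(32)
    resized_while_locked = True
except BufferError:
    resized_while_locked = False
provider.releasebuffer()
exporter.resize(32)   # allowed once the lock is gone
```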

>
> It's easy to see possible use cases for returning a different object.  
> A  hypothetical future incarnation of NumPy might shift the 
> responsibility of managing buffers from NumPy array object to a hidden 
> raw buffer object.  In this scenario, the NumPy object is the 
> exporter, but the raw buffer object the provider.
>
> Considering this use case, it's clear that getbuffer should return the 
> shape and stride data independently of the provider.  The raw buffer 
> object wouldn't have that information; all it does is store a pointer 
> and keep a reference count.  Shape and stride is defined by the exporter.

So, who manages the memory for the shape, strides, and isptr arrays?   
When a provider is created, do these need to be created as well, so 
that the shape and strides arrays are never deallocated while in use?

The situation I'm considering is a NumPy array of shape (2,3,3) for 
which some consumer (presumably from another package) obtains a 
provider and retains a lock on the memory for a while.  Should it also 
retain a lock on the shape and strides arrays?  Can the NumPy array 
re-assign its shape and strides while the provider has still not been 
released?

I would like to say yes, which means that the provider must supply its 
own copy of the shape and strides arrays.  This could be the policy.  
Namely, that the provider must supply the memory for the shape, strides, 
and format arrays, which is guaranteed for as long as a lock is held.  In 
the case of NumPy, that provider could create its own copy of the shape 
and strides arrays (or do it when the shape and strides arrays are 
re-assigned).
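
A rough sketch of this policy, assuming a hypothetical provider that 
snapshots the shape and strides when it is created.  ArrayExporter, 
ShapeOwningProvider, and reshape are invented names for illustration, 
and strides are counted in elements to keep the numbers small.

```python
class ShapeOwningProvider:
    """Keeps private copies of shape/strides for as long as it is alive."""

    def __init__(self, exporter):
        self.shape = tuple(exporter.shape)
        self.strides = tuple(exporter.strides)


class ArrayExporter:
    def __init__(self, shape, strides):
        self.shape = list(shape)
        self.strides = list(strides)

    def getbuffer(self):
        return ShapeOwningProvider(self)

    def reshape(self, shape, strides):
        # Overwrite in place, much as a C array might be reused.
        self.shape[:] = shape
        self.strides[:] = strides


arr = ArrayExporter(shape=(2, 3, 3), strides=(9, 3, 1))
provider = arr.getbuffer()
arr.reshape(shape=(3, 6), strides=(6, 1))   # exporter is free to change
```

After the reshape, the provider still reports the (2,3,3) layout it was 
created with, while the exporter has moved on to (3,6).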

>
>>> Second question: what happens if a view wants to re-export the 
>>> buffer? Do the views of the buffer ever change?  Example, say you 
>>> create a transposed view of a Numpy array.  Now you want a slice of 
>>> the transposed array.  What does the transposed view's getbuffer 
>>> export?
>>
>>
>> Basically, you could not alter the internal representation of the 
>> object while views which depended on those values were around.
>>
>> In NumPy, a transposed array actually creates a new NumPy object that 
>> refers to the same data but has its own shape and strides arrays.
>>
>> With the new buffer protocol, the NumPy array would not be able to 
>> alter its shape/strides or reallocate its data areas while views 
>> were being held by other objects.
>
>
> But requestors could alter their own copies of the data, no?  Back to 
> the transpose example: B itself obviously can't use the same "strides" 
> array as A uses.  It would have to create its own strides, right?

I don't like this example because B does have its own strides: it is a 
complete NumPy array.  I think we are talking about the same thing, and 
that is "who manages the memory" for the shape and strides (and 
format). 

I think the easiest solution is to say that the "provider" does.  It 
will manage that memory until the lock is released.  How the exporter 
object handles that is up to the exporter.  

>
> So, what if someone takes a slice out of B?  When calling B.getbuffer,
> does it get B's strides, or A's?
>
> I think it should get B's.  After all, if you're taking a slice of B, 
> don't you care about the slicing relative to B's axes?  I can't really 
> think of a use case for exporting A's stride data when you take a 
> slice of B, and it doesn't seem to simplify memory management, because 
> B has to make its own copies anyways.

We already have an implementation in NumPy for memory sharing and in 
that implementation each NumPy array controls its own shape and strides 
and only the data-location is shared. 

I don't like the terminology of "re-exporting" because it is 
unnecessarily confusing to me.   Naturally an object can both consume 
and export the buffer interface.  If it does this, then it will want to 
make its own copies of the shape and strides arrays so as not to rely 
on the "provider" for these.

>
> Here's what I think: the lock should only apply to the buffer itself, 
> and not to shape and stride data at all.  If the requestor wants to 
> keep its own copies of the data, it would have to malloc its own 
> storage for it.  I expect that this would be very rare.

What I'm worried about is unnecessarily preventing an exporter from 
changing its shape and strides because a consumer is holding a lock on 
its memory (but wouldn't need a lock on its shape and strides because 
it made its own copy).  How does the consumer signal that situation to 
the provider?

So, we can't avoid thinking about a lock on the shape and stride data.  
We can make a policy that the provider always owns the memory for the 
shape and stride and will not alter it until the lock is released.  Or 
we can say that the shape and stride information could change before the 
lock on the memory is released if the GIL is released (so make your own 
copies if you need to keep track of shape and stride).   Or we could 
have a separate lock on shape/stride.  But, we have to make some policy.

>
> As for the provider; I think that's between it and the exporter.  If the 
> exporter and provider know about each other, they shouldn't have any 
> problems managing memory together.

Sure, what the provider and exporter do is up to the exporting object.

>
>
>> Having such a thing as a view object would actually be nice because 
>> it could hold on to a particular view of data with a given set of 
>> shape and strides (whose memory it owns itself) and then the 
>> exporting object would be free to change its shape/strides 
>> information as long as the data did not change.
>
>
> What I don't understand is why it's important for the provider to 
> retain this data.  The requestor only needs the information when it's 
> created; it will calculate its own versions of the data, and will not 
> need the originals again, so there is no need for the provider to keep 
> them around.

That is certainly a policy we could enforce (and pretty much what I've 
been thinking).  We just need to make it explicit that the shape and 
strides provided are only guaranteed up until a GIL release (i.e. 
arbitrary Python code could change both the content and the location of 
these memory areas), so if you need them later, make your own copies.

If this were the policy, then NumPy could simply pass pointers to its 
stored shape and strides arrays when the buffer interface is called but 
then not worry about re-allocating these arrays before the "buffer" lock 
is released.   Another object could hold on to the memory area of the 
NumPy array but would have to store shape and strides information if it 
wanted to keep it. 

NumPy could also just pass a pointer to the char * representation of the 
format (which in NumPy would be stored in the dtype object) and would 
not have to worry about the dtype being re-assigned later.
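
A small sketch of what this policy would mean for a consumer, using 
Python lists to stand in for the borrowed C pointers.  LiveExporter and 
its methods are hypothetical names, and strides are counted in elements.

```python
class LiveExporter:
    def __init__(self, shape, strides):
        self.shape = list(shape)
        self.strides = list(strides)

    def getbuffer(self):
        # Hands out the live arrays themselves: no copy is made.
        return self.shape, self.strides

    def reshape(self, shape, strides):
        # The exporter does not wait for any lock before changing these.
        self.shape[:] = shape
        self.strides[:] = strides


arr = LiveExporter(shape=(2, 3, 3), strides=(9, 3, 1))
borrowed_shape, borrowed_strides = arr.getbuffer()

# A consumer that needs the layout later must copy it right away.
kept_shape = tuple(borrowed_shape)
kept_strides = tuple(borrowed_strides)

arr.reshape(shape=(9, 2), strides=(2, 1))
```

Here the borrowed references silently follow the exporter to the new 
(9,2) layout, and only the consumer's own copies still describe the 
(2,3,3) layout it originally received.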

>
>>> The reason I ask is: if things like "buf" and "strides" and "shape" 
>>> could change when a buffer is re-exported, then it can complicate 
>>> things for PIL-like buffers.  (How would you account for offsets in 
>>> a dimension that's in a subarray?)
>>
>>
>> I'm not sure what you mean, offsets are handled by changing the 
>> starting location of the pointer to the buffer.
>
>
>
> But to answer your question: you can't just change the starting 
> location because there are hundreds of buffers.  You'd either have to 
> change the starting location of each one of them, which is highly 
> undesirable, or to factor in an offset somehow.  (This was, in fact, 
> the point of the "derefoff" term in my original suggestion.)

I understand better now what your derefoff was doing.  I was missing 
the de-referencing that was going on.  Couldn't you still just store a 
pointer to the start of the array?  In other words, isn't your **isptr 
suggestion sufficient?  It seems to be.
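
For reference, here is how I read the offset-after-dereference idea, as 
a toy Python sketch.  The row layout, the get_item helper, and the 
deref_offset parameter are my own illustrative names and may not match 
the original suggestion exactly.

```python
# Four separately allocated row buffers, standing in for a PIL-like image
# that stores one pointer per row rather than one contiguous block.
rows = [bytearray(range(r * 10, r * 10 + 8)) for r in range(4)]


def get_item(row_pointers, deref_offset, i, j):
    """Follow the row pointer first, then apply the shared column offset."""
    row = row_pointers[i]
    return row[deref_offset + j]


# A view that starts at column 2 reuses the same row pointers and records
# a single offset instead of rewriting every pointer.
column_offset = 2
first = get_item(rows, column_offset, 1, 0)
```

The alternative of moving the starting location would mean building a 
new pointer for each of the rows, which is the cost being objected to.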

>
>
> Anyways, despite the miscommunications so far, I now have a very good 
> idea of what's going on.  We definitely need to get terms straight.  I 
> agree that getbuffer should return an object.  I think we need to 
> think harder about the case when requestors re-export the buffer.  
> (Maybe it's time to whip up some experimental objects?)

I'm still not clear what you are concerned about.   If an object 
consumes the buffer interface and then wants to be able to later export 
it to another, then from our discussion about the shape/strides and 
format information, it would have to maintain its own copies of these 
things, because it could not rely on the original provider (or exporter) 
to keep them around once the GIL is released.

This is the reason we would have to be very clear about the guaranteed 
persistence of the shape/strides and format memory whose pointers are 
returned through the proposed buffer interface.

Thanks for the discussion.  It is nice to have someone to talk with 
about these things.   A conversation always results in a better 
implementation.

-Travis