[Python-3000] patch: bytes object PyBUF_LOCKDATA read-only and immutable support

Tue Sep 11 21:02:41 CEST 2007

On 9/10/07, Travis E. Oliphant <oliphant at enthought.com> wrote:
> Guido van Rossum wrote:
> > I'd like to see Travis's response to this. It's setting a precedent
> > regarding locking objects in read-only mode; I haven't found other
> > examples of objects using LOCKDATA (the only mentions of it seem to be
> > rejecting it :). I keep getting confused by the two separate lock
> > counts (and I think in this version the comment is inconsistent with
> > the code). So I'm hoping Travis has a particular way in mind of
> > handling LOCKDATA that can be used as a template.
> >
> > Travis?
>
> The use case I had in mind comes about quite often in NumPy when you
> want to modify the data-area of an object which may have a
> non-contiguous chunk of memory, but the algorithm being used expects
> contiguous data.  Imagine, for example, that the exporting object is an
> image whose rows are stored in different segments.
>
> The consumer of the buffer interface, however, may be an extension
> module that does fast image-processing operations and requires
> contiguous data.  Because it wants to write the results back in to the
> memory area when it is done with the algorithm (which may be thread-safe
> and may release the GIL), it requests the object to lock its data to
> read-only so that other consumers do not try to get writeable buffers
> while it is processing.
>
> When the algorithm is done, it alone can write to the memory area and
> then when it releases the buffer, the original object will restore
> itself to being writeable.  Of course, the exporting object must support
> this kind of operation and not all objects will.  I expect the NumPy
> array object and the PIL to support it for example, and other
> media-centric objects.

Hm, so this is completely different from what I thought. It seems you
are describing the following:

1. acquire the buffer with LOCKDATA
2. copy the data out of the buffer into a scratch area
3. work on the scratch area
4. copy the data from the scratch area back into the buffer
5. release the buffer
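
To make the five steps concrete, here is a rough Python-level sketch.
It is only an analogy: a plain memoryview on a bytearray stands in for
the buffer request, and nothing here implements LOCKDATA itself. The
names process_in_place and transform are made up for illustration.

```python
def process_in_place(target, transform):
    """Run transform on a scratch copy of target, then write it back.

    target is assumed to be a bytearray; transform is assumed to
    modify the scratch bytearray it receives in place.
    """
    with memoryview(target) as view:       # 1. acquire the buffer
        scratch = bytearray(view)          # 2. copy the data out
        transform(scratch)                 # 3. work on the scratch area
        if len(scratch) != len(view):
            raise ValueError("transform must not change the length")
        view[:] = scratch                  # 4. copy the result back
    # 5. the buffer is released when the with block exits


def to_upper(buf):
    # Example transform: upper-case the scratch area in place.
    buf[:] = buf.upper()


data = bytearray(b"abc")
process_in_place(data, to_upper)
```

While the view is held, the bytearray refuses to be resized (it raises
BufferError), but it does nothing to stop other consumers from
obtaining writable views, which is the part an exclusive write lock
would have to add.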

I would call this an exclusive write lock, which is quite different
from the read lock interpretation implemented by Greg in his patch.
Could you add some language to PEP 3118 to clarify this usage? Or is
it already there? I admit to not having read it in full...

> It would probably be useful if the bytes object supported it because
> then other objects could use it as the memory area.    To do it
> correctly, the object exporting the interface must only allow locking if
> no other writeable interfaces have been exported (which it must keep
> track of) and then on release must check to see if the buffer that is
> being released is the one that locked its data.

Right. So it seems you would need a counter of outstanding
non-data-locked buffer requests and a single bit indicating whether
there's a data-locked request. (Rather than two counters like Greg's
patch currently uses.)

The hacker in me is already exploring the possibility of making the
count negative if there's a data-locked request; it sounds like the
valid transitions are:

0 -> 1 -> 2 -> ... (SIMPLE or WRITABLE get)
... -> 2 -> 1 -> ... (SIMPLE or WRITABLE release)
0 -> -1 (LOCKDATA get)
-1 -> 0 (LOCKDATA release)

Have I got that right? I think that you should only be able to request
LOCKDATA if there are no other readers *or* writers, but that SIMPLE
and WRITABLE clients should be able to coexist (any mess that creates
would be the requester's own fault). Any nonzero value here would
indicate that the buffer can't be moved.
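
A small model of that signed counter, written in Python only to check
the transitions listed above. This is a sketch, not the bytesobject.c
code; the class and method names (ExportCounter, get, release,
can_move) are hypothetical.

```python
class ExportCounter:
    """Hypothetical model of a single signed export counter.

    count == 0   no outstanding buffers
    count  > 0   that many SIMPLE or WRITABLE buffers
    count == -1  one LOCKDATA buffer
    """

    def __init__(self):
        self.count = 0

    def get(self, lockdata=False):
        if lockdata:
            # LOCKDATA is only granted when nothing else is exported.
            if self.count != 0:
                raise BufferError("cannot lock data: buffer is exported")
            self.count = -1
        else:
            # SIMPLE and WRITABLE requests may coexist with each other.
            if self.count < 0:
                raise BufferError("buffer is data-locked")
            self.count += 1

    def release(self, lockdata=False):
        if lockdata:
            if self.count != -1:
                raise RuntimeError("no data-locked buffer to release")
            self.count = 0
        else:
            if self.count <= 0:
                raise RuntimeError("no shared buffer to release")
            self.count -= 1

    def can_move(self):
        # Any nonzero value pins the memory in place.
        return self.count == 0
```

Releasing has to say which kind of buffer is being released, which
matches Travis's point that the exporter must check whether the buffer
being released is the one that locked its data.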

I note that the use case in the bsddb wrapper extension is a bit
different -- Greg suspects that BerkeleyDB won't like the data
changing while it is using it (e.g. it might violate its own invariant
if the key changes between the time its hash is computed and the time
it is written to disk). To ensure this, currently LOCKDATA is the only
option; but a classic read lock would allow multiple concurrent
readers (which is how Greg's patch to bytesobject.c interprets
LOCKDATA).

I think this needs to be clarified. Perhaps we need to separate more
clearly the type of access (read or write) and the amount of locking
desired (can others read? can others write?).
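
One way to picture that separation, as a sketch only: describe each
request by the access it wants plus what it tolerates from others, and
derive compatibility from those. Nothing below is in PEP 3118; the
Request class, the compatible function and the four example requests
are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    writes: bool            # type of access: False = read, True = write
    others_may_read: bool   # amount of locking: can others read?
    others_may_write: bool  # amount of locking: can others write?


def compatible(a, b):
    """Return True if requests a and b could be outstanding together."""
    def tolerated(holder, other):
        # Does holder permit the kind of access that other wants?
        if other.writes:
            return holder.others_may_write
        return holder.others_may_read
    return tolerated(a, b) and tolerated(b, a)


# Hypothetical encodings of the cases discussed in this thread.
SIMPLE = Request(writes=False, others_may_read=True, others_may_write=True)
WRITABLE = Request(writes=True, others_may_read=True, others_may_write=True)
READ_LOCK = Request(writes=False, others_may_read=True, others_may_write=False)
EXCLUSIVE = Request(writes=True, others_may_read=False, others_may_write=False)
```

Under this encoding the bsddb case is READ_LOCK (several can coexist,
writers are kept out) and Travis's case is EXCLUSIVE (nobody else gets
in), which are the two readings of LOCKDATA that are being conflated.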

(BTW The current implementation in bytesobject.c allows changing the
size as long as it fits within the allocated size; I think this is
probably too lenient, and begging for latent bugs.)

(Spelling alert: 'writeable' is apparently not an English word. I hope
it's not too late to rename the flag to PyBUF_WRITABLE. I've opened
http://bugs.python.org/issue1150 to track this.)

> For a real-life example, NumPy has a flag called UPDATEIFCOPY that is a
> slightly different implementation of the concept.   When this flag is
> set during conversion to an array, then if a copy must be made to
> satisfy the requirements, the original array is set as read-only and
> this special flag is set on the array.  When the copy is deleted, its
> memory is automatically copied (and possibly casted, etc.) back into the
> original array.  It is a nice abstraction of the concept of an output
> data area that was borrowed from Numarray and allows many things to be
> implemented very quickly in NumPy.

So in terms of locks, this effectively sets read *and* write locks on
the original object (since whatever you might read out of it may be
invalidated when the modified copy is written back). But how to
enforce that at the Python level? If we had something like this for
the bytes object, any *use* of the bytes object from Python (e.g.
iterating over it or indexing or slicing it) should be prohibited. Is
this reasonable?

> One of the main things people use the NumPy C-API for is to get a
> contiguous chunk of memory from an array in order to do processing in
> another language (such as C or Fortran).   It is nice to be able to
> specify that the result gets placed back into another chunk of memory
> (which may or may not be contiguous) in a unified fashion.   NumPy
> handles all the copying for you.
>
> My thinking was that many people will want to be able to get contiguous
> chunks of memory, do processing, and then copy the result back into a
> segment of memory from a buffer-exporting object which is passed into
> the routine as an output object.

This is probably common for numpy; for the bytes object, I expect that
it's all much simpler, since it's just a contiguous 1D array of
bytes...

> I'm not sure if my explanations are helpful.  Please let me know if I
> can explain further.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)