[Numpy-discussion] Re: Bytes Object and Metadata

Sun Mar 27 18:08:13 EST 2005

Hi Travis.

I'm quite possibly misunderstanding how you want to incorporate the
metadata into the bytes object, so I'm going to try to restate both of our
positions from the point of view of a third party who will be using
ndarrays.  Let's take Chris Barker's point of view with regard to
wxPython...

We all roughly agree which pieces of metadata are needed for arrays.  There
are a few persnicketies, and the names could vary.  I'll use your given
names:

    .data     (could be .buffer or .__array_buffer__)
    .shape    (could be .dimensions or .__array_shape__)
    .strides  (maybe .__array_strides__)
    .itemtype (could be .typecode or .__array_itemtype__)

Several other attributes can be derived (calculated) from those (isfortran,
iscontiguous, etc...), and we might need a few more, but we'll ignore those
for now.

In my proposal, Chris would write a routine like such:

    def version_one(a):
        data = a.data
        shape = a.shape
        strides = a.strides
        itemtype = a.itemtype
        # Cool code goes here

I believe you are suggesting Chris would write:

    def version_two(a):
        data = a
        shape = a.shape
        strides = a.strides
        itemtype = a.itemtype
        # Cool code goes here

Or if you have the .meta dictionary, Chris would write:

    def version_three(a):
        data = a
        shape = a.meta["shape"]
        strides = a.meta["strides"]
        itemtype = a.meta["itemtype"]
        # Cool code goes here

Of course Chris could save one line of code with:

    def version_two_point_one(data):
        # The argument itself is the buffer, so it is named "data" directly
        shape = data.shape
        strides = data.strides
        itemtype = data.itemtype
        # Cool code goes here

If I'm mistaken about your proposal, please let me know.  However if I'm
not mistaken, I think there are limitations with version_two and
version_three.

First, most of the existing buffer objects do not allow attributes to be
added to them.  With version_one, Chris could have data of type
array.array, Numarray.memory, mmap.mmap, __builtins__.str, the new
__builtins__.bytes type as well as any other PyBufferProcs supporting
object (and possibly sequence objects like __builtins__.list).

With version_two and version_three, something more is required.  In a few
cases like the __builtins__.str type you could add the necessary attributes
by inheritance.

In other cases like the mmap.mmap, you could wrap it with a
__builtins__.bytes object.  (That's assuming that __builtins__.bytes knows
how to wrap mmap.mmap objects...)

However, other PyBufferProcs objects like array.array will never allow
themselves to be wrapped by a __builtins__.bytes since they realloc their
memory and violate the promises that the __builtins__.bytes object makes. 
I think you disagree with me on this part, so more on that later in this
message.

For now I'll take your side, let's pretend that all PyBufferProcs
supporting objects could be made well enough behaved to wrap up in a
__builtins__.bytes object.  Do you really want to require that only
__builtins__.bytes objects are suitable for data interchange across
libraries?  This isn't explicitly stated by you, but since the
__builtins__.bytes object is the only common PyBufferProcs supporting
object that could define the metadata attributes, it would be the rule in
practice.  I think you're losing flexibility if you do it this way.  From
Chris's point of view it's basically the same amount of code for all three
versions above. 

Another consideration that might sway you is that the existing
N-Dimensional array packages could easily add attribute methods to
implement the interface, and they could do this without changing any part
of their implementation.  The .data attribute when requested would call a
"get method" that returns a buffer.  This allows user defined objects which
do not implement the PyBufferProcs protocol themselves, but which contain a
buffer inside of them to participate in the "ndarray protocol".  Neither
version_two nor version_three allows this - the object being passed
must *be* a buffer.
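
To make that concrete, here is a minimal sketch in present-day Python 3 of
an object that is not a buffer itself but still satisfies version_one.  The
class name GridWrapper and its private fields are hypothetical, invented
only for illustration; the four attribute names are the ones listed above.

```python
import array


class GridWrapper:
    """Hypothetical user-defined object that contains a buffer
    but does not implement the buffer protocol itself."""

    def __init__(self, rows, cols):
        self._rows = rows
        self._cols = cols
        # Internal storage under a different, private name
        self._storage = array.array('d', [0.0] * (rows * cols))

    @property
    def data(self):
        # "Get method" that hands back a buffer-supporting object
        return self._storage

    @property
    def shape(self):
        return (self._rows, self._cols)

    @property
    def strides(self):
        # Byte strides for a row-major (C order) layout
        item = self._storage.itemsize
        return (self._cols * item, item)

    @property
    def itemtype(self):
        return self._storage.typecode


def version_one(a):
    data = a.data
    shape = a.shape
    strides = a.strides
    itemtype = a.itemtype
    # Stand-in for the "cool code": report what was received
    return memoryview(data).nbytes, shape, strides, itemtype


result = version_one(GridWrapper(2, 3))
```

The consumer never needs to know whether the object it was handed is a
buffer or merely owns one.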

> 
> > The bytes object shouldn't create views from arbitrary other buffer
> > objects because it can't rely on the general semantics of the
> > PyBufferProcs interface.  The foreign buffer object might realloc
> > and invalidate the pointer for instance...  The current Python 
> > "buffer" builtin does this, and the results are bad.  So creating
> > a bytes object as a view on the mmap object doesn't work in the
> > general case.
> >  
> >
> This is a problem with the objects that expose the buffer interface.   
> The C-API could be more clear that you should not "reallocate" memory if 
> another array is referencing you.  See the arrayobject's resize method 
> for an example of how Numeric does not allow reallocation of the memory 
> space if another object is referencing it.     I suppose you could keep 
> track separately in the object of when another object is using your 
> memory, but the REFCOUNT works for this also (though it is not so 
> specific, and so you would miss cases where you "could" reallocate but 
> this is rarely used in arrayobject's anyway).
>

The reference count on the PyObject pointer is different than the number of
users using the memory.  In Python you could have:

    import array
    a = array.array('d', [1])
    b = a

The reference count on the array.array object is 2, but there are 0 users
working with the memory.  Given the existing semantics of the array.array
object, it really should be allowed to resize in this case.  Storing the
object in a dictionary would be another common situation that would
increase its refcount but shouldn't lock down the memory.
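
The distinction can be demonstrated with a short sketch.  It assumes
present-day CPython 3, where array.array counts exported buffer views
separately from its reference count and refuses to resize while a view is
outstanding; that behavior is not something the interpreter of this
message's era provides.

```python
import array
import sys

a = array.array('d', [1.0])
before = sys.getrefcount(a)

b = a                    # a second name for the same object
registry = {'key': a}    # storing it in a dictionary
after = sys.getrefcount(a)

# More references exist, but nobody is using the memory,
# so resizing is still allowed.
a.append(2.0)

# A memoryview is an actual user of the memory.
view = memoryview(a)
try:
    a.append(3.0)
    resize_blocked = False
except BufferError:
    resize_blocked = True

# Once the user lets go, the array may resize again.
view.release()
a.append(3.0)
```

The reference count goes up for the alias and the dictionary entry, yet
only the outstanding view blocks the resize.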

A good solution to this problem was presented with PEP-298, but no progress
seems to have been made on it.

    http://www.python.org/peps/pep-0298.html

To my memory, PEP-298 was in response to PEP-296.  I proposed PEP-296 to
create a good working buffer (bytes) object that avoided the problems of
the other buffer objects.  Several folks wanted to fix the other (non
bytes) objects where possible, and PEP-298 was the result.  A strategy like
this could be used to make array.array safe outside of the GIL.  Bummer
that it didn't get implemented.

> 
> Another idea is to fix the bytes object so it always regrabs the pointer 
> to memory from the object instead of relying on the held pointer in view 
> situations.
>

A while back, I submitted a patch [552438] like this to fix the
__builtins__.buffer object:

http://sourceforge.net/tracker/index.php?func=detail&aid=552438&group_id=5470&atid=305470

It was ignored for a bit, and during the quiet time I came to realize that
even if the __builtins__.buffer object was fixed, it still wouldn't meet my
needs.  So I proposed the bytes object, and this patch fell on the floor
(the __builtins__.buffer object is still broken).

The downside to this approach is that it only solves the problem for code
running with possession of the GIL.  It does solve the stale pointer problem
that is exposed by the __builtins__.buffer object, but if you release the
GIL in C code, all bets are off - the pointer can become stale again.

The promises that bytes tries to make about the lifetime of the pointer can
only be guaranteed by the object itself.  Just because bytes could wrap the
other object and grab the latest pointer when you need it doesn't mean that
the other object won't invalidate the pointer a split second later when the
GIL is released.  It is mere chance that the mmap object is well behaved
enough.  And even the mmap object can release its memory if someone closes
the object - again leading to a stale pointer.

>
> Metadata is such a light-weight "interface-based" solution.  It could be 
> as simple as attributes on the bytes object.  I don't see why you resist 
> it so much.    Imagine defining a jpeg file by a single bytes object 
> with a simple EXIF header metadata string.     If the bytes object 
> allowed the "bearmin" attributes you are describing then that would be 
> one way to describe an array that any third-party application could 
> support as much as they wanted.
>

Please don't think I'm offering you resistance.  I'm only trying to point
out some things that I think you might have overlooked.  Lots of people
ignore my suggestions all the time.  You'd be in good company if you did
too, and I wouldn't even hold a grudge against you.

Now let me be argumentative.  :-)  I've listed what I consider the
disadvantages above, but I guess I don't see any advantages of putting the
metadata on the bytes object.  In what way is:

    jpeg = bytes(<some data source>)
    jpeg.exif = <EXIF header metadata string>

better than:

    class record: pass
    jpeg = record()
    jpeg.data = <some data source, possibly bytes or something else>
    jpeg.exif = <EXIF metadata string>

The only advantage I see is that yours is a little shorter, but in any real
application, you were probably going to define an object of some sort to
add all the methods needed.  And as I showed above in version_one,
version_two, and version_three above, it's basically the same number of
lines for the consumer of the data.

There is nothing stopping a PyBufferProcs object like bytes from supporting
version_one above:

    jpeg = bytes(<some data source>)
    jpeg.data = jpeg
    jpeg.exif = <EXIF header metadata string>

But non PyBufferProcs objects can't play with version_two or version_three.

Incidentally, being able to add attributes to bytes means that it needs to
play nicely with the garbage collection system.  At that point, bytes is
basically a container for arbitrary Python objects.  That's additional
implementation headache.

> 
> It really comes down to being accepted by everybody as a standard.
> 

This I completely agree with.  I think the community will roll with
whatever you and Perry come to agree on.  Even the array.array object in
the core could be made to work either way.

If the decision you come up with makes it easy to add the interface to
existing array objects then everyone would probably adopt it and it would
become a standard.

This is the main reason I like the double underscore __*meta*__ names.  It
matches the similar pattern all over Python, and existing array packages
could add those without interfering with their existing implementation:

    class Numarray:
        #
        # lots of array implementing code
        #

        # Down here at the end, add the "well-known" interface
        # (I haven't embraced the @property decorator syntax yet.)

        def __get_shape(self):
            return self._shape
        __array_shape__ = property(__get_shape)

        def __get_data(self):
            # Note that they use a different name internally
            return self._buffer
        __array_data__ = property(__get_data)

        def __get_itemtype(self):
            # Perform an on-the-fly conversion from the class
            # hierarchy type to the struct module typecode that
            # most closely matches
            return self._type._to_typecode()
        __array_itemtype__ = property(__get_itemtype)

Changing class Numarray to a PyBufferProcs supporting object would be
harder.

The C version for Numeric3 arrays would be similar, and there is no wasted
space on a per instance basis in either case.

>
> One of the things, I want for Numeric3 is to be able to create an array 
> from anything that exports the buffer interface.  The problem, of course 
> is with badly-written extension modules that rudely reallocate their 
> memory even after they've shared it with someone else.   Yes, Python 
> could be improved so that this were handled better, but it does work 
> right now, as long as buffer interface exporters play nice.  
> 

I think the behavior of the array.array objects is pretty defensible.  It
is useful that you can extend those arrays to new sizes.  For all I know,
it was written that way before there was a GIL.  I think PEP-298 is a good
way to make the dynamic buffers more GIL friendly.

>
> This is the way to advertise the buffer interface (and buffer 
> object).    Rather than vague references to buffer objects being a 
> "bad-design" and a blight we should say:  objects wanting to export the 
> buffer interface currently have restrictions on their ability to 
> reallocate their buffers.  
> 

I agree.  The "bad-design" type of comments about "the buffer problem" on
python-dev have always annoyed me.  It's not that hard of a problem to
solve technically.

>
> > I would suggest that the bm.itemtype borrow its typecodes from
> > the Python struct module, but anything that everyone agreed on
> > would work.  
>
> I've actually tried to do this if you'll notice, and I'm sure I'll take 
> some heat for that decision at some point too.    The only difference 
> currently I think are long types (q and Q),  I could be easily persuaded 
> to change these typecodes too.   I agree that the typecode characters are 
> very simple and useful for interchanging information about type.  That 
> is a big reason why I am not "abandoning them"
>

The real advantage to the struct module typecodes comes in two forms. 
First and most important is that it's already documented and in place - a
de facto standard.  Second is that Python script code could use those
typecodes directly with the struct module to pull apart pieces of data. 
The disadvantage is that a few new typecodes would be needed...

I would even go as far as to recommend their '>' '<' prefix codes for
big-endian and little-endian for just this reason...
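
As a small sketch of that second point, script code could take an itemtype
string and hand it straight to the struct module.  The variable names here
are only illustrative, and the use of a byte-order prefix inside the
itemtype is the suggestion made above, not an agreed convention.

```python
import struct

# Three little-endian 16-bit integers, as a producer might store them
raw = struct.pack('<3h', 1, 2, 3)

# The consumer only needs the itemtype string to pull items apart
itemtype = '<h'
itemsize = struct.calcsize(itemtype)
values = [
    struct.unpack_from(itemtype, raw, i * itemsize)[0]
    for i in range(len(raw) // itemsize)
]

# Reading the same bytes with the wrong byte order gives different numbers,
# which is why carrying the prefix in the metadata is useful.
swapped = list(struct.unpack('>3h', raw))
```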

>
> I don't like this offset parameter.  Why doesn't the buffer just start 
> where it needs to?
>

Well if you stick with using the bytes object, you could probably get away
with this.  Effectively, the offset is encoded in the bytes object.  At
this point, I don't know if anything I said above was persuasive, but I
think there are other cases where you would really want this.  Does anyone
plan to support tightly packed (8 bits to a byte) bitmask arrays?  Object
arrays could be implemented on top of shared __builtins__.list objects, and
there is no easy way to create offset views into lists.
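
The list case can be sketched as follows.  ListView is a hypothetical
helper, and the strides and offset are counted in list items rather than
bytes, which is an assumption made for this illustration.  Since a list
cannot be sliced into a view that shares storage, the offset has to travel
alongside the data.

```python
class ListView:
    """Hypothetical strided view over a shared Python list."""

    def __init__(self, data, shape, strides, offset=0):
        self.data = data              # the shared list, never copied
        self.shape = tuple(shape)
        self.strides = tuple(strides)  # in items, not bytes
        self.offset = offset           # index of element (0, 0, ...)

    def _flat_index(self, index):
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("index out of range")
        return self.offset + sum(i * s for i, s in zip(index, self.strides))

    def __getitem__(self, index):
        return self.data[self._flat_index(index)]

    def __setitem__(self, index, value):
        self.data[self._flat_index(index)] = value


backing = list(range(12))

# The whole list seen as a 3x4 array
full = ListView(backing, (3, 4), (4, 1))

# The 2x2 block starting at row 1, column 2: offset = 1*4 + 2
corner = ListView(backing, (2, 2), (4, 1), offset=6)

# Writes through the view land in the shared list
corner[0, 1] = None
```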

> 
> It would be an easy thing to return an ArrayObject from an object that 
> exposes those attributes (and a good idea).
>

This would be wonderful.  Third party libraries could produce data that is
sufficiently ndarray like without hassle, and users of that library could
promote it to a Numeric3 array with no headaches.
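
A rough sketch of what such a promotion might look like, written in
present-day Python 3 with nested lists standing in for the real array type.
The function to_nested_list and the class ThirdParty are hypothetical; only
the four attribute names come from the discussion above, and strides are
assumed to be in bytes.

```python
import array
import struct


def to_nested_list(obj):
    """Build nested lists from any object exposing
    .data, .shape, .strides and .itemtype."""
    buf = memoryview(obj.data).cast('B')  # raw bytes of the buffer
    fmt = obj.itemtype                    # struct module typecode

    def build(dim, byte_offset):
        if dim == len(obj.shape):
            return struct.unpack_from(fmt, buf, byte_offset)[0]
        step = obj.strides[dim]
        return [build(dim + 1, byte_offset + i * step)
                for i in range(obj.shape[dim])]

    return build(0, 0)


class ThirdParty:
    """Stand-in for an object produced by some other library."""


src = ThirdParty()
src.data = array.array('i', [1, 2, 3, 4, 5, 6])
item = src.data.itemsize
src.shape = (2, 3)
src.strides = (3 * item, item)
src.itemtype = 'i'
promoted = to_nested_list(src)

# The same memory described as its transpose, with no copy on the
# producer's side: only the shape and strides change.
src.shape = (3, 2)
src.strides = (item, 3 * item)
transposed = to_nested_list(src)
```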

>
> So, I pretty much agree with what you are saying.  I just don't see how 
> this is at odds with attaching metadata to a bytes object.   
> 
> We could start supporting this convention today, and also handle bytes 
> objects with metadata in the future.
>

Unfortunately, I don't think any buffer objects exist today which have the
ability to dynamically add attributes.  If my arguments above are
unpersuasive, I believe bytes (once it is written) will be the only buffer
object with this support.

By the way, it looks like the "bytes" concept has been revisited recently.
There is a new PEP dated Aug 11, 2004:

    http://www.python.org/peps/pep-0332.html

> 
> 
> > There is another really valid argument for using the strategy above to
> > describe metadata instead of wedging it into the bytes object:  The
> > Numeric community could agree on the metadata attributes and start 
> > using it *today*.  
> 
> I think we should just start advertising now, that with the new methods 
> of numarray and Numeric3, extension writers can right now deal with 
> Numeric arrays (and anything else that exposes the same interface) very 
> easily by using attribute access (or the buffer protocol together with 
> attribute access).    They can do this because Numeric arrays (and I 
> suspect numarrays as well) use the buffer interface responsibly (we 
> could start a political campaign encouraging responsible buffer usage 
> everywhere :-) ).
>

I can just imagine the horrible mascot that would be involved in the PR
campaign.

Thanks for your attention and patience with me on this.  I really
appreciate the work you are doing.  I wish I could explain my understanding
of things more clearly.

Cheers,
    -Scott