[Numpy-discussion] Re: Bytes Object and Metadata

Sat Mar 26 23:23:30 EST 2005

Scott Gilbert wrote:

>Adding metadata at the buffer object level causes problems for "view"
>semantics.  Let's say that everyone agreed what "itemsize" and "itemtype"
>meant:
>
>    real_view = complex_array.real
>
>The real_view will have to use a new buffer since they can't share the old
>one.  The buffer used in complex_array would have a typecode like
>ComplexDouble and an itemsize of 16.  The buffer in real_view would need a
>typecode of Double and an itemsize of 8.  If metadata is stored with the
>buffer object, it can't be the same buffer object in both places.
>  
>
This is where having "strides" metadata becomes very useful.  Then
real_view would not have to be a copy at all, unless the coder didn't
want to deal with it.
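
A rough sketch of the idea, using only the standard struct module.  The
StridedView class and its attribute names are made up for illustration
(they are not Numeric's or numarray's API); the point is only that a
stride of 16 bytes lets a "Double" view walk over a ComplexDouble buffer
without copying it.

```python
import struct

class StridedView:
    """Hypothetical 1-D view: a shared buffer plus itemtype, count and stride."""

    def __init__(self, buffer, itemtype, count, stride):
        self.buffer = buffer        # shared memory, never copied
        self.itemtype = itemtype    # struct-style typecode, e.g. "d"
        self.count = count          # number of visible elements
        self.stride = stride        # bytes between consecutive elements

    def _pos(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        return i * self.stride

    def __getitem__(self, i):
        return struct.unpack_from(self.itemtype, self.buffer, self._pos(i))[0]

    def __setitem__(self, i, value):
        struct.pack_into(self.itemtype, self.buffer, self._pos(i), value)

# Two complex doubles stored as interleaved (real, imag) pairs, 16 bytes each.
buf = bytearray(struct.pack("4d", 1.0, 2.0, 3.0, 4.0))

# The "real" view reads one 8-byte double, then skips 16 bytes to the next.
real_view = StridedView(buf, "d", count=2, stride=16)
real_view[1] = 30.0   # writes through to the shared buffer
```

After the assignment the underlying buffer holds 1.0, 2.0, 30.0, 4.0, so
the view and the complex data really are the same memory.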

>Another case would be treating a 512x512 image of 4 byte pixels as a
>512x512x4 image of 1 byte RGBA elements.  Or even coercing from Signed to
>Unsigned.
>  
>
Why not?  A different bytes object could point to the same memory, but
its different metadata would say "treat this data differently."
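
To make that concrete, here is a small sketch that uses the memoryview
type from later Python versions as a stand-in for the proposed bytes
object (an assumption on my part, since no such bytes object exists
yet).  Each view carries its own format and shape, but all of them
point at one block of memory.

```python
import struct

# A 2x2 "image" of 4-byte pixels, stored once.
height, width = 2, 2
pixels = bytearray(struct.pack("<4I",
                               0x04030201, 0x08070605,
                               0x0C0B0A09, 0x100F0E0D))

raw = memoryview(pixels)

# Same memory described as height x width x 4 one-byte RGBA elements.
as_rgba = raw.cast("B", shape=[height, width, 4])

# Same memory again, this time read as signed bytes.
as_signed = raw.cast("b")

# Change one byte in the original; every view sees it.
pixels[0] = 0xFF
```

Here as_rgba.tolist()[0][0] becomes [255, 2, 3, 4] and as_signed[0]
reads the same byte as -1.  Nothing was copied; only the metadata
differs.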

>
>The bytes object as proposed does allow new views to be created from other
>bytes objects (sharing the same memory underneath), and these views could
>each have separate metadata, but then you wouldn't be able to have arrays
>that used other types of buffers.  
>  
>
I don't see why not.  Your argument is not clear to me.

>The bytes object shouldn't create views from arbitrary other buffer objects
>because it can't rely on the general semantics of the PyBufferProcs
>interface.  The foreign buffer object might realloc and invalidate the
>pointer for instance...  The current Python "buffer" builtin does this, and
>the results are bad.  So creating a bytes object as a view on the mmap
>object doesn't work in the general case.
>  
>
This is a problem with the objects that expose the buffer interface.
The C-API could be clearer that you should not "reallocate" memory if
another array is referencing you.  See the arrayobject's resize method
for an example of how Numeric does not allow reallocation of the memory
space if another object is referencing it.  I suppose you could keep
track separately in the object of when another object is using your
memory, but the REFCOUNT works for this also (though it is not so
specific, so you would miss cases where you "could" reallocate, but
this is rarely used in arrayobjects anyway).

Another idea is to fix the bytes object so it always regrabs the pointer 
to memory from the object instead of relying on the held pointer in view 
situations.
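
As an illustration of the "don't reallocate while someone references
you" rule, later versions of CPython enforce exactly this for bytearray:
while a memoryview is exported, any operation that would resize the
bytearray raises BufferError.  This is only an analogy for what a
well-behaved exporter could do, not a description of how Numeric's
resize method is implemented.

```python
buf = bytearray(b"12345678")
view = memoryview(buf)      # another object now references buf's memory

try:
    buf.extend(b"9")        # growing would require a reallocation
    resized_while_shared = True
except BufferError:
    resized_while_shared = False   # the exporter refused, pointer stays valid

view.release()              # the outside reference is gone
buf.extend(b"9")            # now resizing is allowed again
```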

>Still, I think keeping the metadata at a different level, and having the
>bytes object just be the Python way to spell a call to C's malloc will
>avoid a lot of problems.  Read below for how I think the metadata stuff
>could be handled.
>  
>
Metadata is such a light-weight "interface-based" solution.  It could be
as simple as attributes on the bytes object.  I don't see why you resist
it so much.  Imagine defining a jpeg file by a single bytes object
with a simple EXIF header metadata string.  If the bytes object
allowed the "bearmin" attributes you are describing, then that would be
one way to describe an array that any third-party application could
support as much as it wanted.
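
A minimal sketch of what "attributes on the bytes object" could look
like.  TaggedBytes is a hypothetical name, and a bytearray subclass
stands in for the proposed bytes object.  A consumer that knows nothing
about the attributes still sees plain memory; one that does know can
pick up as much metadata as it cares about.

```python
class TaggedBytes(bytearray):
    """Hypothetical bytes-like object that accepts free-form metadata."""

data = TaggedBytes(16)       # 16 zero bytes
data.shape = (2, 2)          # optional metadata, stored as plain attributes
data.itemtype = "f"          # struct-style typecode for 4-byte floats

# A consumer that ignores metadata just uses the buffer.
nbytes = len(memoryview(data))

# A consumer that understands the convention asks for what it supports,
# and falls back gracefully on objects that carry no metadata.
shape = getattr(data, "shape", None)
plain_shape = getattr(bytearray(4), "shape", None)
```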

In short, I think we are thinking along similar lines.

It really comes down to being accepted by everybody as a standard.

One of the things I want for Numeric3 is to be able to create an array
from anything that exports the buffer interface.  The problem, of course,
is with badly written extension modules that rudely reallocate their
memory even after they've shared it with someone else.  Yes, Python
could be improved so that this were handled better, but it does work
right now, as long as buffer interface exporters play nice.

This is the way to advertise the buffer interface (and buffer
object).  Rather than vague references to buffer objects being a
"bad design" and a blight, we should say:  objects wanting to export the
buffer interface currently have restrictions on their ability to
reallocate their buffers.

>>    
>>
>
>I think being able to traffic in N-Dimensional arrays without requiring
>linking against the libraries is a good thing.
>  
>
Several of us are just catching on to the idea.  Thanks for your patience.

>I think the proposal is still relevant today, but I might revise it a bit
>as follows.  A bear minimum N-Dimensional array for interchanging data
>across libraries could get by with following attributes:
>
>    # Create a simple record type for storing attributes
>    class BearMin: pass
>    bm = BearMin()
>
>    # Set the attributes sufficient to describe a simple ndarray
>    bm.buffer = <a buffer or sequence object>
>    bm.shape = <a tuple of ints describing it's shape>
>    bm.itemtype = <a string describing the elements>
>
>The bm.buffer and bm.shape attributes are pretty obvious.  I would suggest
>that the bm.itemtype borrow it's typecodes from the Python struct module,
>but anything that everyone agreed on would work.  
>  
>
I've actually tried to do this, if you'll notice, and I'm sure I'll take
some heat for that decision at some point too.  The only difference
currently, I think, is the long types (q and Q).  I could be easily
persuaded to change these typecodes too.  I agree that the typecode
characters are very simple and useful for interchanging information
about type.  That is a big reason why I am not "abandoning them."
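
For reference, this is how the struct module's typecodes map to item
sizes.  The sketch uses the "<" prefix so that struct reports its
standard sizes rather than platform-native ones; the particular set of
codes is just a sample I picked, not a list of what Numeric3 supports.

```python
import struct

# A sample of struct typecodes, including the long types q and Q.
codes = "bBhHiIqQfd"

# "<" selects standard (platform-independent) sizes.
sizes = {code: struct.calcsize("<" + code) for code in codes}

# An itemtype string is enough to recover the itemsize.
itemsize_of_double = sizes["d"]
itemsize_of_long = sizes["q"]
```

Both "d" and "q" come out as 8 bytes here, and "i" and "f" as 4.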

>Those attributes are sufficient for someone to *produce* an N-Dimensional
>array that could be understood by many libraries.  Someone who *consumes*
>the data would need to know a few more:
>
>    bm.offset = <an integer offset into the buffer>
>  
>
I don't like this offset parameter.  Why doesn't the buffer just start
where it needs to?
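
Here is a sketch of what I mean by the buffer starting where it needs
to, again borrowing memoryview from later Python versions as a stand-in
(an assumption, not something available today).  Slicing produces a new
buffer object that begins at the desired byte and still shares memory,
so no separate offset has to travel with it.

```python
import struct

block = bytearray(struct.pack("<4d", 0.0, 1.0, 2.0, 3.0))

# Instead of passing (block, offset=16), hand out a buffer that
# already starts at byte 16.  No data is copied.
tail = memoryview(block)[16:]

first = struct.unpack_from("<d", tail)[0]   # element at the start of tail

# Writing through the sliced buffer changes the original block.
struct.pack_into("<d", tail, 0, 20.0)
```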

>    bm.strides = <a tuple of ints for non-contiguous or Fortran arrays>
>  
>
Things are moving in this direction (notice that Numeric3 has attributes
much like you describe), except we use the word .data (instead of .buffer).

It would be an easy thing to return an ArrayObject from an object that 
exposes those attributes (and a good idea).
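
A sketch of how a consumer might read elements from any object that
exposes these attributes.  The helper names (default_strides, read) are
hypothetical, the code accepts either .data or .buffer, and it treats
.offset and .strides as optional; none of this is Numeric3's actual
implementation.

```python
import struct
from types import SimpleNamespace

def default_strides(shape, itemsize):
    """Byte strides for a C-contiguous array of the given shape."""
    strides, step = [], itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

def read(obj, index):
    """Read one element using only the agreed-upon attributes."""
    buf = obj.data if hasattr(obj, "data") else obj.buffer
    itemsize = struct.calcsize(obj.itemtype)
    strides = getattr(obj, "strides", None) or default_strides(obj.shape, itemsize)
    pos = getattr(obj, "offset", 0) + sum(i * s for i, s in zip(index, strides))
    return struct.unpack_from(obj.itemtype, buf, pos)[0]

raw = struct.pack("<6i", 0, 1, 2, 3, 4, 5)

# Producer side: a 2x3 array needs only data, shape and itemtype.
bm = SimpleNamespace(data=raw, shape=(2, 3), itemtype="<i")

# Same memory described in Fortran (column-major) order via strides.
ft = SimpleNamespace(data=raw, shape=(2, 3), itemtype="<i", strides=(4, 8))

c_value = read(bm, (1, 0))        # row-major: element 3
f_value = read(ft, (1, 0))        # column-major: element 1
```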

So, I pretty much agree with what you are saying.  I just don't see how 
this is at odds with attaching metadata to a bytes object.   

We could start supporting this convention today, and also handle bytes 
objects with metadata in the future.

>There is another really valid argument for using the strategy above to
>describe metadata instead of wedging it into the bytes object:  The Numeric
>community could agree on the metadata attributes and start using it
>*today*.  
>  
>
Yes, but this does not mean we should not encourage the addition of 
metadata to bytes objects (as this has larger uses than just Numeric 
arrays).   It is not a difficult thing to support both concepts.

>If you wait until someone commits the bytes object into the core, it won't
>be generally available until Python version 2.5 at the earliest, and any
>libraries that depended on using bytes stored metadata would not work with
>older versions of Python.
>
>  
>

I think we should just start advertising now, that with the new methods 
of numarray and Numeric3, extension writers can right now deal with 
Numeric arrays (and anything else that exposes the same interface) very 
easily by using attribute access (or the buffer protocol together with 
attribute access).    They can do this because Numeric arrays (and I 
suspect numarrays as well) use the buffer interface responsibly (we 
could start a political campaign encouraging responsible buffer usage 
everywhere :-) ).

-Travis