[Numpy-discussion] Bytes Object and Metadata

Scott Gilbert xscottg at yahoo.com
Fri Mar 25 22:59:04 EST 2005


Adding metadata at the buffer object level causes problems for "view"
semantics.  Let's say that everyone agreed what "itemsize" and "itemtype"
meant:

    real_view = complex_array.real

If the metadata is stored with the buffer object, real_view can't reuse
the buffer object from complex_array: the buffer under complex_array would
carry a typecode like ComplexDouble and an itemsize of 16, while the
buffer under real_view would need a typecode of Double and an itemsize of
8.  The same buffer object can't be both at once, even though both views
refer to the same memory.

Another case would be treating a 512x512 image of 4 byte pixels as a
512x512x4 image of 1 byte RGBA elements.  Or even coercing from Signed to
Unsigned.
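
To make that concrete, here's a sketch (the View class and the typecode
strings are just made up for illustration) of the two descriptions that
would have to coexist on top of a single chunk of memory:

    class View: pass

    raw = 'x' * (512 * 512 * 16)        # stand-in for one shared buffer

    complex_array = View()
    complex_array.buffer = raw
    complex_array.itemtype = 'ComplexDouble'
    complex_array.itemsize = 16

    real_view = View()
    real_view.buffer = raw              # same memory underneath...
    real_view.itemtype = 'Double'       # ...but a different description
    real_view.itemsize = 8

If itemtype and itemsize lived on raw itself, only one of those
descriptions could win.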


The bytes object as proposed does allow new views to be created from other
bytes objects (sharing the same memory underneath), and these views could
each have separate metadata, but then you wouldn't be able to have arrays
that used other types of buffers.  Having arrays use mmap buffers is very
useful.

The bytes object shouldn't create views from arbitrary other buffer objects
because it can't rely on the general semantics of the PyBufferProcs
interface.  The foreign buffer object might realloc and invalidate the
pointer, for instance.  The current Python "buffer" builtin has exactly
this problem, and the results are bad.  So creating a bytes object as a
view on the mmap object doesn't work in the general case.
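
A sketch of the failure mode, using today's builtins (whether it prints
garbage or crashes depends on the platform):

    import array

    a = array.array('c', 'x' * 10)
    b = buffer(a)                  # grabs a pointer into a's storage
    a.fromstring('y' * 100000)     # a reallocs; b's pointer can dangle
    print b[:10]                   # garbage or a crash if you're unlucky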

Actually, now that I think about it, the mmap object might be safe.  I
don't believe the current implementation of mmap does any reallocing
behind the scenes, and I think the pointer stays valid for the lifetime of
the object.  If we verified that mmap is safe enough, bytes could make a
special case out of it, but then you would be locked into bytes and mmap
only.  Maybe that's acceptable...

Still, I think keeping the metadata at a different level, and having the
bytes object just be the Python way to spell a call to C's malloc will
avoid a lot of problems.  Read below for how I think the metadata stuff
could be handled.


--- Chris Barker <Chris.Barker at noaa.gov> wrote:
> 
> There are any number of Third party extensions that could benefit from 
> being able to directly read the data in Numeric* arrays: PIL, wxPython, 
> etc. Etc. My personal example is wxPython:
> 
> At the moment, you can pass a Numeric or numarray array into wxPython, 
> and it will be converted to a wxList of wxPoints (for instance), but 
> that is done by using the generic sequence protocol, and a lot of type 
> checking. As you can imagine, that is pretty darn slow, compared to just 
> typecasting the data pointer and looping through it. Robin Dunn, quite 
> reasonably, doesn't want wxPython to depend on Numeric, so that's what 
> we've got.
> 
> My understanding of this memory object is that an extension like
> wxPython wouldn't need to know about Numeric, but could simply get
> the memory Object, and there would be enough meta-data with it to 
> typecast and loop through the data. I'm a bit skeptical about how this 
> would work. It seems that the metadata required would be the full set of 
> stuff in an array Object already:
> 
> type
> dimensions
> strides
> 
> This could be made a bit simpler by allowing only contiguous arrays, but 
> then there would need to be a contiguous flag.
> 
> To make use of this, wxPython would have to know a fair bit about 
> Numeric Arrays anyway, so that it can check to see if the data is 
> appropriate. I guess the advantage is that while the wxPython code would 
> have to know about Numeric arrays, it wouldn't have to include Numeric 
> headers or code.
> 

I think being able to traffic in N-Dimensional arrays without requiring
linking against the libraries is a good thing.

Several years ago, I proposed a solution to this problem.  Actually I did a
really poor job of proposing it and irritated a lot of people in the
process.  I'm embarrassed to post a link to the following thread, but here
it is anyway:

    http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/1166013

Accept my apologies if you read the whole thing just now.  :-)  Accept my
sincere apologies if you read it at the time.


I think the proposal is still relevant today, but I might revise it a bit
as follows.  A "bear minimum" N-Dimensional array for interchanging data
across libraries could get by with the following attributes:

    # Create a simple record type for storing attributes
    class BearMin: pass
    bm = BearMin()

    # Set the attributes sufficient to describe a simple ndarray
    bm.buffer = <a buffer or sequence object>
    bm.shape = <a tuple of ints describing its shape>
    bm.itemtype = <a string describing the elements>

The bm.buffer and bm.shape attributes are pretty obvious.  I would suggest
that bm.itemtype borrow its typecodes from the Python struct module, but
anything that everyone agreed on would work.  (The struct module is nice
because it is already documented and supports native and portable types of
many sizes in both byte orders.  It also supports composite struct types.)
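
For example (the sizes shown are for typical platforms):

    import struct

    print struct.calcsize('d')      # 8: native C double
    print struct.calcsize('>i')     # 4: big-endian 32 bit int
    print struct.calcsize('<h')     # 2: little-endian 16 bit int
    print struct.calcsize('4B')     # 4: e.g. one RGBA pixel as 4 bytes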

Those attributes are sufficient for someone to *produce* an N-Dimensional
array that could be understood by many libraries.  Someone who *consumes*
the data would need to know a few more:

    bm.offset = <an integer offset into the buffer>
    bm.strides = <a tuple of ints for non-contiguous or Fortran arrays>

The value of bm.offset would default to zero if it wasn't present, and the
tuple bm.strides could be generated from the shape assuming it was a C
style array.  Subscripting operations that returned non-contiguous views of
shared data could change bm.offset to non-zero.  Subscripting would also
affect bm.strides, and creating a Fortran style array would require
bm.strides to be present.
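
Generating the default strides is easy enough.  A sketch, just for
illustration:

    import struct

    def default_strides(shape, itemtype):
        # C style layout: the last dimension varies fastest
        stride = struct.calcsize(itemtype)
        strides = []
        for dim in reversed(shape):
            strides.insert(0, stride)
            stride = stride * dim
        return tuple(strides)

    print default_strides((512, 512), 'f')    # (2048, 4)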

You might also choose to add bm.itemsize in addition to bm.itemtype for
the cases where you can describe how big the elements are but can't
sufficiently describe what the data is using the agreed upon typecodes.
This would be uncommon.  The default for bm.itemsize would come from
struct.calcsize(bm.itemtype).

You might also choose to add bm.complicated for when the array layout can't
be described by the shape/offset/stride combination.  For instance,
bm.complicated might get used when creating views from more sophisticated
subscripting operations like index arrays or mask arrays, although it
looks like Numeric3 plans on making new contiguous copies in those cases.

The C implementations of arrays would only have to add getattr-like
methods, and the data could be stored very compactly.

From those minimum 5-7 attributes (metadata), an N-Dimensional array
consumer could determine almost everything it needed to know about the
data.  Simple routines could determine things like iscontiguous(bm),
iscarray(bm) or isfortran(bm).  I expect libraries like wxPython or PIL
could punt (raise an exception) when the water gets too deep.
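
For instance, iscontiguous could be as small as this sketch:

    import struct

    def iscontiguous(bm):
        # Check whether bm describes a dense C style layout
        itemsize = getattr(bm, 'itemsize', struct.calcsize(bm.itemtype))
        strides = getattr(bm, 'strides', None)
        if strides is None:
            return True              # the defaults imply C contiguous
        expected = itemsize
        for dim, stride in zip(reversed(bm.shape), reversed(strides)):
            if stride != expected:
                return False
            expected = expected * dim
        return True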

It also doesn't prohibit other attributes from being added.  Just because
an N-Dimensional array described its itemtype using the struct module
typecodes doesn't mean that it couldn't implement more sophisticated typing
hierarchies with a different attribute.

There are a few commonly used types like "long double" which are not
supported by the struct module, but this could be addressed with a little
discussion.  Also you might want a "bit" or "Object" typecode for tightly
packed mask arrays and Object arrays.

The names could be argued about, and something like:

    bm.__array_buffer__
    bm.__array_shape__
    bm.__array_itemtype__
    bm.__array_offset__
    bm.__array_strides__
    bm.__array_itemsize__
    bm.__array_complicated__

would really bring home the notion that the attributes are a description of
what it means to participate in an N-Dimensional array protocol.  Plus
names this long and ugly are unlikely to step on the existing attributes
already in use by Numeric3 and Numarray.  :-)
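
A consumer would then read the attributes off with getattr, applying the
defaults described above.  Something like:

    import struct

    def describe(obj):
        # Pull out the protocol attributes, filling in the defaults
        buf      = obj.__array_buffer__
        shape    = obj.__array_shape__
        itemtype = obj.__array_itemtype__
        offset   = getattr(obj, '__array_offset__', 0)
        itemsize = getattr(obj, '__array_itemsize__',
                           struct.calcsize(itemtype))
        strides  = getattr(obj, '__array_strides__', None)
        return buf, shape, itemtype, offset, itemsize, strides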



Anyway, I proposed this a long time ago, but the belief was that one of the
standard array packages would make it into the core very soon.  With a
standard array library in the core, there wouldn't be as much need for
general interoperability like this.  Everyone could just use the standard.

Maybe that position would change now that Numeric3 and Numarray both look
to have long futures.  Even if one package made it in, the other is likely
to live on.  I personally think the competition is a good thing.  We don't
need to have only one array package to get interoperability.

I would definitely like to see the Python core acquire a full fledged array
package like Numeric3 or Numarray.  When I log onto a new Linux or MacOS
machine, the array package would just be there.  No installs, no hassle. 
But I still think a simple community agreed upon set of attributes like
this would be a good idea.



--- Peter Verveer <verveer at embl.de> wrote:
> 
> I think it would be a real shame not to support non-contiguous data.
> It would be great if such a byte object could be used instead of 
> Numeric/numarray arrays when writing extensions. Then I could write C 
> extensions that could be made available very easily/efficiently to any 
> package supporting it without having to worry about the specific C api 
> of those packages. If only contiguous byte objects are supported that 
> byte object is not a good option anymore for implementing extensions 
> for Numeric unless I am prepared to live with a lot of copying of 
> non-contiguous arrays.
> 

I'm hoping I made a good case for a slightly different strategy above.  But
even if the metadata did go into the bytes object itself, the metadata
could describe a non-contiguous layout on top of the contiguous chunk of
memory.
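
For example, the metadata alone can describe a single column of a C
contiguous array without copying anything (the names are again just
illustrative):

    class View: pass

    big = View()                       # a 512x512 array of C doubles
    big.buffer = 'x' * (512 * 512 * 8)
    big.shape = (512, 512)
    big.itemtype = 'd'

    col = View()                       # its third column, sharing memory
    col.buffer = big.buffer
    col.shape = (512,)
    col.itemtype = 'd'
    col.offset = 2 * 8                 # start at element [0, 2]
    col.strides = (512 * 8,)           # step one full row per element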

There is another really valid argument for using the strategy above to
describe metadata instead of wedging it into the bytes object:  The Numeric
community could agree on the metadata attributes and start using them
*today*.

If you wait until someone commits the bytes object into the core, it won't
be generally available until Python version 2.5 at the earliest, and any
libraries that depended on bytes-stored metadata would not work with older
versions of Python.




Cheers,
    -Scott





