[Numpy-discussion] Re: Bytes Object and Metadata

Mon Mar 28 16:00:14 EST 2005

>I wish I had time to do a good writeup, but I need to catch a flight in a
>couple hours, and I won't be back behind my computer until Wednesday night.
> Here is an initial stab:
>
>  __array_shape__   
>       Required, a sequence (typically tuple) of non-negative int/longs
>  
>
Great. I agree.

>  __array_storage__
>       Required, a buffer or possibly sequence object (list)
>
>       (Required unless the object support PyBufferProcs directly?
>        I don't have a strong opinion on that one...)
>
>       A slightly different name to indicate it could be a buffer or
>       sequence object (like a list).  Typically buffer.
>  
>
I prefer __array_data__ (it's a common name in both Numeric and
numarray, and it can be interpreted as a sequence object if desired).

>  __array_itemtype__
>       Suggested, but Optional if __array_itemsize__ is present.
>  
>
I say this one defaults to "V" (void *) if not present. And
__array_itemsize__ is necessary if it is "S" (string), "U" (unicode), or "V".

I also like __array_typestr__  or __array_typechar__ better as a name.

>       A struct module format string or one of the additional ones
>       that needs to be added.  Need to discuss "long double" and
>       "Object".  (Capital 'O' for Object, Captial 'D' for long double,
>       Capital 'X' for bit?)
>  
>
I don't like 'D' for long double; complex floats already use it.
I'm not sure I like the idea of moving to two-character typecodes at
this point, because it implies more internal changes to Numeric3
(otherwise we would have two typecharacter standards, which is not a good
thing). What is wrong with 'g' and 'G' for long double and complex
long double, respectively?

>       If not present or the empty string '', indicates that the
>       array elements can only be treated as blobs and the real
>       data representation must be gotten from some other means.
>  
>
Again, a void * type handles this well.

>       The struct module convention for denoting native, portable
>       big endian, and portable little endian is concise and documented.
>  
>
So you think we should put the byte order in the typecharacter
interface? I don't know... I could be persuaded.
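
For reference, a minimal sketch of the struct module convention being
discussed, in which a one-character prefix on the format string selects
the byte order. The variable names are only illustrative and are not
part of any proposal:

```python
import struct

# Byte-order prefixes from the struct module:
#   '<' little endian, '>' big endian, '=' native order (standard sizes)
value = 1

little = struct.pack('<i', value)  # 4-byte int, least significant byte first
big = struct.pack('>i', value)     # 4-byte int, most significant byte first

print(little)  # b'\x01\x00\x00\x00'
print(big)     # b'\x00\x00\x00\x01'

# With an explicit prefix the item size is fixed, independent of platform.
print(struct.calcsize('<i'), struct.calcsize('>d'))  # 4 8
```

Putting the prefix in the type string would mean a consumer needs no
separate byte-order attribute, at the cost of parsing the string.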

>  __array_itemsize__
>       Optional if __array_itemtype is present and the value can 
>       calculated from struct.calcsize(__array_itemtype__)
>  
>
I think it is only optional if typechar is not 'S', 'U', or 'V'.
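
A rough sketch of that rule, assuming the type string uses only codes
the struct module already understands, plus 'S', 'U', and 'V'. The
helper name resolve_itemsize is hypothetical:

```python
import struct

# Type characters with no intrinsic size: string, unicode, void.
VARIABLE_SIZE = ('S', 'U', 'V')


def resolve_itemsize(typestr='V', itemsize=None):
    """Return the item size in bytes, or raise if it cannot be determined.

    Hypothetical helper: the size is computed from the type string when
    possible, and an explicit itemsize is required for 'S', 'U', and 'V'.
    """
    code = typestr.lstrip('<>=@!')  # drop an optional byte-order prefix
    if code in VARIABLE_SIZE:
        if itemsize is None:
            raise ValueError("itemsize is required for type %r" % code)
        return itemsize
    computed = struct.calcsize(typestr)
    if itemsize is not None and itemsize != computed:
        raise ValueError("itemsize %r disagrees with %r" % (itemsize, typestr))
    return computed


print(resolve_itemsize('<i'))     # 4
print(resolve_itemsize('>d'))     # 8
print(resolve_itemsize('V', 12))  # 12
```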

>  __array_strides__
>       Optional if the array data is in a contiguous C layout.
>       Required otherwise.  Same length as __array_shape__.
>       Indicates how much to multiply subscripts by to get to
>       the desired position in the storage.
>
>       A sequence (typically tuple) of ints/longs.  These are in
>       byte offsets (not element_size offsets) for most arrays.
>       Special exceptions made for:
>           Tightly packed (8 bits to a byte) bitmask arrays, where
>           they offsets are bit indexes
>
>           PyObject arrays (lists) where the offsets are indexes
>
>       They should be byte offsets to handle non-aligned data or data
>       with odd packing.
>     
>       Fortran arrays might be common enough to warrant special casing.
>       We could discuss whether a __array_fortran__ attribute indicates
>       that the array is in contiguous Fortran layout
>  
>
I don't think it is necessary in the interface.
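
To illustrate why strides alone seem sufficient, here is a small
sketch showing that both C and Fortran contiguous layouts can be
written as byte strides. The function names are made up for this
example:

```python
def c_strides(shape, itemsize):
    """Byte strides for a C-contiguous array (last index varies fastest)."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def fortran_strides(shape, itemsize):
    """Byte strides for a Fortran-contiguous array (first index varies fastest)."""
    strides = []
    step = itemsize
    for dim in shape:
        strides.append(step)
        step *= dim
    return tuple(strides)


def byte_position(index, strides, offset=0):
    """Byte position of an element: offset + sum(index[k] * strides[k])."""
    return offset + sum(i * s for i, s in zip(index, strides))


shape, itemsize = (2, 3), 8
print(c_strides(shape, itemsize))        # (24, 8)
print(fortran_strides(shape, itemsize))  # (8, 16)

# The same lookup works for either layout; only the strides differ.
print(byte_position((1, 0), c_strides(shape, itemsize)))        # 24
print(byte_position((1, 0), fortran_strides(shape, itemsize)))  # 8
```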

>  __array_offset__
>       Optional and defaults to zero.  An int/long indicating the offset
>       to treat as the zeroth element
>
>  __array_complicated__
>       Optional and defaults to zero/false.  This is a kluge to indicate
>       that while yes the data is an array, the storage layout can not
>       be easily described by the shape/strides/offset combination alone.
>
>       This could warrant some discussion.
>  
>
I guess I don't see the utility here. If it can't be described by a
shape/strides combination, then how can it participate in the protocol?

>  __array_fortran__
>       Optional and defaults to zero/false.  If you want to represent
>       Fortran arrays without creating a strides for them, this would
>       be necessary.  I'd vote to leave it out and stick with strides...
>
>  
>
Me too. We should make the interface as minimal as possible, initially.

My proposal:

__array_data__ (optional; an object that exposes the PyBuffer protocol or a
sequence object; if not present, the object itself is used)
__array_shape__ (required; a tuple of int/longs that gives the shape of the
array)
__array_strides__ (optional; how to step through the memory, in
bytes, or in bits for a bit-array; default is C-contiguous)
__array_typestr__ (optional; a struct-like string giving the type:
optional endianness indicator + Numeric3 typechars; default is 'V')
__array_itemsize__ (required if the above is 'S', 'U', or 'V')
__array_offset__ (optional; offset to the start of the buffer; defaults to 0)

So, you could define an array interface with only two additional 
attributes if your object exposed the buffer or sequence protocol. 
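
As a sketch only, here is how a consumer might read these attributes
and apply the defaults listed above. It is written in present-day
Python, handles only type strings the struct module understands, and
the names read_interface and Grid are hypothetical:

```python
import struct


def read_interface(obj):
    """Collect the proposed attributes from obj, filling in the defaults."""
    shape = tuple(obj.__array_shape__)          # required
    data = getattr(obj, '__array_data__', obj)  # default: the object itself
    typestr = getattr(obj, '__array_typestr__', 'V')
    itemsize = getattr(obj, '__array_itemsize__', None)
    if itemsize is None:
        if typestr.lstrip('<>=@!') in ('S', 'U', 'V'):
            raise ValueError("__array_itemsize__ is required for %r" % typestr)
        itemsize = struct.calcsize(typestr)
    offset = getattr(obj, '__array_offset__', 0)
    strides = getattr(obj, '__array_strides__', None)
    if strides is None:
        # Default to C-contiguous byte strides.
        strides, step = [], itemsize
        for dim in reversed(shape):
            strides.insert(0, step)
            step *= dim
        strides = tuple(strides)
    return {'data': data, 'shape': shape, 'typestr': typestr,
            'itemsize': itemsize, 'offset': offset, 'strides': strides}


class Grid(bytes):
    """bytes already exposes a buffer, so only two attributes are added."""
    __array_shape__ = (2, 3)
    __array_typestr__ = '<i'


grid = Grid(struct.pack('<6i', 0, 1, 2, 3, 4, 5))
info = read_interface(grid)

# Fetch element [1][2] using offset and strides.
pos = info['offset'] + 1 * info['strides'][0] + 2 * info['strides'][1]
element = struct.unpack_from(info['typestr'], info['data'], pos)[0]
print(info['strides'], element)  # (12, 4) 5
```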

We should figure out a way to work around the 32-bit limitations of the 
sequence and buffer protocols as well. 

-Travis