[Numpy-discussion] Array Metadata

Scott Gilbert xscottg at yahoo.com
Thu Mar 31 20:14:15 EST 2005


I got back late last night, and there were lots of things I wanted to
comment on.  I've put parts of several threads into this one message since
they're all dealing with the same general topic:


Perry Greenfield wrote:
>
> I'm not sure how the support for large data sets should be handled.
> I generally think that it will be very awkward to handle these
> until Python does as well. Speaking of which...
>

I agree that it's going to be difficult to have general support for large
PyBufferProcs objects until the Python core is made 64 bit clean.  But
specific support can be added for buffer types that are known in advance. 
For instance, the bytes object PEP proposes an alternate way to get a 64
bit length, and similar support could easily be added to Numarray.memory,
mmap.mmap, and whatever else on a case-by-case basis.  So you could get a
64 bit pointer from some types of buffers before the rest of Python
becomes 64 bit clean.

If the ndarray consumer (wxWindows for instance) doesn't recognize the
particular implementation, it has to stick with the limitations of the
standard PyBufferProcs and assume a 32 bit length would suffice.


Travis Oliphant wrote:
>
> I prefer  __array_data__   (it's a common name for Numeric and
> numarray,  It can be interpreted as a sequence object if desired). 
>

So long as everyone agrees, it doesn't matter what the name is.  It
sounds like __array_data__ works for everyone.


>
> I also like __array_typestr__  or __array_typechar__ better as a name.
>

A name is a name as far as I'm concerned.  The name __array_typestr__ works
for me.  The name __array_typechar__ implies a single character, and that
won't be true.


>
> Don't like 'D' for long double.  Complex floats is already
> using it.  I'm not sure I like the idea of moving to two
> character typecodes at this point because it indicates more
> internal changes to Numeric3 (otherwise we have two typecharacter
> standards which is not a good thing).  What is wrong with 'g'
> and 'G'  for long double and complex long double respectively. 
>

Nothing in this array protocol should *require* internal changes to either
Numeric3 or Numarray.  I suspect Numarray is going to keep its type
hierarchy, and Numeric3 can use single character codes for its internal
representation if it wants.  However, both Numeric3 and Numarray might
(probably would) have to translate their internal array type specifiers
into the agreed upon "type code string" when reporting this attribute.

The important qualities __array_typestr__ should have are:

    1) Everyone should agree on the interpretation.  It needs to be
       documented somewhere.  Third party libraries should get the
       same __array_typestr__ from Numarray as they do from Numeric3.

    2) It should be sufficiently general in its capabilities to
       describe a wide category of array types.  Simple things
       should be simple, and harder things should be possible.

       An ndarray of doubles should have a simple, common, well
       recognized value for __array_typestr__.  An ndarray of
       multi-field structs should be representable too.
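
To make that concrete with the kind/width codes Travis proposes later in
this message (the exact spellings are still up for agreement):

    __array_typestr__ == "f8"      # an ndarray of C doubles
    __array_typestr__ == "i4f8"    # records of an int32 and a float64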

> 
> >
> >  __array_complicated__
> >
> 
> I don't see the utility here I guess,  If it can't be described by a 
> shape/strides combination then how can it participate in the protocol?
>

I'm not married to this one.  I don't know if Numarray or Numeric3 will
ever do such a thing, but I can imagine more complicated schemes of
arranging the data than offset/shape/strides are capable of representing. 
So this is forward compatibility with "Numarric4" :-).  Pretty
hypothetical, but imagine that Numarric4 can typically represent its
data with offset/shape/strides, but that for more advanced operations
that representation falls apart.  I could bore you with a detailed
example...

The idea is that if array consumers like wxPython were aware that more
complicated implementations can occur in the future, they could
gracefully bow out and raise an exception instead of incorrectly
interpreting the data.  If the flag isn't part of the protocol from the
start, it can't easily be added after the fact.

Take it or leave it I guess - it's possibly a YAGNI.


> 
> After more thought,  I think here we need to also allow the
> "c-type" independent way of describing an array (i.e. numarray
> introduced 'c4' for a complex-valued 4 byte itemsize array).
> So, perhaps __array_ctypestr__  and __array_typestr__ should be
> two ways to get the information (or overload the __array_typestr__
> interface and require consumers to accept either style).
>

I don't understand what you are proposing here.  Why would you want to
represent the same information two different ways?



Perry Greenfield wrote:
>
> I think we need to think about what the typecharacter is supposed
> to represent. Is it the value as the user will see it or to indicate
> what the internal representation is? These are two different things.
>

I think __array_typestr__ should accurately represent the internal
representation.  It is not intended for typical end users.  The whole of
the __array_*metadata*__ stuff is intended for third party libraries like
wxPython or PIL to be able to grab a pointer to the data, calculate
offsets, and cast it to the appropriate type without writing lots of
special case code to handle the differences between Numeric, Numarray,
Numeric3, and whatever else.
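
As a sketch of that consumer-side usage (the helper name is made up, and
it assumes the object reports explicit byte strides):

    def element_offset(a, index):
        # byte offset of a[index] from the start of __array_data__
        offset = getattr(a, "__array_offset__", 0)
        for i, stride in zip(index, a.__array_strides__):
            offset += i * stride
        return offset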


>
> Then again, I'm not sure how this info is exposed to the user; if it
> is appropriately handled by intermediate code it may not matter. For
> example, if this corresponds to what the user will see for the type,
> I think it is bad. Most of the time they don't care what the internal
> representation is, they just want to know if it is Int16 or whatever;
> with the two combined, they have to test for both variants.
>

Typical users would call whatever attribute or method you prefer (.type()
or .typecode() for instance), and the type representation could be classes
or typecodes or whatever you think is best.

The __array_typestr__ attribute is not for typical users (unless they start
to care about the details under the hood).  It's for libraries that need to
know what's going on in a generic fashion.  You don't have to store this
attribute as separate data; it can be a property style attribute that
calculates its value dynamically from your own internal representation.


Francesc Altet wrote:
> 
> Considering that heterogeneous data is to be supported as well, and
> there is some tradition of assigning names to the different fields,
> I wonder if it would not be good to add something like:
> 
> __array_names__ (optional comma-separated names for record fields)
> 

I really like this idea.  Although I agree with David M. Cooke that it
should be a tuple of names.  Unless there is a use case I'm not
considering, it would be preferable if the names were restricted to
valid Python identifiers.
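
For instance, an array of particle records might report something like
(field names hypothetical):

    __array_names__ = ("x", "y", "z", "mass")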


Travis Oliphant wrote:
> 
> After more thought,  I think using the struct-like typecharacters
> is not a good idea for the array protocol.    I think that the
> character codes used by the numarray record array:  kind_character 
> + byte_width is better.  Commas can separate heterogeneous data.
> The problem is that if the data buffer originally came from a
> different machine or saved with a different compiler (e.g. a mmap'ed
> file), then the struct-like typecodes only tell you the c-type that
> machine thought the data was.  It does not tell you how to interpret
> the data on this machine. 
>

The struct module has a portable set of typecodes.  They call it
"standard", but it's the same thing.  The struct module lets you specify
either standard or native.  For instance, a "standard long" ("=l") is
always 4 bytes, while a "native long" ("@l") is likely to be 4 or 8
bytes depending on the platform.  The __array_typestr__ codes should
require the "standard" sizes.  There is a table at the bottom of the
documentation that goes into detail:

    http://docs.python.org/lib/module-struct.html

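The distinction is easy to check:

    import struct

    struct.calcsize("=l")    # standard long: always 4
    struct.calcsize("@l")    # native long: 4 or 8, depending on the platform
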
The only problem with the struct module is that it's missing a few types...
(long double, PyObject, unicode, bit).


> 
> I also think that rather than attach < or > to the start of the
> string it would be easier to have another protocol for endianness.
> Perhaps something like:
> 
> __array_endian__  (optional Python integer with the value 1 in it).
> If it is not 1, then a byteswap must be necessary. 
>

This has the problem you were just describing.  Specifying "byteswapped"
like this only tells you if the data was reversed on the machine it came
from.  It doesn't tell you what is correct for the current machine.

Assuming you represented little endian as 0 and big endian as 1, you could
always figure out whether to byteswap like this:

    byteswap = data_endian ^ host_endian

Do you want to have an __array_endian__ where 0 indicates "little endian",
1 indicates "big endian", and the default is whatever the current host
machine uses?  I think this would work for a lot of cases.
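
A minimal sketch of the consumer side check, assuming that 0/1
convention and an array object a:

    import sys

    host_endian = {"little": 0, "big": 1}[sys.byteorder]
    data_endian = getattr(a, "__array_endian__", host_endian)
    byteswap = data_endian ^ host_endian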

A limitation of this approach is that it can't adequately represent
struct/record arrays where some fields are big endian and others are little
endian.



> 
> Bool               -- "b%d" % sizeof(bool)
> Signed Integer     -- "i%d" % sizeof(<some int>)
> Unsigned Integer   -- "u%d" % sizeof(<some uint>)
> Float              -- "f%d" % sizeof(<some float>)
> Complex            -- "c%d" % sizeof(<some complex>)
> Object             -- "O%d" % sizeof(PyObject *)
>          --- this would only be useful on shared memory
> String             -- "S%d" % itemsize
> Unicode            -- "U%d" % itemsize
> Void               -- "V%d" % itemsize   
> 

The above is a nice start at reinventing the struct module typecodes.  If
you and Perry agree to it, that would be great.  A few additions though:

I think you're proposing that "struct" or "record" arrays would be a
concatenation of the above strings.  If so, you'll need an indicator for
padding bytes.  (You probably know this, but structs in C frequently have
wasted bytes inserted by the compiler to make sure data is aligned on the
machine addressable boundaries.)
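
One possibility is to reuse the "V%d" void code from your table as the
pad indicator (my assumption, not something you proposed):

    # a C struct { char c; float x; } typically occupies 8 bytes:
    # one byte of data, three bytes of compiler padding, and a four
    # byte float.  Reusing the void code, that could be spelled:
    __array_typestr__ == "i1V3f4"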

I also assume that you intend the ("c%d" % itemsize) to always represent
complex floating point numbers.  That leaves my favorite example of complex
short integer data with no way to be represented...  I guess I could get by
with "i2i2".

How about not having a complex type explicitly, but representing complex
data as something like:

     __array_typestr__ = "f4f4"
     __array_names__ = ("real", "imag")

Just a thought...  I do like it though.


I think that both Numarray and Numeric3 are planning on storing booleans in
a full byte.  A typecode for tightly packed bits wouldn't go unused
however...


> 
> 1) How do you support > 2Gb memory mapped arrays on 32 bit systems
> and other large-object arrays only a part of which are in memory at
> any given time
> 

Doing this well is a lot like implementing mmap in user space.  I think
this is a modification to the buffer protocol, not the array protocol.  It
would add a bit of complexity if you want to deal with it, but it is
doable.

Instead of just grabbing a pointer to the whole thing, you need to ask the
object to "page in" ranges of the data and give you a pointer that is only
valid in that range.  Then when you're done with the pointer, you need to
explicitly tell the object so that it can write back if necessary and
release the memory for other requests.  Do you think Numeric3 or Numarray
would support this?  I think it would be very cool functionality to have.
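
To sketch the kind of interface I mean (names entirely hypothetical):

    class PagedBuffer:
        # hypothetical protocol for objects too large to map whole
        def lock(self, offset, length):
            """Page in [offset, offset + length) and return a buffer
            valid only for that range."""
        def unlock(self, offset, length):
            """Write the range back if it is dirty and release the
            memory for other requests."""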

>
> (there is an equivalent problem for > 8 Eb (exabytes) on 64 bit
> systems, an Exabyte is 2^60 bytes or a giga-giga-byte).
> 

I think it will be at least 10-20 years before we could realistically
exceed a 64 bit address space.  Probably a lot longer.  That's a billion
times more RAM than any machine I've ever worked on, and it's a million
times more bytes than any RAID set I've worked with.  Are there any
supercomputers approaching this level?  Even at Moore's law rates, I'm
not worried about that one just yet.

> 
> But,  I've been thinking about the array protocol and thinking that
> it would be a good thing if this became universal.  One of the ways
> to make it universal is by having something that follows it in the
> Python core.
> 
> So, what if we proposed for the Python core not something like
> Numeric3 (which would still exist in scipy.base and be everybody's
> favorite array :-) ), but a very minimal array object (scaled back
> even from Numeric) that followed the array protocol and had some
> C-API associated with it.
> 
> This minimal array object would support 5 basic types ('bool',
> 'integer', 'float', 'complex', 'Object').   (Maybe a void type
> could be defined and a void "scalar" introduced (which would be
> the bytes object)).  These types correspond to scalars already
> available in Python and so the whole 0-dim array Python scalar
> arguments could be ignored.   
>

I really like this idea.  It could easily be implemented in C or in
Python.  Since half its purpose is documentation, the Python
implementation might make more sense.

Additionally, a module that understood the defaults and did the right thing
with the metadata attributes would be useful:

    def get_ndims(a):
        return len(a.__array_shape__)

    def get_offset(a):
        if hasattr(a, "__array_offset__"):
            return a.__array_offset__
        return 0

    def get_itemsize(a):
        # assumes a simple typestr like "f8" (trailing digits = bytes)
        return int(a.__array_typestr__[1:])

    def get_strides(a):
        if hasattr(a, "__array_strides__"):
            return a.__array_strides__
        # build the default (C contiguous) strides from the shape
        strides = []
        stride = get_itemsize(a)
        for dim in reversed(a.__array_shape__):
            strides.insert(0, stride)
            stride *= dim
        return tuple(strides)

    def _strides_match(shape, strides, itemsize):
        # do the strides grow outward from the itemsize in this order?
        expected = itemsize
        for dim, stride in zip(shape, strides):
            if stride != expected:
                return False
            expected *= dim
        return True

    def is_c_contiguous(a):
        # C order: the last dimension varies fastest
        return _strides_match(tuple(reversed(a.__array_shape__)),
                              tuple(reversed(get_strides(a))),
                              get_itemsize(a))

    def is_fortran_contiguous(a):
        # Fortran order: the first dimension varies fastest
        return _strides_match(a.__array_shape__, get_strides(a),
                              get_itemsize(a))

etc...

These functions could be useful for third party libraries to work with *any*
of the array packages.



> 
> An alternative would be to "add" multidimensionality to the array
> object already part of Python, fix its reallocating with an exposed
> buffer problem, and add the array protocol.
> 

I'd recommend not breaking backward compatibility on the array.array
object, but adding the __array_*metadata*__ attributes wouldn't hurt
anything.  (The __array_shape__ would always be a tuple of length one, but
that's allowed...).
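
A sketch of the idea, deriving the metadata from the existing
array.array API:

    import array

    a = array.array('d', [1.0, 2.0, 3.0])

    shape = (len(a),)               # __array_shape__
    strides = (a.itemsize,)         # __array_strides__
    typestr = "f%d" % a.itemsize    # __array_typestr__, "f8" here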



Magnus Lie Hetland wrote:
> 
> Wohoo! Niiice :)
> 
> (Okay, a bit "me too"-ish, but I just wanted to contribute some
> enthusiasm ;)
> 

I completely agree!  :-)



Cheers,
    -Scott