I got back late last night, and there were lots of things I wanted to comment on. I've put parts of several threads into this one message since they're all dealing with the same general topic: Perry Greenfield wrote:
I'm not sure how the support for large data sets should be handled. I generally think that it will be very awkward to handle these until Python does as well. Speaking of which...
I agree that it's going to be difficult to have general support for large PyBufferProcs objects until the Python core is made 64 bit clean. But specific support can be added for buffer types that are known in advance. For instance, the bytes object PEP proposes an alternate way to get a 64 bit length, and similar support could easily be added to Numarray.memory, mmap.mmap, and whatever else on a case by case basis. So you could get a 64 bit pointer from some types of buffers before the rest of Python becomes 64 bit clean. If the ndarray consumer (wxWindows for instance) doesn't recognize the particular implementation, it has to stick with the limitations of the standard PyBufferProcs and assume that a 32 bit length suffices.
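To make that concrete, here is a minimal Python sketch of the fallback logic a consumer might use; the large_length() accessor named here is purely hypothetical and not part of any existing API:

    def get_buffer_length(buf):
        # Hypothetical: a buffer type known to support 64 bit sizes
        # could expose its own accessor (the name is made up here).
        if hasattr(buf, "large_length"):
            return buf.large_length()
        # Otherwise fall back to the standard sequence length, which
        # is limited to 32 bits on many platforms.
        return len(buf)

Travis Oliphant wrote: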
I prefer __array_data__ (it's a common name for Numeric and numarray, and it can be interpreted as a sequence object if desired).
So long as everyone agrees, it doesn't matter what the name is. Sounds like __array_data__ works for everyone.
I also like __array_typestr__ or __array_typechar__ better as a name.
A name is a name as far as I'm concerned. The name __array_typestr__ works for me. The name __array_typechar__ implies a single character, and that won't be true.
Don't like 'D' for long double. Complex floats are already using it. I'm not sure I like the idea of moving to two-character typecodes at this point because it indicates more internal changes to Numeric3 (otherwise we have two typecharacter standards, which is not a good thing). What is wrong with 'g' and 'G' for long double and complex long double respectively?
Nothing in this array protocol should *require* internal changes to either Numeric3 or Numarray. I suspect Numarray is going to keep its type hierarchy, and Numeric3 can use single character codes for its representation if it wants. However, both Numeric3 and Numarray might (probably would) have to translate their internal array type specifiers into the agreed upon "type code string" when reporting out this attribute. The important qualities __array_typestr__ should have are:

1) Everyone should agree on the interpretation, and it needs to be documented somewhere. Third party libraries should get the same __array_typestr__ from Numarray as they do from Numeric3.

2) It should be sufficiently general to describe a wide category of array types. Simple things should be simple, and harder things should be possible. An ndarray of double should have a simple, common, well recognized value for __array_typestr__. An ndarray of multi-field structs should be representable too.
__array_complicated__
I don't see the utility here, I guess. If it can't be described by a shape/strides combination, then how can it participate in the protocol?
I'm not married to this one. I don't know if Numarray or Numeric3 will ever do such a thing, but I can imagine more complicated schemes of arranging the data than offset/shape/strides are capable of representing. So this is forward compatibility with "Numarric4" :-). Pretty hypothetical, but imagine that typically Numarric4 can represent its data with offset/shape/strides, but for more advanced operations that falls apart. I could bore you with a detailed example... The idea is that if array consumers like wxPython were aware that more complicated implementations can occur in the future, they could gracefully bow out and raise an exception instead of incorrectly interpreting the data. If you don't reserve it now and need it later, you can't easily add it after the fact. Take it or leave it I guess - it's possibly a YAGNI.
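As a sketch, the graceful bow-out could be as simple as the following (assuming __array_complicated__ would default to false when absent):

    def consume_array(a):
        # Forward compatibility: if a future package flags its layout
        # as not representable by offset/shape/strides, refuse to
        # guess rather than misinterpret the data.
        if getattr(a, "__array_complicated__", False):
            raise TypeError("array layout too complicated for the "
                            "offset/shape/strides protocol")
        # ... otherwise interpret offset/shape/strides as usual ...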
After more thought, I think here we need to also allow a "c-type"-independent way of describing an array (i.e. numarray introduced 'c4' for a complex-valued 4 byte itemsize array). So, perhaps __array_ctypestr__ and __array_typestr__ should be two ways to get the information (or overload the __array_typestr__ interface and require consumers to accept either style).
I don't understand what you are proposing here. Why would you want to represent the same information two different ways? Perry Greenfield wrote:
I think we need to think about what the typecharacter is supposed to represent. Is it the value as the user will see it, or does it indicate what the internal representation is? These are two different things.
I think __array_typestr__ should accurately describe the internal representation. It is not intended for typical end users. The whole of the __array_*metadata*__ stuff is intended for third party libraries like wxPython or PIL to be able to grab a pointer to the data, calculate offsets, and cast it to the appropriate type without writing lots of special-case code to handle the differences between Numeric, Numarray, Numeric3, and whatever else.
Then again, I'm not sure how this info is exposed to the user; if it is appropriately handled by intermediate code it may not matter. For example, if this corresponds to what the user will see for the type, I think it is bad. Most of the time they don't care what the internal representation is; they just want to know if it is Int16 or whatever. With the two combined, they have to test for both variants.
Typical users would call whatever attribute or method you prefer (.type() or .typecode() for instance), and the type representation could be classes or typecodes or whatever you think is best. The __array_typestr__ attribute is not for typical users (unless they start to care about the details under the hood). It's for libraries that need to know what's going on in a generic fashion. You don't have to store this attribute as separate data; it can be a property-style attribute that calculates its value dynamically from your own internal representation.
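For example, a package could expose it as a computed property along these lines (a sketch only; the internal _kind/_itemsize names are invented for illustration):

    class MyArray(object):
        def __init__(self, kind, itemsize):
            self._kind = kind          # e.g. "f" for a float type
            self._itemsize = itemsize  # e.g. 8 for a C double

        @property
        def __array_typestr__(self):
            # Translate the internal representation into the agreed
            # type code string on the fly.
            return "%s%d" % (self._kind, self._itemsize)

Francesc Altet wrote: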
Considering that heterogeneous data is to be supported as well, and there is some tradition of assigning names to the different fields, I wonder if it would not be good to add something like:
__array_names__ (optional comma-separated names for record fields)
I really like this idea, although I agree with David M. Cooke that it should be a tuple of names. Unless there is a use case I'm not considering, it would be preferable if the names were restricted to valid Python identifiers. Travis Oliphant wrote:
After more thought, I think using the struct-like typecharacters is not a good idea for the array protocol. I think the character codes used by the numarray record array (kind_character + byte_width) are better. Commas can separate heterogeneous data. The problem is that if the data buffer originally came from a different machine or was saved with a different compiler (e.g. a mmap'ed file), then the struct-like typecodes only tell you the c-type that machine thought the data was. They do not tell you how to interpret the data on this machine.
The struct module has a portable set of typecodes. They call it "standard", but it's the same thing. The struct module lets you specify either standard or native. For instance, the typecode for "standard long" ("=l") is always 4 bytes while a "native long" ("@l") is likely to be 4 or 8 bytes depending on the platform. The __array_typestr__ codes should require the "standard" sizes. There is a table at the bottom of the documentation that goes into detail: http://docs.python.org/lib/module-struct.html The only problem with the struct module is that it's missing a few types... (long double, PyObject, unicode, bit).
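For example (this is the existing struct module behavior, not anything new):

    import struct

    # "=l" is the standard long: always 4 bytes, on every platform.
    print(struct.calcsize("=l"))   # -> 4
    # "@l" is the native long: 4 or 8 bytes depending on platform.
    print(struct.calcsize("@l"))   # -> 4 or 8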
I also think that rather than attach < or > to the start of the string it would be easier to have another protocol for endianness. Perhaps something like:
__array_endian__ (optional Python integer with the value 1 in it). If it is not 1, then a byteswap must be necessary.
This has the problem you were just describing. Specifying "byteswapped" like this only tells you if the data was reversed on the machine it came from. It doesn't tell you what is correct for the current machine. Assuming you represented little endian as 0 and big endian as 1, you could always figure out whether to byteswap like this: byteswap = data_endian ^ host_endian Do you want to have an __array_endian__ where 0 indicates "little endian", 1 indicates "big endian", and the default is whatever the current host machine uses? I think this would work for a lot of cases. A limitation of this approach is that it can't adequately represent struct/record arrays where some fields are big endian and others are little endian.
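A sketch of that rule, assuming the hypothetical encoding of 0 for little endian and 1 for big endian:

    import sys

    host_endian = 0 if sys.byteorder == "little" else 1

    def needs_byteswap(a):
        # Default to the host byte order if the (optional)
        # __array_endian__ attribute is absent.
        data_endian = getattr(a, "__array_endian__", host_endian)
        return bool(data_endian ^ host_endian)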
Bool             -- "b%d" % sizeof(bool)
Signed Integer   -- "i%d" % sizeof(<some int>)
Unsigned Integer -- "u%d" % sizeof(<some uint>)
Float            -- "f%d" % sizeof(<some float>)
Complex          -- "c%d" % sizeof(<some complex>)
Object           -- "O%d" % sizeof(PyObject *)   --- this would only be useful on shared memory
String           -- "S%d" % itemsize
Unicode          -- "U%d" % itemsize
Void             -- "V%d" % itemsize
The above is a nice start at reinventing the struct module typecodes. If you and Perry agree to it, that would be great. A few additions though: I think you're proposing that "struct" or "record" arrays would be a concatenation of the above strings. If so, you'll need an indicator for padding bytes. (You probably know this, but structs in C frequently have wasted bytes inserted by the compiler to make sure data is aligned on machine addressable boundaries.) I also assume that you intend ("c%d" % itemsize) to always represent complex floating point numbers. That leaves my favorite example of complex short integer data with no way to be represented... I guess I could get by with "i2i2". How about not having a complex type explicitly, but representing complex data as something like:

    __array_typestr__ = "f4f4"
    __array_names__ = ("real", "imag")

Just a thought... I do like it though. I think that both Numarray and Numeric3 are planning on storing booleans in a full byte. A typecode for tightly packed bits wouldn't go unused, however...
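For what it's worth, the complex-shorts case combined with the names idea would look something like this (using the proposed, not yet standardized, attributes):

    class ComplexShorts(object):
        # Each element is a record of two signed 2-byte integers,
        # with the field names carried in the proposed tuple.
        __array_typestr__ = "i2i2"
        __array_names__ = ("real", "imag")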
1) How do you support > 2Gb memory mapped arrays on 32 bit systems and other large-object arrays, only a part of which are in memory at any given time?
Doing this well is a lot like implementing mmap in user space. I think this is a modification to the buffer protocol, not the array protocol. It would add a bit of complexity if you want to deal with it, but it is doable. Instead of just grabbing a pointer to the whole thing, you need to ask the object to "page in" ranges of the data and give you a pointer that is only valid in that range. Then when you're done with the pointer, you need to explicitly tell the object so that it can write back if necessary and release the memory for other requests. Do you think Numeric3 or Numarray would support this? I think it would be very cool functionality to have.
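Purely as a hypothetical sketch (no such protocol exists today), the producer side might look like:

    class PagedBuffer(object):
        # Hypothetical interface for the "page in a range" idea.
        def acquire(self, offset, length):
            # Return a buffer valid only for [offset, offset+length).
            raise NotImplementedError

        def release(self, page, dirty=False):
            # The consumer is done; write back if dirty, then free
            # the memory for other requests.
            raise NotImplementedError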
(There is an equivalent problem for > 8 EB (exabytes) on 64 bit systems; an exabyte is 2^60 bytes, or a giga-giga-byte.)
I think it will be at least 10-20 years before we could realistically exceed a 64 bit address space. Probably a lot longer. That's a billion times more RAM than any machine I've ever worked on, and a million times more bytes than any RAID set I've worked with. Are there any supercomputers approaching this level? Even at Moore's law rates, I'm not worried about that one just yet.
But, I've been thinking about the array protocol and thinking that it would be a good thing if this became universal. One of the ways to make it universal is by having something that follows it in the Python core.
So, what if we proposed for the Python core not something like Numeric3 (which would still exist in scipy.base and be everybody's favorite array :-) ), but a very minimal array object (scaled back even from Numeric) that followed the array protocol and had some C-API associated with it.
This minimal array object would support 5 basic types ('bool', 'integer', 'float', 'complex', 'Object'). (Maybe a void type could be defined and a void "scalar" introduced (which would be the bytes object)). These types correspond to scalars already available in Python and so the whole 0-dim array Python scalar arguments could be ignored.
I really like this idea. It could easily be implemented in C or in Python. Since half its purpose is documentation, the Python implementation might make more sense. Additionally, a module that understood the defaults and did the right thing with the metadata attributes would be useful:

    def get_ndims(a):
        return len(a.__array_shape__)

    def get_offset(a):
        if hasattr(a, "__array_offset__"):
            return a.__array_offset__
        return 0

    def get_strides(a):
        if hasattr(a, "__array_strides__"):
            return a.__array_strides__
        # build the default strides from the shape

    def is_c_contiguous(a):
        shape = a.__array_shape__
        strides = get_strides(a)
        # determine if the strides indicate it is contiguous

    def is_fortran_contiguous(a):
        # similar to is_c_contiguous
        ...

    etc...

These functions could be useful for third party libraries to work with *any* of the array packages.
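As a quick usage sketch (assuming an object that provides the required attributes, and using only the helpers that are fully spelled out above):

    class Fake(object):
        __array_shape__ = (3, 4)
        __array_typestr__ = "f8"

    print(get_ndims(Fake()))   # -> 2
    print(get_offset(Fake()))  # -> 0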
An alternative would be to "add" multidimensionality to the array object already part of Python, fix its reallocating-with-an-exposed-buffer problem, and add the array protocol.
I'd recommend not breaking backward compatibility on the array.array object, but adding the __array_*metadata*__ attributes wouldn't hurt anything. (The __array_shape__ would always be a tuple of length one, but that's allowed...)
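Since array.array itself can't grow attributes, a thin wrapper gives the flavor of what this would look like (a sketch; the "f%d" translation assumes the kind + byte-width codes discussed above and only holds for float typecodes):

    import array

    class ArrayWithMeta(object):
        def __init__(self, data):
            self._data = data
            self.__array_shape__ = (len(data),)  # always length one
            self.__array_typestr__ = "f%d" % data.itemsize
            self.__array_data__ = data

    a = ArrayWithMeta(array.array("d", [1.0, 2.0, 3.0]))
    print(a.__array_shape__)    # -> (3,)
    print(a.__array_typestr__)  # -> f8

Magnus Lie Hetland wrote: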
Wohoo! Niiice :)
(Okay, a bit "me too"-ish, but I just wanted to contribute some enthusiasm ;)
I completely agree! :-) Cheers, -Scott
I'm very much in agreement with Scott's opinions. Just some remarks. On Friday, 01 April 2005 06:12, Scott Gilbert wrote:
__array_names__ (optional comma-separated names for record fields)
I really like this idea, although I agree with David M. Cooke that it should be a tuple of names. Unless there is a use case I'm not considering, it would be preferable if the names were restricted to valid Python identifiers.
Ok. I was thinking of easing the life of C extension writers, but I agree that a tuple of names could be dealt with relatively easily in C as well. However, as __array_typestr__ would be a plain string, an __array_names__ that is also a plain string would be consistent with that. Also, it would be worth knowing how to express a record of differently shaped fields. I mean, how to represent a record like:

    [array(Int32, shape=(2,3)), array(Float64, shape=(3,))]

The possibilities are:

    __array_shapes__ = ((2,3), (3,))
    __array_typestr__ = ("i", "d")

Another possibility would be an extension of the current struct format:

    __array_typestr__ = "(2,3)i(3,)d"

More on that later on.
The struct module has a portable set of typecodes. They call it "standard", but it's the same thing. The struct module lets you specify either standard or native. For instance, the typecode for "standard long" ("=l") is always 4 bytes while a "native long" ("@l") is likely to be 4 or 8 bytes depending on the platform. The __array_typestr__ codes should require the "standard" sizes. There is a table at the bottom of the documentation that goes into detail:
I fully agree with Scott here. Struct typecodes offer a way to stay close to the Python standards, and this is a good thing for the many developers who know nothing of array packages and their different typecodes. IMO, the portable set of typecodes in the struct module should only be abandoned if they cannot fulfil all the requirements of Numeric3/numarray. But I'm pretty confident that they eventually will.
The only problem with the struct module is that it's missing a few types... (long double, PyObject, unicode, bit).
Well, bit is not used in Numeric/numarray either, and I think few people would complain about this (they can always pack bits into bytes). PyObject and unicode can be reduced to a sequence of bytes, and some additional metadata could be added to the array protocol to complement their meaning (say __array_str_encoding__ = "UTF-8" or similar). long double is the only type that should be added to the struct typecodes, but convincing the Python crew to do that should not be difficult, I guess.
I also think that rather than attach < or > to the start of the string it would be easier to have another protocol for endianness. Perhaps something like:
__array_endian__ (optional Python integer with the value 1 in it). If it is not 1, then a byteswap must be necessary.
A limitation of this approach is that it can't adequately represent struct/record arrays where some fields are big endian and others are little endian.
Having a mix of data values of different endianness in the same record would be a bit ill-advised. In fact, numarray does not support this: a recarray must be all little endian or all big endian. I think that '<' and '>' would be more than enough to represent this.
Bool             -- "b%d" % sizeof(bool)
Signed Integer   -- "i%d" % sizeof(<some int>)
Unsigned Integer -- "u%d" % sizeof(<some uint>)
Float            -- "f%d" % sizeof(<some float>)
Complex          -- "c%d" % sizeof(<some complex>)
Object           -- "O%d" % sizeof(PyObject *)   --- this would only be useful on shared memory
String           -- "S%d" % itemsize
Unicode          -- "U%d" % itemsize
Void             -- "V%d" % itemsize
The above is a nice start at reinventing the struct module typecodes. If you and Perry agree to it, that would be great. A few additions though:
Again, I think it would be better not to move away from the struct typecodes. But if you end up doing it, well, I would like to propose a couple of additions to the new protocol:

1.- Support shapes in the record specification. I'm listing two possibilities:

    A) __array_typestr__ = "(2,3)i(3,)d"

    This would be an easy extension of the struct string type definition.

    B) __array_typestr__ = ("i4", "f8")
       __array_shapes__ = ((2,3), (3,))

    This is more 'à la numarray'.

2.- Allow nested datatypes. Although numarray does not support this yet, I think it could be very advantageous to be able to express:

    [array(Int32, shape=(5,)), [array(Int16, shape=(2,)), array(Float32, shape=(3,4))]]

i.e., the first field would be an array of 5 ints, while the second field would actually be another record made of 2 fields: one array of short ints, and another array of single precision floats. I'm not sure exactly how to implement this, but what about:

    A) __array_typestr__ = "(5,)i[(2,)h(3,4)f]"

    B) __array_typestr__ = ("i4", ("i2", "f4"))
       __array_shapes__ = ((5,), ((2,), (3,4)))

Because I'm suggesting we adhere to the struct specification, I prefer option A), although I guess option B) would be easier for developers to use (even for extension developers).
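A rough sketch of how option A's extended string might be parsed (the grammar here, an optional "(d1,d2,...)" shape prefix before each one-character code, is my assumption, and it ignores the nested "[...]" form):

    import re

    _FIELD = re.compile(r"(?:\(([\d,]*)\))?([a-zA-Z])")

    def parse_typestr(s):
        # "(2,3)i(3,)d" -> [((2, 3), "i"), ((3,), "d")]
        fields = []
        for shape, code in _FIELD.findall(s):
            dims = tuple(int(d) for d in shape.split(",") if d)
            fields.append((dims, code))
        return fields

    print(parse_typestr("(2,3)i(3,)d"))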
So, what if we proposed for the Python core not something like Numeric3 (which would still exist in scipy.base and be everybody's favorite array :-) ), but a very minimal array object (scaled back even from Numeric) that followed the array protocol and had some C-API associated with it.
This minimal array object would support 5 basic types ('bool', 'integer', 'float', 'complex', 'Object'). (Maybe a void type could be defined and a void "scalar" introduced (which would be the bytes object)). These types correspond to scalars already available in Python and so the whole 0-dim array Python scalar arguments could be ignored.
I really like this idea. It could easily be implemented in C or in Python. Since half its purpose is documentation, the Python implementation might make more sense.
Yeah, I fully agree with this also. Cheers, --
Francesc Altet -- http://www.carabos.com/
Cárabos Coop. V. -- Enjoy Data
Coming in very late... On Apr 1, 2005, at 4:46 AM, Francesc Altet wrote:
I'm very much in agreement with Scott's opinions. Just some remarks.
On Friday, 01 April 2005 06:12, Scott Gilbert wrote:
I also think that rather than attach < or > to the start of the string it would be easier to have another protocol for endianness. Perhaps something like:
__array_endian__ (optional Python integer with the value 1 in it). If it is not 1, then a byteswap must be necessary.
A limitation of this approach is that it can't adequately represent struct/record arrays where some fields are big endian and others are little endian.
Having a mix of data values of different endianness in the same record would be a bit ill-advised. In fact, numarray does not support this: a recarray must be all little endian or all big endian. I think that '<' and '>' would be more than enough to represent this.
Nothing intrinsically prevents numarray from allowing this for records, but I'd agree that I have a hard time understanding when a given record array would have mixed endianness.
So, what if we proposed for the Python core not something like Numeric3 (which would still exist in scipy.base and be everybody's favorite array :-) ), but a very minimal array object (scaled back even from Numeric) that followed the array protocol and had some C-API associated with it.
This minimal array object would support 5 basic types ('bool', 'integer', 'float', 'complex', 'Object'). (Maybe a void type could be defined and a void "scalar" introduced (which would be the bytes object)). These types correspond to scalars already available in Python and so the whole 0-dim array Python scalar arguments could be ignored.
I really like this idea. It could easily be implemented in C or in Python. Since half its purpose is documentation, the Python implementation might make more sense.
Yeah, I fully agree with this also.
I'm not against it, but I wonder if it is the most important thing to do next. I can imagine that there are many other issues that deserve more attention than this. But I won't tell Travis what to do, obviously. Likewise about working on the current Python array module. Perry