I got back late last night, and there were lots of things I wanted to comment on. I've put parts of several threads into this one message since they're all dealing with the same general topic: Perry Greenfield wrote:
I agree that it's going to be difficult to have general support for large PyBufferProcs objects until the Python core is made 64 bit clean. But specific support can be added for buffer types that are known in advance. For instance, the bytes object PEP proposes an alternate way to get a 64 bit length, and similar support could easily be added to Numarray.memory, mmap.mmap, and whatever else on a case by case basis. So you could get a 64 bit pointer from some types of buffers before the rest of Python becomes 64 bit clean. If the ndarray consumer (wxWindows for instance) doesn't recognize the particular implementation, it has to stick with the limitations of the standard PyBufferProcs and assume a 32 bit length would suffice. Travis Oliphant wrote:
I prefer __array_data__ (it's a common name for Numeric and numarray, and it can be interpreted as a sequence object if desired).
So long as everyone agrees it doesn't matter what name it is. Sounds like __array_data__ works for everyone.
I also like __array_typestr__ or __array_typechar__ better as a name.
A name is a name as far as I'm concerned. The name __array_typestr__ works for me. The name __array_typechar__ implies a single character, and that won't be true.
Nothing in this array protocol should *require* internal changes to either Numeric3 or Numarray. I suspect Numarray is going to keep its type hierarchy, and Numeric3 can use single character codes for its representation if it wants. However, both Numeric3 and Numarray might (probably would) have to translate their internal array type specifiers into the agreed upon "type code string" when reporting out this attribute. The important qualities __array_typestr__ should have are: 1) Everyone should agree on the interpretation. It needs to be documented somewhere. Third party libraries should get the same __array_typestr__ from Numarray as they do from Numeric3. 2) It should be sufficiently general in its capabilities to describe a wide category of array types. Simple things should be simple, and harder things should be possible. An ndarray of doubles should have a simple, common, well recognized value for __array_typestr__. An ndarray of multi-field structs should be representable too.
I'm not married to this one. I don't know if Numarray or Numeric3 will ever do such a thing, but I can imagine more complicated schemes of arranging the data than offset/shape/strides are capable of representing. So this is forward compatibility with "Numarric4" :-). Pretty hypothetical, but imagine that Numarric4 can typically represent its data with offset/shape/strides, but for more advanced operations that falls apart. I could bore you with a detailed example... The idea is that if array consumers like wxPython were aware that more complicated implementations can occur in the future, they could gracefully bow out and raise an exception instead of incorrectly interpreting the data. If you need it later, you can't easily add it after the fact. Take it or leave it I guess - it's possibly a YAGNI.
I don't understand what you are proposing here. Why would you want to represent the same information two different ways? Perry Greenfield wrote:
I think __array_typestr__ should accurately represent the internal representation. It is not intended for typical end users. The whole of the __array_*metadata*__ stuff is intended for third party libraries like wxPython or PIL to be able to grab a pointer to the data, calculate offsets, and cast it to the appropriate type without writing lots of special case code to handle the differences between Numeric, Numarray, Numeric3, and whatever else.
Typical users would call whatever attribute or method you prefer (.type() or .typecode() for instance), and the type representation could be classes or typecodes or whatever you think is best. The __array_typestr__ attribute is not for typical users (unless they start to care about the details under the hood). It's for libraries that need to know what's going on in a generic fashion. You don't have to store this attribute as separate data; it can be a property style attribute that calculates its value dynamically from your own internal representation. Francesc Altet wrote:
I really like this idea. Although I agree with David M. Cooke that it should be a tuple of names. Unless there is a use case I'm not considering, it would be preferable if the names were restricted to valid Python identifiers. Travis Oliphant wrote:
The struct module has a portable set of typecodes. They call it "standard", but it's the same thing. The struct module lets you specify either standard or native. For instance, the typecode for a "standard long" ("=l") is always 4 bytes, while a "native long" ("@l") is likely to be 4 or 8 bytes depending on the platform. The __array_typestr__ codes should require the "standard" sizes. There is a table at the bottom of the documentation that goes into detail: http://docs.python.org/lib/module-struct.html The only problem with the struct module is that it's missing a few types... (long double, PyObject, unicode, bit).
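For what it's worth, the standard/native difference is easy to see from the struct module itself (the standard size of "l" is documented as 4 bytes; the native size genuinely varies by platform):

```python
import struct

# "standard" sizes are fixed across platforms...
standard_long = struct.calcsize("=l")   # always 4 bytes
# ...while "native" sizes follow the platform's C compiler
native_long = struct.calcsize("@l")     # 4 or 8 bytes depending on platform

print(standard_long, native_long)
```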
This has the problem you were just describing. Specifying "byteswapped" like this only tells you if the data was reversed on the machine it came from. It doesn't tell you what is correct for the current machine. Assuming you represented little endian as 0 and big endian as 1, you could always figure out whether to byteswap like this: byteswap = data_endian ^ host_endian Do you want to have an __array_endian__ where 0 indicates "little endian", 1 indicates "big endian", and the default is whatever the current host machine uses? I think this would work for a lot of cases. A limitation of this approach is that it can't adequately represent struct/record arrays where some fields are big endian and others are little endian.
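The XOR test above can be sketched in a few lines, deriving the host value from sys.byteorder (the 0-for-little, 1-for-big encoding is just the convention proposed in this thread, not anything standard):

```python
import sys

# proposed convention: 0 = little endian, 1 = big endian
host_endian = 0 if sys.byteorder == "little" else 1
data_endian = 1  # say the producer reports big-endian data

# byteswap exactly when the data's endianness differs from the host's
need_byteswap = data_endian ^ host_endian
```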
The above is a nice start at reinventing the struct module typecodes. If you and Perry agree to it, that would be great. A few additions though: I think you're proposing that "struct" or "record" arrays would be a concatenation of the above strings. If so, you'll need an indicator for padding bytes. (You probably know this, but structs in C frequently have wasted bytes inserted by the compiler to make sure data is aligned on the machine addressable boundaries.) I also assume that you intend the ("c%d" % itemsize) to always represent complex floating point numbers. That leaves my favorite example of complex short integer data with no way to be represented... I guess I could get by with "i2i2". How about not having a complex type explicitly, but representing complex data as something like: __array_typestr__ = "f4f4" __array_names__ = ("real", "imag") Just a thought... I do like it though. I think that both Numarray and Numeric3 are planning on storing booleans in a full byte. A typecode for tightly packed bits wouldn't go unused however...
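To illustrate the two-float idea, here is what a consumer could do with one such "f4f4" element using plain struct (the "real"/"imag" field layout is the hypothetical one suggested above):

```python
import struct

# pack one record of two little-endian 4-byte floats: ("real", "imag")
buf = struct.pack("<ff", 1.5, -2.0)

# a consumer that knows the field names can rebuild the complex value
real, imag = struct.unpack("<ff", buf)
z = complex(real, imag)
```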
Doing this well is a lot like implementing mmap in user space. I think this is a modification to the buffer protocol, not the array protocol. It would add a bit of complexity if you want to deal with it, but it is doable. Instead of just grabbing a pointer to the whole thing, you need to ask the object to "page in" ranges of the data and give you a pointer that is only valid in that range. Then when you're done with the pointer, you need to explicitly tell the object so that it can write back if necessary and release the memory for other requests. Do you think Numeric3 or Numarray would support this? I think it would be very cool functionality to have.
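Something like the following toy class (all names hypothetical, backed here by an ordinary file) is the "page in / release" dance described above:

```python
# A sketch of a "windowed" buffer: the consumer asks the object to page in
# a byte range, gets a pointer/chunk valid only for that range, and must
# explicitly release it so the object can write back and free the memory.
class PagedBuffer:
    def __init__(self, path, filesize):
        self._file = open(path, "r+b")
        self._size = filesize

    def acquire(self, offset, length):
        # page in [offset, offset+length) and hand back a mutable chunk
        self._file.seek(offset)
        return bytearray(self._file.read(length))

    def release(self, offset, chunk):
        # write the (possibly modified) chunk back and free the window
        self._file.seek(offset)
        self._file.write(chunk)
```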
(there is an equivalent problem for > 8 Eb (exabytes) on 64 bit systems; an exabyte is 2^60 bytes, or a giga-gigabyte).
I think it will be at least 10-20 years before we could realistically exceed a 64 bit address space. Probably a lot longer. That's a billion times more RAM than any machine I've ever worked on, and a million times more bytes than any RAID set I've worked with. Are there any supercomputers approaching this level? Even at Moore's law rates, I'm not worried about that one just yet.
I really like this idea. It could easily be implemented in C or in a Python script. Since half its purpose is documentation, the Python implementation might make more sense. Additionally, a module that understood the defaults and did the right thing with the metadata attributes would be useful:

```python
def get_ndims(a):
    return len(a.__array_shape__)

def get_offset(a):
    if hasattr(a, "__array_offset__"):
        return a.__array_offset__
    return 0

def get_strides(a):
    if hasattr(a, "__array_strides__"):
        return a.__array_strides__
    # build the default (C-contiguous) strides from the shape
    ...

def is_c_contiguous(a):
    shape = a.__array_shape__
    strides = get_strides(a)
    # determine if the strides indicate it is contiguous
    ...

def is_fortran_contiguous(a):
    # similar to is_c_contiguous
    ...

# etc...
```

These functions could be useful for third party libraries to work with *any* of the array packages.
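As a quick test of the idea, even the stdlib array.array could be wrapped to expose the attributes that such generic helpers expect (a sketch using the names proposed in this thread, not an actual stdlib patch):

```python
import array

class ProtocolArray(array.array):
    # always rank one, so the shape is a tuple of length one
    @property
    def __array_shape__(self):
        return (len(self),)

    # e.g. "i4" for a 4-byte signed int; built from array.array's own
    # typecode and itemsize (the exact typestr spelling is the one being
    # debated in this thread)
    @property
    def __array_typestr__(self):
        return "%s%d" % (self.typecode, self.itemsize)
```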
I'd recommend not breaking backward compatibility on the array.array object, but adding the __array_*metadata*__ attributes wouldn't hurt anything. (The __array_shape__ would always be a tuple of length one, but that's allowed...). Magnus Lie Hetland wrote:
I completely agree! :-) Cheers, -Scott
I'm very much with the opinions of Scott. Just some remarks. A Divendres 01 Abril 2005 06:12, Scott Gilbert va escriure:
Ok. I was thinking of easing the life of C extension writers, but I agree that a tuple of names could be dealt with relatively easily in C as well. However, as the __array_typestr__ would be a plain string, an __array_names__ that is also a plain string would be consistent with that. Also, it would be worth knowing how to express a record of different shaped fields. I mean, how to represent a record like: [array(Int32,shape=(2,3)), array(Float64,shape=(3,))] The possibilities are: __array_shapes__ = ((2,3),(3,)) __array_typestr__ = ("i","d") Another possibility could be an extension of the current struct format: __array_typestr__ = "(2,3)i(3,)d" More on that later on.
I fully agree with Scott here. Struct typecodes offer a way to stay close to the Python standards, and this is a good thing for the many developers who know nothing of array packages and their different typecodes. IMO, the portable set of typecodes in the struct module should only be abandoned if they cannot fulfil all the requirements of Numeric3/numarray. But I'm pretty confident that they eventually will.
The only problem with the struct module is that it's missing a few types... (long double, PyObject, unicode, bit).
Well, bit is not used in Numeric/numarray either, and I think few people would complain about this (they can always pack bits into bytes). PyObject and unicode can be reduced to a sequence of bytes, and some other metadata can be added to the array protocol to complement their meaning (say __array_str_encoding__ = "UTF-8" or similar). long double is the only type that should be added to the struct typecodes, but convincing the Python crew to do that should not be difficult, I guess.
Having a mix of data values of different endianness in the same record would be a bit ill-advised. In fact, numarray does not support this: a recarray must be all little endian or all big endian. I think that '<' and '>' would be more than enough to represent this.
Again, I think it would be better not to get away from the struct typecodes. But if you end up doing it, well, I would like to propose a couple of additions to the new protocol: 1.- Support shapes in record specifications. I'm listing two possibilities: A) __array_typestr__ = "(2,3)i(3,)d" This would be an easy extension of the struct string type definition. B) __array_typestr__ = ("i4","f8") __array_shapes__ = ((2,3),(3,)) This is more 'à la numarray'. 2.- Allow nested datatypes. Although numarray does not support this yet, I think it could be very advantageous to be able to express: [array(Int32,shape=(5,)),[array(Int16,shape=(2,)),array(Float32,shape=(3,4))]] i.e., the first field would be an array of five ints, while the second field would actually be another record made of 2 fields: one array of short ints, and another array of single precision floats. I'm not sure how exactly to implement this, but what about: A) __array_typestr__ = "(5,)i[(2,)h(3,4)f]" B) __array_typestr__ = ("i4",("i2","f4")) __array_shapes__ = ((5,),((2,),(3,4))) Because I'm suggesting we adhere to the struct specification, I prefer option A), although I guess option B would be easier to use for developers (even for extension developers).
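The flat part of option A) is already mechanical to parse. A rough sketch (the format itself is only the proposal above, and this toy parser ignores nesting brackets) of turning "(2,3)i(3,)d" into (shape, typecode) pairs:

```python
import re

# one field: an optional "(d1,d2,...)" shape prefix followed by a typecode
_FIELD = re.compile(r"(?:\((\d+(?:,\d+)*),?\))?([a-zA-Z])")

def parse_typestr(s):
    """Parse a flat "(2,3)i(3,)d"-style record string (hypothetical format)."""
    fields = []
    for shape_str, code in _FIELD.findall(s):
        shape = tuple(int(n) for n in shape_str.split(",")) if shape_str else ()
        fields.append((shape, code))
    return fields
```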
Yeah, I fully agree with this also. Cheers, --
Coming in very late... On Apr 1, 2005, at 4:46 AM, Francesc Altet wrote:
I'm very much with the opinions of Scott. Just some remarks.
A Divendres 01 Abril 2005 06:12, Scott Gilbert va escriure:
Nothing intrinsically prevents numarray from allowing this for records, but I'd agree that I have a hard time understanding when a given record array would have mixed endianness.
I'm not against it, but I wonder if it is the most important thing to do next. I can imagine that there are many other issues that deserve more attention than this. But I won't tell Travis what to do, obviously. Likewise about working on the current Python array module. Perry
participants (3):

- Francesc Altet
- Perry Greenfield
- Scott Gilbert