Adding metadata at the buffer object level causes problems for "view" semantics. Let's say that everyone agreed what "itemsize" and "itemtype" meant:

    real_view = complex_array.real

The real_view would have to use a new buffer, since the two can't share the old one. The buffer used in complex_array would have a typecode like ComplexDouble and an itemsize of 16; the buffer in real_view would need a typecode of Double and an itemsize of 8. If metadata is stored with the buffer object, it can't be the same buffer object in both places. Another case would be treating a 512x512 image of 4-byte pixels as a 512x512x4 image of 1-byte RGBA elements. Or even coercing from Signed to Unsigned.

The bytes object as proposed does allow new views to be created from other bytes objects (sharing the same memory underneath), and these views could each have separate metadata, but then you wouldn't be able to have arrays that used other types of buffers. Having arrays use mmap buffers is very useful.

The bytes object shouldn't create views from arbitrary other buffer objects, because it can't rely on the general semantics of the PyBufferProcs interface. The foreign buffer object might realloc and invalidate the pointer, for instance... The current Python "buffer" builtin does this, and the results are bad. So creating a bytes object as a view on the mmap object doesn't work in the general case.

Actually, now that I think about it, the mmap object might be safe. I don't believe the current implementation of mmap does any reallocing behind the scenes, and I think the pointer stays valid for the lifetime of the object. If we verified that mmap is safe enough, bytes could make a special case out of it, but then you would be locked into bytes and mmap only. Maybe that's acceptable...

Still, I think keeping the metadata at a different level, and having the bytes object just be the Python way to spell a call to C's malloc, will avoid a lot of problems.
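To make the view problem concrete, here is a small sketch of the idea that one chunk of raw memory can carry two different interpretations when the metadata lives *outside* the buffer. The dict-based metadata layout and the read_items helper are my own illustration, not part of any proposal:

```python
import struct

# One raw "buffer" holding two complex doubles: (1+2j) and (3+4j),
# stored as four little-endian C doubles (16 bytes per complex item).
buf = struct.pack('<4d', 1.0, 2.0, 3.0, 4.0)

# The complex array's metadata, held OUTSIDE the buffer:
complex_meta = dict(itemtype='<2d', offset=0, strides=(16,), shape=(2,))

# real_view shares the very same buffer; only its metadata differs.
# It reads one 8-byte double at each 16-byte step.
real_meta = dict(itemtype='<d', offset=0, strides=(16,), shape=(2,))

def read_items(buf, meta):
    """Walk a 1-D strided layout and unpack each item."""
    out = []
    for i in range(meta['shape'][0]):
        pos = meta['offset'] + i * meta['strides'][0]
        out.append(struct.unpack_from(meta['itemtype'], buf, pos))
    return out

print(read_items(buf, complex_meta))  # [(1.0, 2.0), (3.0, 4.0)]
print(read_items(buf, real_meta))     # [(1.0,), (3.0,)]
```

If the itemtype and itemsize were properties of the buffer itself, the second view would be impossible without copying.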
Read below for how I think the metadata stuff could be handled. --- Chris Barker <Chris.Barker@noaa.gov> wrote:
There are any number of third-party extensions that could benefit from being able to directly read the data in Numeric* arrays: PIL, wxPython, etc. My personal example is wxPython:
At the moment, you can pass a Numeric or numarray array into wxPython, and it will be converted to a wxList of wxPoints (for instance), but that is done by using the generic sequence protocol, and a lot of type checking. As you can imagine, that is pretty darn slow, compared to just typecasting the data pointer and looping through it. Robin Dunn, quite reasonably, doesn't want wxPython to depend on Numeric, so that's what we've got.
My understanding of this memory object is that an extension like wxPython wouldn't need to know about Numeric, but could simply get the memory object, and there would be enough metadata with it to typecast and loop through the data. I'm a bit skeptical about how this would work. It seems that the metadata required would be the full set of stuff in an array object already:
    type
    dimensions
    strides
This could be made a bit simpler by allowing only contiguous arrays, but then there would need to be a contiguous flag.
To make use of this, wxPython would have to know a fair bit about Numeric Arrays anyway, so that it can check to see if the data is appropriate. I guess the advantage is that while the wxPython code would have to know about Numeric arrays, it wouldn't have to include Numeric headers or code.
I think being able to traffic in N-Dimensional arrays without requiring linking against the libraries is a good thing. Several years ago, I proposed a solution to this problem. Actually, I did a really poor job of proposing it and irritated a lot of people in the process. I'm embarrassed to post a link to the following thread, but here it is anyway:

    http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/1166013

Accept my apologies if you read the whole thing just now. :-) Accept my sincere apologies if you read it at the time.

I think the proposal is still relevant today, but I might revise it a bit as follows. A bare minimum N-Dimensional array for interchanging data across libraries could get by with the following attributes:

    # Create a simple record type for storing attributes
    class BearMin: pass
    bm = BearMin()

    # Set the attributes sufficient to describe a simple ndarray
    bm.buffer = <a buffer or sequence object>
    bm.shape = <a tuple of ints describing its shape>
    bm.itemtype = <a string describing the elements>

The bm.buffer and bm.shape attributes are pretty obvious. I would suggest that bm.itemtype borrow its typecodes from the Python struct module, but anything that everyone agreed on would work. (The struct module is nice because it is already documented and supports native and portable types of many sizes in both endians. It also supports composite struct types.)

Those attributes are sufficient for someone to *produce* an N-Dimensional array that could be understood by many libraries. Someone who *consumes* the data would need to know a few more:

    bm.offset = <an integer offset into the buffer>
    bm.strides = <a tuple of ints for non-contiguous or Fortran arrays>

The value of bm.offset would default to zero if it wasn't present, and the tuple bm.strides could be generated from the shape assuming it was a C-style array. Subscripting operations that returned non-contiguous views of shared data could change bm.offset to non-zero.
Subscripting would also affect bm.strides, and creating a Fortran-style array would require bm.strides to be present.

You might also choose to add bm.itemsize in addition to bm.itemtype for the case where you can describe how big the elements are, but you can't sufficiently describe what the data is using the agreed-upon typecodes. This would be uncommon. The default for bm.itemsize would come from struct.calcsize(bm.itemtype).

You might also choose to add bm.complicated for when the array layout can't be described by the shape/offset/strides combination. For instance, bm.complicated might get used when creating views from more sophisticated subscripting operations like index arrays or mask arrays, although it looks like Numeric3 plans on making new contiguous copies in those cases.

The C implementations of arrays would only have to add getattr-like methods, and the data could be stored very compactly.
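As a sketch of the defaulting rules just described (the function names here are my own, not part of the proposal), the C-style strides can be generated from the shape, and the itemsize from struct.calcsize:

```python
import struct

def default_strides(shape, itemsize):
    """C-order (row-major) strides: the last axis varies fastest."""
    strides = []
    acc = itemsize
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

# bm.itemsize would default to struct.calcsize(bm.itemtype);
# 'd' is a C double, typically 8 bytes.
itemsize = struct.calcsize('d')
print(default_strides((3, 4), itemsize))  # (32, 8) for 8-byte items
```

A consumer that finds no bm.strides attribute would compute exactly this before walking the buffer.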
From those minimum 5-7 attributes (metadata), an N-Dimensional array consumer could determine almost everything it needed to know about the data. Simple routines could determine things like iscontiguous(bm), iscarray(bm), or isfortran(bm). I expect libraries like wxPython or PIL could punt (raise an exception) when the water gets too deep.
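Such routines really would be simple. This is a hypothetical sketch of two of them, assuming the bm.* attribute names above with offset, strides, and itemsize optional as described:

```python
import struct

def _c_strides(shape, itemsize):
    # Default C-order strides: last axis varies fastest.
    strides, acc = [], itemsize
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

def iscontiguous(bm):
    # Missing offset defaults to 0; missing strides imply C order;
    # missing itemsize comes from struct.calcsize(bm.itemtype).
    itemsize = getattr(bm, 'itemsize', None) or struct.calcsize(bm.itemtype)
    if getattr(bm, 'offset', 0) != 0:
        return False
    strides = getattr(bm, 'strides', None)
    if strides is None:
        return True
    return tuple(strides) == _c_strides(bm.shape, itemsize)

def isfortran(bm):
    # Fortran order: first axis varies fastest.
    itemsize = getattr(bm, 'itemsize', None) or struct.calcsize(bm.itemtype)
    strides = getattr(bm, 'strides', None)
    if strides is None:
        return len(bm.shape) <= 1  # 0-d/1-d arrays are trivially both
    expected = tuple(reversed(_c_strides(tuple(reversed(bm.shape)), itemsize)))
    return tuple(strides) == expected
```

A library like wxPython could call iscontiguous(bm) up front and raise an exception for anything fancier.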
It also doesn't prohibit other attributes from being added. Just because an N-Dimensional array described its itemtype using the struct module typecodes doesn't mean that it couldn't implement more sophisticated typing hierarchies with a different attribute. There are a few commonly used types, like "long double", which are not supported by the struct module, but this could be addressed with a little discussion. Also, you might want a "bit" or "Object" typecode for tightly packed mask arrays and Object arrays.

The names could be argued about, and something like:

    bm.__array_buffer__
    bm.__array_shape__
    bm.__array_itemtype__
    bm.__array_offset__
    bm.__array_strides__
    bm.__array_itemsize__
    bm.__array_complicated__

would really bring home the notion that the attributes are a description of what it means to participate in an N-Dimensional array protocol. Plus, names this long and ugly are unlikely to step on the existing attributes already in use by Numeric3 and Numarray. :-)

Anyway, I proposed this a long time ago, but the belief was that one of the standard array packages would make it into the core very soon. With a standard array library in the core, there wouldn't be as much need for general interoperability like this; everyone could just use the standard. Maybe that position would change now that Numeric3 and Numarray both look to have long futures. Even if one package made it in, the other is likely to live on. I personally think the competition is a good thing, and we don't need to have only one array package to get interoperability.

I would definitely like to see the Python core acquire a full-fledged array package like Numeric3 or Numarray. When I log onto a new Linux or MacOS machine, the array package would just be there: no installs, no hassle. But I still think a simple, community-agreed-upon set of attributes like this would be a good idea.

--- Peter Verveer <verveer@embl.de> wrote:
I think it would be a real shame not to support non-contiguous data. It would be great if such a byte object could be used instead of Numeric/numarray arrays when writing extensions. Then I could write C extensions that could be made available very easily/efficiently to any package supporting it, without having to worry about the specific C API of those packages. If only contiguous byte objects are supported, that byte object is not a good option anymore for implementing extensions for Numeric, unless I am prepared to live with a lot of copying of non-contiguous arrays.
I'm hoping I made a good case for a slightly different strategy above. But even if the metadata did go into the bytes object itself, the metadata could describe a non-contiguous layout on top of the contiguous chunk of memory.

There is another really valid argument for using the strategy above to describe metadata instead of wedging it into the bytes object: the Numeric community could agree on the metadata attributes and start using them *today*. If you wait until someone commits the bytes object into the core, it won't be generally available until Python version 2.5 at the earliest, and any libraries that depended on bytes-stored metadata would not work with older versions of Python.

Cheers,
-Scott