
Scott Gilbert wrote:
Adding metadata at the buffer object level causes problems for "view" semantics. Let's say that everyone agreed what "itemsize" and "itemtype" meant:
real_view = complex_array.real
The real_view will have to use a new buffer since they can't share the old one. The buffer used in complex_array would have a typecode like ComplexDouble and an itemsize of 16. The buffer in real_view would need a typecode of Double and an itemsize of 8. If metadata is stored with the buffer object, it can't be the same buffer object in both places.
This is where having "strides" metadata becomes very useful. Then, real_view would not have to be a copy at all, unless the coder didn't want to deal with it.
Another case would be treating a 512x512 image of 4 byte pixels as a 512x512x4 image of 1 byte RGBA elements. Or even coercing from Signed to Unsigned.
Why not? A different bytes object could point to the same memory, but the different metadata would say "treat this data differently".
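To make the strides idea concrete, here is a rough sketch in plain Python (the attribute names and the 'Zd' complex typecode are only illustrative, along the lines of the convention discussed further down):

import struct

# One contiguous block of memory: 4 complex doubles stored as
# (real, imag) pairs, 16 bytes per element.
raw = bytearray(struct.pack('8d', 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5))

class View: pass                  # hypothetical metadata holder

complex_array = View()
complex_array.buffer = raw        # the shared memory
complex_array.shape = (4,)
complex_array.itemtype = 'Zd'     # illustrative code: complex double, 16 bytes
complex_array.strides = (16,)

real_view = View()
real_view.buffer = raw            # same memory, no copy
real_view.shape = (4,)
real_view.itemtype = 'd'          # double, 8 bytes
real_view.strides = (16,)         # step over the imaginary part of each element

# Reading the real parts through the view's metadata:
print([struct.unpack_from('d', real_view.buffer, i * 16)[0]
       for i in range(real_view.shape[0])])    # [1.0, 2.0, 3.0, 4.0]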
The bytes object as proposed does allow new views to be created from other bytes objects (sharing the same memory underneath), and these views could each have separate metadata, but then you wouldn't be able to have arrays that used other types of buffers.
I don't see why not. Your argument is not clear to me.
The bytes object shouldn't create views from arbitrary other buffer objects because it can't rely on the general semantics of the PyBufferProcs interface. The foreign buffer object might realloc and invalidate the pointer for instance... The current Python "buffer" builtin does this, and the results are bad. So creating a bytes object as a view on the mmap object doesn't work in the general case.
This is a problem with the objects that expose the buffer interface. The C-API could be clearer that you should not "reallocate" memory if another array is referencing you. See the arrayobject's resize method for an example of how Numeric does not allow reallocation of the memory space if another object is referencing it. I suppose you could keep track separately in the object of when another object is using your memory, but the REFCOUNT works for this also (though it is not as specific, so you would miss cases where you "could" safely reallocate; resizing is rarely used in the arrayobject anyway). Another idea is to fix the bytes object so that it always re-grabs the pointer to memory from the object instead of relying on a held pointer in view situations.
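As a rough illustration of the "keep track separately" idea (a sketch with invented names, not the actual arrayobject code), an exporter could simply refuse to reallocate while any view it has handed out is still alive:

class OwnedBuffer:
    # Hypothetical exporter that tracks how many views share its memory.
    def __init__(self, nbytes):
        self._data = bytearray(nbytes)
        self._exports = 0                 # live views onto self._data

    def view(self):
        self._exports += 1
        return memoryview(self._data)     # stands in for any view object

    def release_view(self):
        self._exports -= 1

    def resize(self, nbytes):
        # The restriction described above: no reallocation while someone
        # else may still be holding a pointer into our memory.
        if self._exports > 0:
            raise ValueError("cannot resize: memory is shared with a view")
        n = min(len(self._data), nbytes)
        new = bytearray(nbytes)
        new[:n] = self._data[:n]
        self._data = new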
Still, I think keeping the metadata at a different level, and having the bytes object just be the Python way to spell a call to C's malloc will avoid a lot of problems. Read below for how I think the metadata stuff could be handled.
Metadata is such a light-weight, "interface-based" solution. It could be as simple as attributes on the bytes object. I don't see why you resist it so much. Imagine defining a jpeg file by a single bytes object with a simple EXIF header metadata string. If the bytes object allowed the "bearmin" attributes you are describing, then that would be one way to describe an array that any third-party application could support as much as it wanted. In short, I think we are thinking along similar lines. It really comes down to being accepted by everybody as a standard.

One of the things I want for Numeric3 is to be able to create an array from anything that exports the buffer interface. The problem, of course, is with badly-written extension modules that rudely reallocate their memory even after they've shared it with someone else. Yes, Python could be improved so that this was handled better, but it does work right now, as long as buffer interface exporters play nice. This is the way to advertise the buffer interface (and buffer object). Rather than vague references to buffer objects being a "bad design" and a blight, we should say: objects wanting to export the buffer interface currently have restrictions on their ability to reallocate their buffers.
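As a sketch of what "attributes on the bytes object" could look like (an ordinary subclass here stands in for the proposed bytes object, and all names are illustrative):

class MetaBytes(bytes):
    # Hypothetical stand-in for a bytes object that carries metadata;
    # unlike the plain type, a subclass can be given attributes.
    pass

data = MetaBytes(b'\x00' * (512 * 512 * 4))
data.shape = (512, 512)
data.itemtype = 'I'          # unsigned int pixels (struct-style typecode, typically 4 bytes)
data.metadata = {'format': 'raw RGBA', 'header': 'EXIF-like text would go here'}

# A third-party library only needs to look for the agreed-upon attributes:
print(data.shape, data.itemtype, len(data))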
I think being able to traffic in N-Dimensional arrays without requiring linking against the libraries is a good thing.
Several of us are just catching on to the idea. Thanks for your patience.
I think the proposal is still relevant today, but I might revise it a bit as follows. A bear minimum N-Dimensional array for interchanging data across libraries could get by with the following attributes:
# Create a simple record type for storing attributes
class BearMin: pass
bm = BearMin()

# Set the attributes sufficient to describe a simple ndarray
bm.buffer = <a buffer or sequence object>
bm.shape = <a tuple of ints describing its shape>
bm.itemtype = <a string describing the elements>
The bm.buffer and bm.shape attributes are pretty obvious. I would suggest that the bm.itemtype borrow its typecodes from the Python struct module, but anything that everyone agreed on would work.
I've actually tried to do this, if you'll notice, and I'm sure I'll take some heat for that decision at some point too. The only differences currently, I think, are the long types (q and Q); I could easily be persuaded to change these typecodes too. I agree that the typecode characters are very simple and useful for interchanging information about type. That is a big reason why I am not "abandoning" them.
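For instance, a consumer can recover the element size from a struct-style typecode without any agreement beyond the typecode itself (a quick illustration):

import struct

for code in ('b', 'h', 'i', 'q', 'f', 'd'):
    print(code, struct.calcsize(code))
# On a typical platform: b 1, h 2, i 4, q 8, f 4, d 8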
Those attributes are sufficient for someone to *produce* an N-Dimensional array that could be understood by many libraries. Someone who *consumes* the data would need to know a few more:
bm.offset = <an integer offset into the buffer>
I don't like this offset parameter. Why doesn't the buffer just start where it needs to?
bm.strides = <a tuple of ints for non-contiguous or Fortran arrays>
Things are moving in this direction (notice that Numeric3 has attributes much like you describe), except we use the name .data (instead of .buffer). It would be an easy thing to return an ArrayObject from an object that exposes those attributes (and a good idea). So I pretty much agree with what you are saying. I just don't see how this is at odds with attaching metadata to a bytes object. We could start supporting this convention today, and also handle bytes objects with metadata in the future.
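To show how little a consumer actually needs, here is a rough sketch (a hypothetical helper; it assumes the attribute names above and falls back to contiguous C-order defaults when offset and strides are absent):

import struct

def element_at(bm, index):
    # Fetch one element from any object exposing the attributes above.
    itemsize = struct.calcsize(bm.itemtype)
    offset = getattr(bm, 'offset', 0)
    strides = getattr(bm, 'strides', None)
    if strides is None:
        # Derive contiguous C-order strides from the shape.
        strides, step = [], itemsize
        for dim in reversed(bm.shape):
            strides.insert(0, step)
            step *= dim
    pos = offset + sum(i * s for i, s in zip(index, strides))
    return struct.unpack_from(bm.itemtype, bm.buffer, pos)[0]

# Tiny producer to try it with (a contiguous 2x3 array of doubles):
class BM: pass
bm = BM()
bm.buffer = struct.pack('6d', *range(6))
bm.shape = (2, 3)
bm.itemtype = 'd'
print(element_at(bm, (1, 2)))          # -> 5.0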
There is another really valid argument for using the strategy above to describe metadata instead of wedging it into the bytes object: The Numeric community could agree on the metadata attributes and start using it *today*.
Yes, but this does not mean we should not encourage the addition of metadata to bytes objects (as this has larger uses than just Numeric arrays). It is not a difficult thing to support both concepts.
If you wait until someone commits the bytes object into the core, it won't be generally available until Python version 2.5 at the earliest, and any libraries that depended on using bytes stored metadata would not work with older versions of Python.
I think we should just start advertising now that, with the new methods of numarray and Numeric3, extension writers can deal with Numeric arrays (and anything else that exposes the same interface) very easily, right now, by using attribute access (or the buffer protocol together with attribute access). They can do this because Numeric arrays (and, I suspect, numarrays as well) use the buffer interface responsibly (we could start a political campaign encouraging responsible buffer usage everywhere :-) ).

-Travis