Hi Travis. I'm quite possibly misunderstanding how you want to incorporate the metadata into the bytes object, so I'm going to try to restate both of our positions from the point of view of a third party who will be using ndarrays. Let's take Chris Barker's point of view with regards to wxPython...

We all roughly agree on which pieces of metadata are needed for arrays. There are a few persnicketies, and the names could vary. I'll use your given names:

    .data      (could be .buffer or .__array_buffer__)
    .shape     (could be .dimensions or .__array_shape__)
    .strides   (maybe .__array_strides__)
    .itemtype  (could be .typecode or .__array_itemtype__)

Several other attributes can be derived (calculated) from those (isfortran, iscontiguous, etc...), and we might need a few more, but we'll ignore those for now.

In my proposal, Chris would write a routine like such:

    def version_one(a):
        data = a.data
        shape = a.shape
        strides = a.strides
        itemtype = a.itemtype
        # Cool code goes here

I believe you are suggesting Chris would write:

    def version_two(a):
        data = a
        shape = a.shape
        strides = a.strides
        itemtype = a.itemtype
        # Cool code goes here

Or if you have the .meta dictionary, Chris would write:

    def version_three(a):
        data = a
        shape = a.meta["shape"]
        strides = a.meta["strides"]
        itemtype = a.meta["itemtype"]
        # Cool code goes here

Of course Chris could save one line of code with:

    def version_two_point_one(data):
        shape = data.shape
        strides = data.strides
        itemtype = data.itemtype
        # Cool code goes here

If I'm mistaken about your proposal, please let me know. However, if I'm not mistaken, I think there are limitations with version_two and version_three. First, most of the existing buffer objects do not allow attributes to be added to them. With version_one, Chris could have data of type array.array, Numarray.memory, mmap.mmap, __builtins__.str, the new __builtins__.bytes type, as well as any other PyBufferProcs supporting object (and possibly sequence objects like __builtins__.list).
With version_two and version_three, something more is required. In a few cases like the __builtins__.str type you could add the necessary attributes by inheritance. In other cases like mmap.mmap, you could wrap it with a __builtins__.bytes object. (That's assuming that __builtins__.bytes knows how to wrap mmap.mmap objects...) However, other PyBufferProcs objects like array.array will never allow themselves to be wrapped by a __builtins__.bytes since they realloc their memory and violate the promises that the __builtins__.bytes object makes. I think you disagree with me on this part, so more on that later in this message.

For now I'll take your side: let's pretend that all PyBufferProcs supporting objects could be made well enough behaved to wrap up in a __builtins__.bytes object. Do you really want to require that only __builtins__.bytes objects are suitable for data interchange across libraries? This isn't explicitly stated by you, but since the __builtins__.bytes object is the only common PyBufferProcs supporting object that could define the metadata attributes, it would be the rule in practice. I think you're losing flexibility if you do it this way. From Chris's point of view it's basically the same amount of code for all three versions above.

Another consideration that might sway you is that the existing N-Dimensional array packages could easily add attribute methods to implement the interface, and they could do this without changing any part of their implementation. The .data attribute when requested would call a "get method" that returns a buffer. This allows user defined objects which do not implement the PyBufferProcs protocol themselves, but which contain a buffer inside of them, to participate in the "ndarray protocol". Both version_two and version_three do not allow this - the object being passed must *be* a buffer.
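To make that last point concrete, here is a minimal sketch (the class and helper names are hypothetical, not from any existing package) of a non-buffer object that participates in the version_one convention simply by exposing the four attributes, with an array.array inside as the actual buffer:

```python
import array

# Hypothetical provider of the version_one interface: a plain object
# that is NOT itself a buffer, but exposes .data/.shape/.strides/.itemtype.
class SimpleArray:
    def __init__(self, shape, itemtype):
        count = 1
        for dim in shape:
            count *= dim
        self.data = array.array(itemtype, [0] * count)  # the real buffer
        self.shape = shape
        self.itemtype = itemtype
        # C-contiguous strides, in bytes
        strides = []
        step = self.data.itemsize
        for dim in reversed(shape):
            strides.append(step)
            step *= dim
        self.strides = tuple(reversed(strides))

def version_one(a):
    # A consumer needs nothing more than attribute access.
    return a.data, a.shape, a.strides, a.itemtype
```

Any object with those four attributes works here, buffer or not.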
The bytes object shouldn't create views from arbitrary other buffer objects because it can't rely on the general semantics of the PyBufferProcs interface. The foreign buffer object might realloc and invalidate the pointer for instance... The current Python "buffer" builtin does this, and the results are bad. So creating a bytes object as a view on the mmap object doesn't work in the general case.
This is a problem with the objects that expose the buffer interface. The C-API could be more clear that you should not "reallocate" memory if another array is referencing you. See the arrayobject's resize method for an example of how Numeric does not allow reallocation of the memory space if another object is referencing it. I suppose you could keep track separately in the object of when another object is using your memory, but the REFCOUNT works for this too (though it is not as specific, so you would miss cases where you "could" reallocate; but that is rarely needed in arrayobjects anyway).
The reference count on the PyObject pointer is different than the number of users using the memory. In Python you could have:

    import array
    a = array.array('d', [1])
    b = a

The reference count on the array.array object is 2, but there are 0 users working with the memory. Given the existing semantics of the array.array object, it really should be allowed to resize in this case. Storing the object in a dictionary would be another common situation that would increase its refcount but shouldn't lock down the memory.

A good solution to this problem was presented in PEP-298, but no progress seems to have been made on it:

    http://www.python.org/peps/pep-0298.html

To my memory, PEP-298 was in response to PEP-296. I proposed PEP-296 to create a good working buffer (bytes) object that avoided the problems of the other buffer objects. Several folks wanted to fix the other (non-bytes) objects where possible, and PEP-298 was the result. A strategy like this could be used to make array.array safe outside of the GIL. Bummer that it didn't get implemented.
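The mismatch can be seen directly with sys.getrefcount (a small sketch; note that getrefcount counts its own argument, so two Python-level references report as 3 in CPython):

```python
import array
import sys

a = array.array('d', [1.0])
b = a                      # a second reference, but zero memory users

# getrefcount includes its own argument, so this reports 3 in CPython
refs = sys.getrefcount(a)

# With no one holding a pointer into the buffer, resizing is harmless:
a.append(2.0)              # may realloc and move the underlying memory
```

The refcount says "in use", yet nothing would be invalidated by the realloc, which is exactly why refcounts alone are too blunt an instrument for tracking memory users.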
Another idea is to fix the bytes object so it always regrabs the pointer to memory from the object instead of relying on the held pointer in view situations.
A while back, I submitted a patch [552438] like this to fix the __builtins__.buffer object:

    http://sourceforge.net/tracker/index.php?func=detail&aid=552438&group_id=5470&atid=305470

It was ignored for a bit, and during the quiet time I came to realize that even if the __builtins__.buffer object was fixed, it still wouldn't meet my needs. So I proposed the bytes object, and this patch fell on the floor (the __builtins__.buffer object is still broken).

The downside to this approach is that it only solves the problem for code running with possession of the GIL. It does solve the stale pointer problem that is exposed by the __builtins__.buffer object, but if you release the GIL in C code, all bets are off - the pointer can become stale again. The promises that bytes tries to make about the lifetime of the pointer can only be guaranteed by the object itself. Just because bytes could wrap the other object and grab the latest pointer when you need it doesn't mean that the other object won't invalidate the pointer a split second later when the GIL is released. It is mere chance that the mmap object is well behaved enough. And even the mmap object can release its memory if someone closes the object - again leading to a stale pointer.
Metadata is such a light-weight "interface-based" solution. It could be as simple as attributes on the bytes object. I don't see why you resist it so much. Imagine defining a jpeg file by a single bytes object with a simple EXIF header metadata string. If the bytes object allowed the "bearmin" attributes you are describing, then that would be one way to describe an array that any third-party application could support as much as they wanted.
Please don't think I'm offering you resistance. I'm only trying to point out some things that I think you might have overlooked. Lots of people ignore my suggestions all the time. You'd be in good company if you did too, and I wouldn't even hold a grudge against you.

Now let me be argumentative. :-)

I've listed what I consider the disadvantages above, but I guess I don't see any advantages of putting the metadata on the bytes object. In what way is:

    jpeg = bytes(<some data source>)
    jpeg.exif = <EXIF header metadata string>

better than:

    class record: pass

    jpeg = record()
    jpeg.data = <some data source, possibly bytes or something else>
    jpeg.exif = <EXIF metadata string>

The only advantage I see is that yours is a little shorter, but in any real application, you were probably going to define an object of some sort to add all the methods needed anyway. And as I showed in version_one, version_two, and version_three above, it's basically the same number of lines for the consumer of the data.

There is nothing stopping a PyBufferProcs object like bytes from supporting version_one above:

    jpeg = bytes(<some data source>)
    jpeg.data = jpeg
    jpeg.exif = <EXIF header metadata string>

But non-PyBufferProcs objects can't play with version_two or version_three.

Incidentally, being able to add attributes to bytes means that it needs to play nicely with the garbage collection system. At that point, bytes is basically a container for arbitrary Python objects. That's an additional implementation headache.
It really comes down to being accepted by everybody as a standard.
This I completely agree with. I think the community will roll with whatever you and Perry come to agree on. Even the array.array object in the core could be made to work either way. If the decision you come up with makes it easy to add the interface to existing array objects, then everyone would probably adopt it and it would become a standard.

This is the main reason I like the double underscore __*meta*__ names. It matches the similar pattern all over Python, and existing array packages could add those without interfering with their existing implementation:

    class Numarray:
        #
        # lots of array implementing code
        #

        # Down here at the end, add the "well-known" interface
        # (I haven't embraced the @property decorator syntax yet.)

        def __get_shape(self):
            return self._shape
        __array_shape__ = property(__get_shape)

        def __get_data(self):
            # Note that they use a different name internally
            return self._buffer
        __array_data__ = property(__get_data)

        def __get_itemtype(self):
            # Perform an on-the-fly conversion from the class
            # hierarchy type to the struct module typecode that
            # closest matches
            return self._type._to_typecode()
        __array_itemtype__ = property(__get_itemtype)

Changing class Numarray to a PyBufferProcs supporting object would be harder. The C version for Numeric3 arrays would be similar, and there is no wasted space on a per-instance basis in either case.
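For what it's worth, the same interface spelled with the @property decorator syntax would look like this (a sketch with a hypothetical minimal class standing in for the real array internals):

```python
# Equivalent spelling of the "well-known" interface using @property.
# The class body is a hypothetical stand-in, not real Numarray code.
class MiniArray:
    def __init__(self, shape, buffer, typecode):
        self._shape = shape
        self._buffer = buffer
        self._typecode = typecode

    @property
    def __array_shape__(self):
        return self._shape

    @property
    def __array_data__(self):
        return self._buffer

    @property
    def __array_itemtype__(self):
        return self._typecode
```

Names with both leading and trailing double underscores are not name-mangled, so consumers can read these attributes directly.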
One of the things I want for Numeric3 is to be able to create an array from anything that exports the buffer interface. The problem, of course, is with badly-written extension modules that rudely reallocate their memory even after they've shared it with someone else. Yes, Python could be improved so that this is handled better, but it does work right now, as long as buffer interface exporters play nice.
I think the behavior of the array.array objects is pretty defensible. It is useful that you can extend those arrays to new sizes. For all I know, it was written that way before there was a GIL. I think PEP-298 is a good way to make the dynamic buffers more GIL friendly.
This is the way to advertise the buffer interface (and buffer object). Rather than vague references to buffer objects being a "bad-design" and a blight we should say: objects wanting to export the buffer interface currently have restrictions on their ability to reallocate their buffers.
I agree. The "bad-design" type of comments about "the buffer problem" on python-dev have always annoyed me. It's not that hard of a problem to solve technically.
I would suggest that the bm.itemtype borrow its typecodes from the Python struct module, but anything that everyone agreed on would work.
I've actually tried to do this, if you'll notice, and I'm sure I'll take some heat for that decision at some point too. The only differences currently, I think, are the long types (q and Q); I could be easily persuaded to change these typecodes too. I agree that the typecode characters are very simple and useful for interchanging information about type. That is a big reason why I am not "abandoning" them.
The real advantage to the struct module typecodes comes in two forms. First and most important is that they're already documented and in place - a de facto standard. Second is that Python script code could use those typecodes directly with the struct module to pull apart pieces of data. The disadvantage is that a few new typecodes would be needed... I would even go as far as to recommend their '>' and '<' prefix codes for big-endian and little-endian for just this reason...
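As a small illustration of that second point, script code could unpack raw array data using exactly the advertised typecode and endian prefix:

```python
import struct

# Suppose an array advertises itemtype 'd' and little-endian layout ('<').
# The very same codes drive the struct module directly:
raw = struct.pack('<3d', 1.0, 2.0, 3.0)   # simulate a little-endian buffer
values = struct.unpack('<3d', raw)         # pull the items back apart

# calcsize confirms the item layout: 3 doubles of 8 bytes each
size = struct.calcsize('<3d')
```

No translation table between the array interface and the struct module would be needed; the typecode string is usable as-is.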
I don't like this offset parameter. Why doesn't the buffer just start where it needs to?
Well, if you stick with using the bytes object, you could probably get away with this. Effectively, the offset is encoded in the bytes object. At this point, I don't know if anything I said above was persuasive, but I think there are other cases where you would really want this. Does anyone plan to support tightly packed (8 bits to a byte) bitmask arrays? Object arrays could be implemented on top of shared __builtins__.list objects, and there is no easy way to create offset views into lists.
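Here is a rough sketch of why an explicit offset is useful: a view descriptor (the dict and field names below are hypothetical) can start partway into a shared buffer without copying anything:

```python
import array

# A shared buffer, and a hypothetical view descriptor that carries an
# explicit byte offset into it.
buf = array.array('d', range(10))

view = {
    'data': buf,
    'offset': 3 * buf.itemsize,   # view starts at element 3
    'shape': (4,),
    'strides': (buf.itemsize,),
}

def view_item(view, i):
    # Map a view index to a byte offset, then back to a buffer index.
    byte = view['offset'] + i * view['strides'][0]
    return view['data'][byte // view['data'].itemsize]

items = [view_item(view, i) for i in range(view['shape'][0])]
# items is a zero-copy window onto elements 3..6 of buf
```

Without the offset field, the only way to express this view would be to construct a new buffer object that itself starts at the right address, which not every data source can do.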
It would be an easy thing to return an ArrayObject from an object that exposes those attributes (and a good idea).
This would be wonderful. Third party libraries could produce data that is sufficiently ndarray like without hassle, and users of that library could promote it to a Numeric3 array with no headaches.
So, I pretty much agree with what you are saying. I just don't see how this is at odds with attaching metadata to a bytes object.
We could start supporting this convention today, and also handle bytes objects with metadata in the future.
Unfortunately, I don't think any buffer objects exist today which have the ability to dynamically add attributes. If my arguments above are unpersuasive, I believe bytes (once it is written) will be the only buffer object with this support. By the way, it looks like the "bytes" concept has been revisited recently. There is a new PEP dated Aug 11, 2004:

    http://www.python.org/peps/pep-0332.html
There is another really valid argument for using the strategy above to describe metadata instead of wedging it into the bytes object: The Numeric community could agree on the metadata attributes and start using it *today*.
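A consumer written today would need nothing from the core to follow that convention (a sketch using the double-underscore names proposed above; the function and class names are hypothetical):

```python
# Hypothetical consumer: accepts any object exposing the agreed
# double-underscore attributes, whatever its actual type.
def array_parts(obj):
    return (obj.__array_data__,
            obj.__array_shape__,
            obj.__array_itemtype__)

# Any third-party library could participate with plain class attributes,
# no changes to its internals and no new core types required:
class ThirdPartyImage:
    __array_shape__ = (2, 2)
    __array_itemtype__ = 'B'          # unsigned bytes
    __array_data__ = b'\x00' * 4

data, shape, itemtype = array_parts(ThirdPartyImage())
```

The protocol is pure convention: agreement on attribute names is the entire specification.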
I think we should just start advertising now that, with the new methods of numarray and Numeric3, extension writers can right now deal with Numeric arrays (and anything else that exposes the same interface) very easily by using attribute access (or the buffer protocol together with attribute access). They can do this because Numeric arrays (and I suspect numarrays as well) use the buffer interface responsibly (we could start a political campaign encouraging responsible buffer usage everywhere :-) ).
I can just imagine the horrible mascot that would be involved in the PR campaign. Thanks for your attention and patience with me on this. I really appreciate the work you are doing. I wish I could explain my understanding of things more clearly. Cheers, -Scott