Adding metadata at the buffer object level causes problems for "view" semantics. Let's say that everyone agreed what "itemsize" and "itemtype" meant:

    real_view = complex_array.real

The real_view will have to use a new buffer since they can't share the old one. The buffer used in complex_array would have a typecode like ComplexDouble and an itemsize of 16. The buffer in real_view would need a typecode of Double and an itemsize of 8. If metadata is stored with the buffer object, it can't be the same buffer object in both places.

Another case would be treating a 512x512 image of 4 byte pixels as a 512x512x4 image of 1 byte RGBA elements. Or even coercing from Signed to Unsigned.

The bytes object as proposed does allow new views to be created from other bytes objects (sharing the same memory underneath), and these views could each have separate metadata, but then you wouldn't be able to have arrays that used other types of buffers. Having arrays use mmap buffers is very useful.

The bytes object shouldn't create views from arbitrary other buffer objects because it can't rely on the general semantics of the PyBufferProcs interface. The foreign buffer object might realloc and invalidate the pointer for instance... The current Python "buffer" builtin does this, and the results are bad. So creating a bytes object as a view on the mmap object doesn't work in the general case.

Actually, now that I think about it, the mmap object might be safe. I don't believe the current implementation of mmap does any reallocing behind the scenes and I think the pointer stays valid for the lifetime of the object. If we verified that mmap is safe enough, bytes could make a special case out of it, but then you would be locked into bytes and mmap only. Maybe that's acceptable...

Still, I think keeping the metadata at a different level, and having the bytes object just be the Python way to spell a call to C's malloc will avoid a lot of problems.
Read below for how I think the metadata stuff could be handled.

--- Chris Barker <Chris.Barker@noaa.gov> wrote:
There are any number of third party extensions that could benefit from being able to directly read the data in Numeric* arrays: PIL, wxPython, etc. My personal example is wxPython:
At the moment, you can pass a Numeric or numarray array into wxPython, and it will be converted to a wxList of wxPoints (for instance), but that is done by using the generic sequence protocol, and a lot of type checking. As you can imagine, that is pretty darn slow, compared to just typecasting the data pointer and looping through it. Robin Dunn, quite reasonably, doesn't want wxPython to depend on Numeric, so that's what we've got.
My understanding of this memory object is that an extension like wxPython wouldn't need to know about Numeric, but could simply get the memory Object, and there would be enough meta-data with it to typecast and loop through the data. I'm a bit skeptical about how this would work. It seems that the metadata required would be the full set of stuff in an array Object already:

    type
    dimensions
    strides
This could be made a bit simpler by allowing only contiguous arrays, but then there would need to be a contiguous flag.
To make use of this, wxPython would have to know a fair bit about Numeric Arrays anyway, so that it can check to see if the data is appropriate. I guess the advantage is that while the wxPython code would have to know about Numeric arrays, it wouldn't have to include Numeric headers or code.
I think being able to traffic in N-Dimensional arrays without requiring linking against the libraries is a good thing.

Several years ago, I proposed a solution to this problem. Actually I did a really poor job of proposing it and irritated a lot of people in the process. I'm embarrassed to post a link to the following thread, but here it is anyway:

http://aspn.activestate.com/ASPN/Mail/Message/numpy-discussion/1166013

Accept my apologies if you read the whole thing just now. :-) Accept my sincere apologies if you read it at the time.

I think the proposal is still relevant today, but I might revise it a bit as follows. A bear minimum N-Dimensional array for interchanging data across libraries could get by with following attributes:

    # Create a simple record type for storing attributes
    class BearMin: pass
    bm = BearMin()

    # Set the attributes sufficient to describe a simple ndarray
    bm.buffer = <a buffer or sequence object>
    bm.shape = <a tuple of ints describing its shape>
    bm.itemtype = <a string describing the elements>

The bm.buffer and bm.shape attributes are pretty obvious. I would suggest that the bm.itemtype borrow its typecodes from the Python struct module, but anything that everyone agreed on would work. (The struct module is nice because it is already documented and supports native and portable types of many sizes in both endians. It also supports composite struct types.)

Those attributes are sufficient for someone to *produce* an N-Dimensional array that could be understood by many libraries. Someone who *consumes* the data would need to know a few more:

    bm.offset = <an integer offset into the buffer>
    bm.strides = <a tuple of ints for non-contiguous or Fortran arrays>

The value of bm.offset would default to zero if it wasn't present, and the tuple bm.strides could be generated from the shape assuming it was a C style array. Subscripting operations that returned non-contiguous views of shared data could change bm.offset to non-zero.
Subscripting would also affect the bm.strides, and creating a Fortran style array would require bm.strides to be present. You might also choose to add bm.itemsize in addition to bm.itemtype when you can describe how big elements are, but you can't sufficiently describe what the data is using the agreed upon typecodes. This would be uncommon. The default for bm.itemsize would come from struct.calcsize(bm.itemtype). You might also choose to add bm.complicated for when the array layout can't be described by the shape/offset/stride combination. For instance bm.complicated might get used when creating views from more sophisticated subscripting operations like index arrays or mask arrays. Although it looks like Numeric3 plans on making new contiguous copies in those cases. The C implementations of arrays would only have to add getattr like methods, and the data could be stored very compactly.
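The defaults described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names default_strides and fill_defaults are made up here, and strides are assumed to be counted in bytes, which the proposal does not pin down.

```python
import struct

class BearMin:
    """Simple record type, as in the proposal."""
    pass

def default_strides(shape, itemsize):
    """Byte strides of a C style (row-major) array with the given shape."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

def fill_defaults(bm):
    """Add the optional attributes a consumer needs when the producer left them out."""
    if not hasattr(bm, "itemsize"):
        bm.itemsize = struct.calcsize(bm.itemtype)
    if not hasattr(bm, "offset"):
        bm.offset = 0
    if not hasattr(bm, "strides"):
        bm.strides = default_strides(bm.shape, bm.itemsize)
    return bm

# A producer only sets the three required attributes.
bm = BearMin()
bm.buffer = bytes(2 * 3 * 8)
bm.shape = (2, 3)
bm.itemtype = "d"
fill_defaults(bm)
```

A producer that already sets bm.strides (a Fortran style array, say) is left alone, since each default is applied only when the attribute is missing.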
From those minimum 5-7 attributes (metadata), an N-Dimensional array consumer could determine most everything it needed to know about the data. Simple routines could determine things like iscontiguous(bm), iscarray(bm) or isfortran(bm). I expect libraries like wxPython or PIL could punt (raise an exception) when the water gets too deep.
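As a rough sketch of what such routines might look like, assuming byte strides and assuming the record already carries shape, strides and itemsize (the function bodies are illustrative, not taken from any array package):

```python
from types import SimpleNamespace

def _c_strides(shape, itemsize):
    """Byte strides of a C style (row-major) layout."""
    strides, step = [], itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

def _fortran_strides(shape, itemsize):
    """Byte strides of a Fortran style (column-major) layout."""
    strides, step = [], itemsize
    for extent in shape:
        strides.append(step)
        step *= extent
    return tuple(strides)

def iscarray(bm):
    return tuple(bm.strides) == _c_strides(bm.shape, bm.itemsize)

def isfortran(bm):
    return tuple(bm.strides) == _fortran_strides(bm.shape, bm.itemsize)

def iscontiguous(bm):
    return iscarray(bm) or isfortran(bm)

# Three 2x3 arrays of 8 byte items with different layouts.
c_style = SimpleNamespace(shape=(2, 3), strides=(24, 8), itemsize=8)
f_style = SimpleNamespace(shape=(2, 3), strides=(8, 16), itemsize=8)
sliced = SimpleNamespace(shape=(2, 3), strides=(48, 16), itemsize=8)
```

This strict comparison ignores the corner case of dimensions of length 1, whose stride does not matter. A library that only handles C style data could call iscarray(bm) and raise an exception otherwise.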
It also doesn't prohibit other attributes from being added. Just because an N-Dimensional array described its itemtype using the struct module typecodes doesn't mean that it couldn't implement more sophisticated typing hierarchies with a different attribute.

There are a few commonly used types like "long double" which are not supported by the struct module, but this could be addressed with a little discussion. Also you might want a "bit" or "Object" typecode for tightly packed mask arrays and Object arrays.

The names could be argued about, and something like:

    bm.__array_buffer__
    bm.__array_shape__
    bm.__array_itemtype__
    bm.__array_offset__
    bm.__array_strides__
    bm.__array_itemsize__
    bm.__array_complicated__

would really bring home the notion that the attributes are a description of what it means to participate in an N-Dimensional array protocol. Plus names this long and ugly are unlikely to step on the existing attributes already in use by Numeric3 and Numarray. :-)

Anyway, I proposed this a long time ago, but the belief was that one of the standard array packages would make it into the core very soon. With a standard array library in the core, there wouldn't be as much need for general interoperability like this. Everyone could just use the standard. Maybe that position would change now that Numeric3 and Numarray both look to have long futures. Even if one package made it in, the other is likely to live on. I personally think the competition is a good thing. We don't need to have only one array package to get interoperability.

I would definitely like to see the Python core acquire a full fledged array package like Numeric3 or Numarray. When I log onto a new Linux or MacOS machine, the array package would just be there. No installs, no hassle. But I still think a simple community agreed upon set of attributes like this would be a good idea.

--- Peter Verveer <verveer@embl.de> wrote:
I think it would be a real shame not to support non-contiguous data. It would be great if such a byte object could be used instead of Numeric/numarray arrays when writing extensions. Then I could write C extensions that could be made available very easily/efficiently to any package supporting it without having to worry about the specific C api of those packages. If only contiguous byte objects are supported that byte object is not a good option anymore for implementing extensions for Numeric unless I am prepared to live with a lot of copying of non-contiguous arrays.
I'm hoping I made a good case for a slightly different strategy above. But even if the metadata did go into the bytes object itself, the metadata could describe a non-contiguous layout on top of the contiguous chunk of memory.

There is another really valid argument for using the strategy above to describe metadata instead of wedging it into the bytes object: The Numeric community could agree on the metadata attributes and start using it *today*. If you wait until someone commits the bytes object into the core, it won't be generally available until Python version 2.5 at the earliest, and any libraries that depended on using bytes stored metadata would not work with older versions of Python.

Cheers,
    -Scott
Scott Gilbert wrote:
Adding metadata at the buffer object level causes problems for "view" semantics. Let's say that everyone agreed what "itemsize" and "itemtype" meant:
real_view = complex_array.real
The real_view will have to use a new buffer since they can't share the old one. The buffer used in complex_array would have a typecode like ComplexDouble and an itemsize of 16. The buffer in real_view would need a typecode of Double and an itemsize of 8. If metadata is stored with the buffer object, it can't be the same buffer object in both places.
This is where having "strides" metadata becomes very useful. Then, real_view would not have to be a copy at all, unless the coder didn't want to deal with it.
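To make the strides argument concrete, here is a small sketch using only the struct module. The helper read_strided and the choice of byte strides are assumptions for illustration. The real and imaginary parts of a complex array are read from one shared buffer, with no copy of the underlying bytes.

```python
import struct

# Two complex doubles, (1+2j) and (3+4j), stored as interleaved real/imag pairs.
buf = struct.pack("4d", 1.0, 2.0, 3.0, 4.0)

def read_strided(buffer, itemtype, offset, stride, count):
    """Read count items of itemtype, starting at byte offset, stride bytes apart."""
    return [struct.unpack_from(itemtype, buffer, offset + i * stride)[0]
            for i in range(count)]

# Same buffer, different metadata: itemsize 8 ("d") but a stride of 16 bytes.
real_view = read_strided(buf, "d", offset=0, stride=16, count=2)
imag_view = read_strided(buf, "d", offset=8, stride=16, count=2)
```

Note that the imaginary view needs either an offset of 8 or a buffer that starts 8 bytes in, which is the same question raised about bm.offset later in the thread.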
Another case would be treating a 512x512 image of 4 byte pixels as a 512x512x4 image of 1 byte RGBA elements. Or even coercing from Signed to Unsigned.
Why not? A different bytes object could point to the same memory but the different metadata would say "treat this data differently"
The bytes object as proposed does allow new views to be created from other bytes objects (sharing the same memory underneath), and these views could each have separate metadata, but then you wouldn't be able to have arrays that used other types of buffers.
I don't see why not. Your argument is not clear to me.
The bytes object shouldn't create views from arbitrary other buffer objects because it can't rely on the general semantics of the PyBufferProcs interface. The foreign buffer object might realloc and invalidate the pointer for instance... The current Python "buffer" builtin does this, and the results are bad. So creating a bytes object as a view on the mmap object doesn't work in the general case.
This is a problem with the objects that expose the buffer interface. The C-API could be more clear that you should not "reallocate" memory if another array is referencing you. See the arrayobject's resize method for an example of how Numeric does not allow reallocation of the memory space if another object is referencing it. I suppose you could keep track separately in the object of when another object is using your memory, but the REFCOUNT works for this also (though it is not so specific, and so you would miss cases where you "could" reallocate but this is rarely used in arrayobject's anyway). Another idea is to fix the bytes object so it always regrabs the pointer to memory from the object instead of relying on the held pointer in view situations.
Still, I think keeping the metadata at a different level, and having the bytes object just be the Python way to spell a call to C's malloc will avoid a lot of problems. Read below for how I think the metadata stuff could be handled.
Metadata is such a light-weight "interface-based" solution. It could be as simple as attributes on the bytes object. I don't see why you resist it so much. Imagine defining a jpeg file by a single bytes object with a simple EXIF header metadata string. If the bytes object allowed the "bearmin" attributes you are describing then that would be one way to describe an array that any third-party application could support as much as they wanted. In short, I think we are thinking along similar lines. It really comes down to being accepted by everybody as a standard. One of the things I want for Numeric3 is to be able to create an array from anything that exports the buffer interface. The problem, of course, is with badly-written extension modules that rudely reallocate their memory even after they've shared it with someone else. Yes, Python could be improved so that this were handled better, but it does work right now, as long as buffer interface exporters play nice. This is the way to advertise the buffer interface (and buffer object). Rather than vague references to buffer objects being a "bad-design" and a blight we should say: objects wanting to export the buffer interface currently have restrictions on their ability to reallocate their buffers.
I think being able to traffic in N-Dimensional arrays without requiring linking against the libraries is a good thing.
Several of us are just catching on to the idea. Thanks for your patience.
I think the proposal is still relevant today, but I might revise it a bit as follows. A bear minimum N-Dimensional array for interchanging data across libraries could get by with following attributes:
    # Create a simple record type for storing attributes
    class BearMin: pass
    bm = BearMin()

    # Set the attributes sufficient to describe a simple ndarray
    bm.buffer = <a buffer or sequence object>
    bm.shape = <a tuple of ints describing its shape>
    bm.itemtype = <a string describing the elements>

The bm.buffer and bm.shape attributes are pretty obvious. I would suggest that the bm.itemtype borrow its typecodes from the Python struct module, but anything that everyone agreed on would work.
I've actually tried to do this if you'll notice, and I'm sure I'll take some heat for that decision at some point too. The only difference currently, I think, is the long types (q and Q). I could easily be persuaded to change these typecodes too. I agree that the typecode characters are very simple and useful for interchanging information about type. That is a big reason why I am not "abandoning them".
Those attributes are sufficient for someone to *produce* an N-Dimensional array that could be understood by many libraries. Someone who *consumes* the data would need to know a few more:
bm.offset = <an integer offset into the buffer>
I don't like this offset parameter. Why doesn't the buffer just start where it needs to?
bm.strides = <a tuple of ints for non-contiguous or Fortran arrays>
Things are moving this direction (notice that Numeric3 has attributes much like you describe), except we use the word .data (instead of .buffer) It would be an easy thing to return an ArrayObject from an object that exposes those attributes (and a good idea). So, I pretty much agree with what you are saying. I just don't see how this is at odds with attaching metadata to a bytes object. We could start supporting this convention today, and also handle bytes objects with metadata in the future.
There is another really valid argument for using the strategy above to describe metadata instead of wedging it into the bytes object: The Numeric community could agree on the metadata attributes and start using it *today*.
Yes, but this does not mean we should not encourage the addition of metadata to bytes objects (as this has larger uses than just Numeric arrays). It is not a difficult thing to support both concepts.
If you wait until someone commits the bytes object into the core, it won't be generally available until Python version 2.5 at the earliest, and any libraries that depended on using bytes stored metadata would not work with older versions of Python.
I think we should just start advertising now, that with the new methods of numarray and Numeric3, extension writers can right now deal with Numeric arrays (and anything else that exposes the same interface) very easily by using attribute access (or the buffer protocol together with attribute access). They can do this because Numeric arrays (and I suspect numarrays as well) use the buffer interface responsibly (we could start a political campaign encouraging responsible buffer usage everywhere :-) ).

-Travis
Hi Travis.

I'm quite possibly misunderstanding how you want to incorporate the metadata into the bytes object, so I'm going to try and restate both of our positions from the point of view of a third party who will be using ndarrays. Let's take Chris Barker's point of view with regards to wxPython...

We all roughly agree which pieces of metadata are needed for arrays. There are a few persnicketies, and the names could vary. I'll use your given names:

    .data     (could be .buffer or .__array_buffer__)
    .shape    (could be .dimensions or .__array_shape__)
    .strides  (maybe .__array_strides__)
    .itemtype (could be .typecode or .__array_itemtype__)

Several other attributes can be derived (calculated) from those (isfortran, iscontiguous, etc...), and we might need a few more, but we'll ignore those for now.

In my proposal, Chris would write a routine like such:

    def version_one(a):
        data = a.data
        shape = a.shape
        strides = a.strides
        itemtype = a.itemtype
        # Cool code goes here

I believe you are suggesting Chris would write:

    def version_two(a):
        data = a
        shape = a.shape
        strides = a.strides
        itemtype = a.itemtype
        # Cool code goes here

Or if you have the .meta dictionary, Chris would write:

    def version_three(a):
        data = a
        shape = a.meta["shape"]
        strides = a.meta["strides"]
        itemtype = a.meta["itemtype"]
        # Cool code goes here

Of course Chris could save one line of code with:

    def version_two_point_one(data):
        shape = data.shape
        strides = data.strides
        itemtype = data.itemtype
        # Cool code goes here

If I'm mistaken about your proposal, please let me know. However if I'm not mistaken, I think there are limitations with version_two and version_three.

First, most of the existing buffer objects do not allow attributes to be added to them. With version_one, Chris could have data of type array.array, Numarray.memory, mmap.mmap, __builtins__.str, the new __builtins__.bytes type as well as any other PyBufferProcs supporting object (and possibly sequence objects like __builtins__.list).
With version_two and version_three, something more is required. In a few cases like the __builtins__.str type you could add the necessary attributes by inheritance. In other cases like the mmap.mmap, you could wrap it with a __builtins__.bytes object. (That's assuming that __builtins__.bytes knows how to wrap mmap.mmap objects...) However, other PyBufferProcs objects like array.array will never allow themselves to be wrapped by a __builtins__.bytes since they realloc their memory and violate the promises that the __builtins__.bytes object makes. I think you disagree with me on this part, so more on that later in this message. For now I'll take your side, let's pretend that all PyBufferProcs supporting objects could be made well enough behaved to wrap up in a __builtins__.bytes object. Do you really want to require that only __builtins__.bytes objects are suitable for data interchange across libraries? This isn't explicitly stated by you, but since the __builtins__.bytes object is the only common PyBufferProcs supporting object that could define the metadata attributes, it would be the rule in practice. I think you're losing flexibility if you do it this way. From Chris's point of view it's basically the same amount of code for all three versions above. Another consideration that might sway you is that the existing N-Dimensional array packages could easily add attribute methods to implement the interface, and they could do this without changing any part of their implementation. The .data attribute when requested would call a "get method" that returns a buffer. This allows user defined objects which do not implement the PyBufferProcs protocol themselves, but which contain a buffer inside of them to participate in the "ndarray protocol". Both version_two and version_three do not allow this - the object being passed must *be* a buffer.
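A small sketch of the version_one style in practice, using present-day Python rather than the 2005 interpreters discussed here. The Record class and the first_item consumer are hypothetical names. The point is that the consumer never cares what kind of buffer sits behind .data.

```python
import array
import struct

class Record:
    """Hypothetical stand-in for any object carrying the agreed attributes."""
    pass

def first_item(a):
    """version_one style consumer: attribute access only, no array library."""
    return struct.unpack_from(a.itemtype, a.data, 0)[0]

def make_record(data, shape, itemtype):
    r = Record()
    r.data, r.shape, r.itemtype = data, shape, itemtype
    return r

# Three different buffer types behind the same interface.
from_array = make_record(array.array("d", [1.5, 2.5]), (2,), "d")
from_bytes = make_record(struct.pack("2d", 1.5, 2.5), (2,), "d")
from_bytearray = make_record(bytearray(struct.pack("2d", 1.5, 2.5)), (2,), "d")

results = [first_item(r) for r in (from_array, from_bytes, from_bytearray)]
```

None of these buffer types accepts new attributes on its instances, which is why the version_two style would need a wrapper or a subclass for each of them.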
The bytes object shouldn't create views from arbitrary other buffer objects because it can't rely on the general semantics of the PyBufferProcs interface. The foreign buffer object might realloc and invalidate the pointer for instance... The current Python "buffer" builtin does this, and the results are bad. So creating a bytes object as a view on the mmap object doesn't work in the general case.
This is a problem with the objects that expose the buffer interface. The C-API could be more clear that you should not "reallocate" memory if another array is referencing you. See the arrayobject's resize method for an example of how Numeric does not allow reallocation of the memory space if another object is referencing it. I suppose you could keep track separately in the object of when another object is using your memory, but the REFCOUNT works for this also (though it is not so specific, and so you would miss cases where you "could" reallocate but this is rarely used in arrayobject's anyway).
The reference count on the PyObject pointer is different than the number of users using the memory. In Python you could have:

    import array
    a = array.array('d', [1])
    b = a

The reference count on the array.array object is 2, but there are 0 users working with the memory. Given the existing semantics of the array.array object, it really should be allowed to resize in this case. Storing the object in a dictionary would be another common situation that would increase its refcount but shouldn't lock down the memory.

A good solution to this problem was presented with PEP-298, but no progress seems to have been made on it.

http://www.python.org/peps/pep-0298.html

To my memory, PEP-298 was in response to PEP-296. I proposed PEP-296 to create a good working buffer (bytes) object that avoided the problems of the other buffer objects. Several folks wanted to fix the other (non bytes) objects where possible, and PEP-298 was the result. A strategy like this could be used to make array.array safe outside of the GIL. Bummer that it didn't get implemented.
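The distinction between references to the object and users of its memory can be demonstrated in present-day CPython, which counts buffer exports separately from the refcount. This postdates the thread and is shown only as an illustration of the idea, not as what PEP-298 specified.

```python
import array

a = array.array("d", [1.0])
b = a                      # a second reference, but nobody uses the memory
a.append(2.0)              # resizing is allowed: no buffer has been exported

view = memoryview(a)       # now there is an actual user of the memory
try:
    a.append(3.0)          # a resize could move the memory under the view
    resize_blocked = False
except BufferError:
    resize_blocked = True

view.release()             # the user is done with the memory
a.append(3.0)              # resizing works again
```

The extra name b never prevents a resize. Only the outstanding memoryview does, which is the behavior the paragraph above argues for.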
Another idea is to fix the bytes object so it always regrabs the pointer to memory from the object instead of relying on the held pointer in view situations.
A while back, I submitted a patch [552438] like this to fix the __builtins__.buffer object:

http://sourceforge.net/tracker/index.php?func=detail&aid=552438&group_id=5470&atid=305470

It was ignored for a bit, and during the quiet time I came to realize that even if the __builtins__.buffer object was fixed, it still wouldn't meet my needs. So I proposed the bytes object, and this patch fell on the floor (the __builtins__.buffer object is still broken).

The downside to this approach is that it only solves the problem for code running with possession of the GIL. It does solve the stale pointer problem that is exposed by the __builtins__.buffer object, but if you release the GIL in C code, all bets are off - the pointer can become stale again.

The promises that bytes tries to make about the lifetime of the pointer can only be guaranteed by the object itself. Just because bytes could wrap the other object and grab the latest pointer when you need it doesn't mean that the other object won't invalidate the pointer a split second later when the GIL is released. It is mere chance that the mmap object is well behaved enough. And even the mmap object can release its memory if someone closes the object - again leading to a stale pointer.
Metadata is such a light-weight "interface-based" solution. It could be as simple as attributes on the bytes object. I don't see why you resist it so much. Imagine defining a jpeg file by a single bytes object with a simple EXIF header metadata string. If the bytes object allowed the "bearmin" attributes you are describing then that would be one way to describe an array that any third-party application could support as much as they wanted.
Please don't think I'm offering you resistance. I'm only trying to point out some things that I think you might have overlooked. Lots of people ignore my suggestions all the time. You'd be in good company if you did too, and I wouldn't even hold a grudge against you.

Now let me be argumentative. :-)

I've listed what I consider the disadvantages above, but I guess I don't see any advantages of putting the metadata on the bytes object. In what way is:

    jpeg = bytes(<some data source>)
    jpeg.exif = <EXIF header metadata string>

better than:

    class record: pass
    jpeg = record()
    jpeg.data = <some data source, possibly bytes or something else>
    jpeg.exif = <EXIF metadata string>

The only advantage I see is that yours is a little shorter, but in any real application, you were probably going to define an object of some sort to add all the methods needed. And as I showed in version_one, version_two, and version_three above, it's basically the same number of lines for the consumer of the data.

There is nothing stopping a PyBufferProcs object like bytes from supporting version_one above:

    jpeg = bytes(<some data source>)
    jpeg.data = jpeg
    jpeg.exif = <EXIF header metadata string>

But non PyBufferProcs objects can't play with version_two or version_three.

Incidentally, being able to add attributes to bytes means that it needs to play nicely with the garbage collection system. At that point, bytes is basically a container for arbitrary Python objects. That's additional implementation headache.
It really comes down to being accepted by everybody as a standard.
This I completely agree with. I think the community will roll with whatever you and Perry come to agree on. Even the array.array object in the core could be made to work either way. If the decision you come up with makes it easy to add the interface to existing array objects then everyone would probably adopt it and it would become a standard.

This is the main reason I like the double underscore __*meta*__ names. It matches the similar pattern all over Python, and existing array packages could add those without interfering with their existing implementation:

    class Numarray:
        #
        # lots of array implementing code
        #

        # Down here at the end, add the "well-known" interface
        # (I haven't embraced the @property decorator syntax yet.)

        def __get_shape(self):
            return self._shape
        __array_shape__ = property(__get_shape)

        def __get_data(self):
            # Note that they use a different name internally
            return self._buffer
        __array_data__ = property(__get_data)

        def __get_itemtype(self):
            # Perform an on the fly conversion from the class
            # hierarchy type to the struct module typecode that
            # most closely matches
            return self._type._to_typecode()
        __array_itemtype__ = property(__get_itemtype)

Changing class Numarray to a PyBufferProcs supporting object would be harder.

The C version for Numeric3 arrays would be similar, and there is no wasted space on a per instance basis in either case.
One of the things I want for Numeric3 is to be able to create an array from anything that exports the buffer interface. The problem, of course, is with badly-written extension modules that rudely reallocate their memory even after they've shared it with someone else. Yes, Python could be improved so that this were handled better, but it does work right now, as long as buffer interface exporters play nice.
I think the behavior of the array.array objects is pretty defensible. It is useful that you can extend those arrays to new sizes. For all I know, it was written that way before there was a GIL. I think PEP-298 is a good way to make the dynamic buffers more GIL friendly.
This is the way to advertise the buffer interface (and buffer object). Rather than vague references to buffer objects being a "bad-design" and a blight we should say: objects wanting to export the buffer interface currently have restrictions on their ability to reallocate their buffers.
I agree. The "bad-design" type of comments about "the buffer problem" on python-dev have always annoyed me. It's not that hard of a problem to solve technically.
I would suggest that the bm.itemtype borrow its typecodes from the Python struct module, but anything that everyone agreed on would work.
I've actually tried to do this if you'll notice, and I'm sure I'll take some heat for that decision at some point too. The only difference currently, I think, is the long types (q and Q). I could easily be persuaded to change these typecodes too. I agree that the typecode characters are very simple and useful for interchanging information about type. That is a big reason why I am not "abandoning them".
The real advantage to the struct module typecodes comes in two forms. First and most important is that it's already documented and in place - a de facto standard. Second is that Python script code could use those typecodes directly with the struct module to pull apart pieces of data. The disadvantage is that a few new typecodes would be needed... I would even go as far as to recommend their '>' '<' prefix codes for big-endian and little-endian for just this reason...
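For instance, script code holding only a raw buffer and an itemtype string can pull the elements apart with nothing but the struct module. The items helper below is a made-up name for illustration, and the example shows why the byte-order prefix matters.

```python
import struct

# Two 32-bit integers written in big-endian byte order.
raw = struct.pack(">2i", 1, 256)

def items(buffer, itemtype):
    """Unpack every element of a flat buffer described by a struct typecode."""
    size = struct.calcsize(itemtype)
    return [struct.unpack_from(itemtype, buffer, pos)[0]
            for pos in range(0, len(buffer), size)]

as_big_endian = items(raw, ">i")      # the intended interpretation
as_little_endian = items(raw, "<i")   # same bytes, wrong byte order
```

Without the prefix the consumer would have to guess the byte order, and a wrong guess silently yields different numbers from the same bytes.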
I don't like this offset parameter. Why doesn't the buffer just start where it needs to?
Well if you stick with using the bytes object, you could probably get away with this. Effectively, the offset is encoded in the bytes object. At this point, I don't know if anything I said above was persuasive, but I think there are other cases where you would really want this. Does anyone plan to support tightly packed (8 bits to a byte) bitmask arrays? Object arrays could be implemented on top of shared __builtins__.list objects, and there is no easy way to create offset views into lists.
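A sketch of the list case, assuming (this is not settled anywhere in the thread) that offset and strides are counted in elements rather than bytes when the buffer is a sequence. The list_view helper is hypothetical.

```python
from types import SimpleNamespace

def list_view(bm):
    """Items of a 1-D view onto a shared list, using offset and strides."""
    start = getattr(bm, "offset", 0)          # implied 0 when absent
    (count,) = bm.shape
    (step,) = getattr(bm, "strides", (1,))    # implied contiguous when absent
    return [bm.buffer[start + i * step] for i in range(count)]

shared = ["a", "b", "c", "d", "e", "f"]

# Every second element starting at index 1. Slicing the list would copy it,
# so the only way to keep sharing is to record where the view starts.
view = SimpleNamespace(buffer=shared, shape=(2,), offset=1, strides=(2,))
before = list_view(view)

shared[3] = "X"            # a change to the shared list shows through the view
after = list_view(view)
```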
It would be an easy thing to return an ArrayObject from an object that exposes those attributes (and a good idea).
This would be wonderful. Third party libraries could produce data that is sufficiently ndarray like without hassle, and users of that library could promote it to a Numeric3 array with no headaches.
So, I pretty much agree with what you are saying. I just don't see how this is at odds with attaching metadata to a bytes object.
We could start supporting this convention today, and also handle bytes objects with metadata in the future.
Unfortunately, I don't think any buffer objects exist today which have the ability to dynamically add attributes. If my arguments above are unpersuasive, I believe bytes (once it is written) will be the only buffer object with this support. By the way, it looks like the "bytes" concept has been revisited recently. There is a new PEP dated Aug 11, 2004: http://www.python.org/peps/pep-0332.html
There is another really valid argument for using the strategy above to describe metadata instead of wedging it into the bytes object: The Numeric community could agree on the metadata attributes and start using it *today*.
I think we should just start advertising now, that with the new methods of numarray and Numeric3, extension writers can right now deal with Numeric arrays (and anything else that exposes the same interface) very easily by using attribute access (or the buffer protocol together with attribute access). They can do this because Numeric arrays (and I suspect numarrays as well) use the buffer interface responsibly (we could start a political campaign encouraging responsible buffer usage everywhere :-) ).
I can just imagine the horrible mascot that would be involved in the PR campaign. Thanks for your attention and patience with me on this. I really appreciate the work you are doing. I wish I could explain my understanding of things more clearly. Cheers, -Scott
Scott, Thank you for your detailed explanations. This is starting to make more sense to me. It is obvious that you understand what we are trying to do, and I pretty much agree with you in how you think it should be done. I think you do a great job of explaining things. I agree we should come up with a set of names for the interface to arrayobjects. I'm even convinced that offset should be an optional part of the interface (implied 0 if it's not there).
However, other PyBufferProcs objects like array.array will never allow themselves to be wrapped by a __builtins__.bytes since they realloc their memory and violate the promises that the __builtins__.bytes object makes. I think you disagree with me on this part, so more on that later in this message.
I think I agree with you: array.array shouldn't allow itself to be wrapped by a bytes object because it reallocates without tracking what it has shared.
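As a point of comparison, here is a small sketch of what "tracking what it has shared" can look like. It assumes a modern Python 3 interpreter, where memoryview and the newer buffer protocol exist; neither was available when this thread was written, so this illustrates the idea and is not a description of the 2005 array.array.

```python
import array

a = array.array("i", [1, 2, 3])
view = memoryview(a)        # export the array's memory to another user

try:
    a.append(4)             # growing the array could move its storage
    resized = True
except BufferError:
    resized = False         # refused while a view is still outstanding

view.release()              # the other user is done with the memory
a.append(4)                 # resizing is allowed again
```

The array counts its outstanding exports and refuses to reallocate while any exist, which is the promise a bytes-style wrapper would need from the object it wraps.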
Another consideration that might sway you is that the existing N-Dimensional array packages could easily add attribute methods to implement the interface, and they could do this without changing any part of their implementation. The .data attribute when requested would call a "get method" that returns a buffer. This allows user defined objects which do not implement the PyBufferProcs protocol themselves, but which contain a buffer inside of them to participate in the "ndarray protocol". Both version_two and version_three do not allow this - the object being passed must *be* a buffer.
I am not at all against the ndarray protocol you describe. In fact, I'm quite a fan. I think we should start doing it, now. I was just wondering if adding attributes to the bytes object was useful in any case. Your arguments have persuaded me that it is not worth the trouble. Underscore names are a good idea. We already have __array__ which is a protocol for returning an array object: Currently Numeric3 already implements this protocol minus name differences. So, let's come up with names. I'm happy with __array__XXXXX type names as it does dovetail nicely with the already established __array__ name which Numeric3 expects will return an actual array object. As I've already said, it would be easy to check for the more specialized attributes at object creation time to boot-strap an array from an arbitrary object. In addition, to what you state. Why not also have the protocol look at the object itself to expose the PyBufferProcs protocol if it doesn't expose a .__array__data method?
The reference count on the PyObject pointer is different than the number of users using the memory. In Python you could have:
Your examples explaining this are good, but I did realize this, that's why I stated that the check in arr.resize is overkill and will disallow situations that could actually work. Do you think the Numeric3 arrayobject should have a "memory pointer count" added to the PyArrayObject structure?
Please don't think I'm offering you resistance. I'm only trying to point out some things that I think you might have overlooked. Lots of people ignore my suggestions all the time. You'd be in good company if you did too, and I wouldn't even hold a grudge against you.
I very much appreciate the pointers. I had overlooked some things and I believe your suggestions are better.
    class Numarray:
        #
        # lots of array implementing code
        #

        # Down here at the end, add the "well-known" interface
        # (I haven't embraced the @property decorator syntax yet.)

        def __get_shape(self):
            return self._shape
        __array_shape__ = property(__get_shape)

        def __get_data(self):
            # Note that they use a different name internally
            return self._buffer
        __array_data__ = property(__get_data)

        def __get_itemtype(self):
            # Perform an on the fly conversion from the class
            # hierarchy type to the struct module typecode that
            # closest matches
            return self._type._to_typecode()
        __array_itemtype__ = property(__get_itemtype)
Changing class Numarray to a PyBufferProcs supporting object would be harder.
I think they just did this, though...
The C version for Numeric3 arrays would be similar, and there is no wasted space on a per instance basis in either case.
Doing this in C would be extremely easy: a simple binding of a name to an already available function (and disallowing any set attribute).
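To make the consumer side of the attribute interface concrete, here is a minimal sketch. The Producer class and the promote function are hypothetical names invented for illustration; the attribute names are the ones Scott uses in his example, and the sketch only handles 2-D, C-contiguous data whose typecode the struct module understands.

```python
import struct

class Producer(object):
    """Hypothetical third-party object: it holds a buffer but is not one."""
    def __init__(self, rows, cols):
        self._shape = (rows, cols)
        self._buffer = struct.pack("%di" % (rows * cols), *range(rows * cols))

    # The "well-known" read-only interface.
    __array_shape__ = property(lambda self: self._shape)
    __array_data__ = property(lambda self: self._buffer)
    __array_itemtype__ = property(lambda self: "i")   # struct typecode

def promote(obj):
    """Hypothetical consumer: build nested lists from the attributes alone."""
    rows, cols = obj.__array_shape__          # 2-D only in this sketch
    code = obj.__array_itemtype__
    count = rows * cols
    nbytes = count * struct.calcsize(code)
    data = memoryview(obj.__array_data__)[:nbytes]
    flat = struct.unpack("%d%s" % (count, code), data)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]

result = promote(Producer(2, 3))
```

The consumer never checks the type of the object it is given, only the three attributes, which is what lets an arbitrary library participate without subclassing anything.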
The real advantage to the struct module typecodes comes in two forms. First and most important is that it's already documented and in place - a de facto standard. Second is that Python script code could use those typecodes directly with the struct module to pull apart pieces of data. The disadvantage is that a few new typecodes would be needed...
I would even go as far as to recommend their '>' '<' prefix codes for big-endian and little-endian for just this reason...
Hmm.. an interesting idea. I don't know if I agree or not.
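For readers weighing the '>' / '<' suggestion, this sketch shows what the struct module already does with such a prefixed typecode. Only standard struct calls are used; the variable names are illustrative.

```python
import struct

raw = struct.pack(">2d", 1.5, -2.0)          # two big-endian doubles

itemtype = ">d"                              # byte-order prefix + typecode
itemsize = struct.calcsize(itemtype)         # size of one element in bytes
count = len(raw) // itemsize
values = struct.unpack(">%dd" % count, raw)  # decode with the right order

# Decoding the same bytes with the wrong byte order gives a different number.
wrong = struct.unpack("<d", raw[:itemsize])[0]
```

With the byte order carried in the type string, a script can pull the data apart correctly without any further metadata.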
This would be wonderful. Third party libraries could produce data that is sufficiently ndarray like without hassle, and users of that library could promote it to a Numeric3 array with no headaches.
By the way, it looks like the "bytes" concept has been revisited recently. there is a new PEP dated Aug 11, 2004:
Thanks for the pointer.
Thanks for your attention and patience with me on this. I really appreciate the work you are doing. I wish I could explain my understanding of things more clearly.
As I said before, you do a really good job of explaining. I'm pretty much on your side now :-) Let's go ahead and get some __array__XXXXX attribute names decided on. I'll put them in the Numeric3 code base (I could also put them in old Numeric and make a 24.0 release as well --- I need to do that because of a horrible bug in the new empty method: Numeric.empty(<shape>, 'O')). -Travis
Hi Travis, Scott, I've been following your discussions and I'm very happy that Travis has finally decided to go with adopting the bytes object in Numeric3. It's also very important that, from the discussions, you finally reached an almost complete agreement on how to support the __array__ protocol. I do think that this idea is both very simple and powerful. I do hope this will be a *major* step towards interchanging data between different applications and packages and, perhaps, this would make the final goal of including a specific ndarray object in the Python standard library almost pointless: it simply should not be necessary at all! On Monday 28 March 2005 11:30, Travis Oliphant wrote: [snip]
As I've already said, it would be easy to check for the more specialized attributes at object creation time to boot-strap an array from an arbitrary object. [snip] Let's go ahead and get some __array__XXXXX attribute names decided on. I'll put them in the Numeric3 code base (I could also put them in old Numeric and make a 24.0 release as well --- I need to do that because of a horrible bug in the new empty method: Numeric.empty(<shape>, 'O')).
Very nice! From what you stated above I deduce that you will be including a case in the Numeric.array constructor so that it can create a properly defined array if the sequence that is passed to it fulfils the __array__ protocol. In addition, if the numarray people would be willing to do the same thing, I envision a very easy (and very efficient) way to convert from/to Numeric to/from numarray (until Numeric3 is ready for production), something like:

    NumericArray = Numeric.array(numarrayArray)
    numarrayArray = numarray.array(NumericArray)

Internally, one should decide which is the optimum way to convert from one object to the other. Based on suggestions from Todd Miller on how to do this as efficiently as possible, I have come to the conclusion that the following conversions are the most efficient ones:

    In [69]:na = numarray.arange(100*1000,shape=(100,1000))
    In [70]:num = Numeric.arange(100*1000);num=num.resize((100,1000))
    In [72]:t1=time();num2=Numeric.fromstring(na._data, typecode=na.typecode());num2=num2.resize(na.shape);time()-t1
    Out[72]:0.0017759799957275391
    In [73]:t1=time();na2=numarray.fromstring(num.tostring(),type=num.typecode(),shape=num.shape);time()-t1
    Out[73]:0.0039050579071044922

Both ways, although very efficient, still copy the data area in the conversion process. In the future, when Numeric3 supports the bytes object, there will be no copy of memory at all for interchanging data with another package (i.e. numarray). Until then, the __array__ protocol may help to share data (well, at least contiguous data) efficiently between applications right now. A big thanks to Scott for suggesting and wholeheartedly defending the bytes object and to Travis for unrecklessly becoming a convert. We, the developers of extensions, will be grateful forever :-) Cheers, --
qo< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data ""
On Monday 28 March 2005 17:13, Francesc Altet wrote: [snip]
Based on suggestions from Todd Miller on how to do this as efficiently as possible, I have come to the conclusion that the following conversions are the most efficient ones:

    In [69]:na = numarray.arange(100*1000,shape=(100,1000))
    In [70]:num = Numeric.arange(100*1000);num=num.resize((100,1000))
    In [72]:t1=time();num2=Numeric.fromstring(na._data, typecode=na.typecode());num2=num2.resize(na.shape);time()-t1
    Out[72]:0.0017759799957275391
    In [73]:t1=time();na2=numarray.fromstring(num.tostring(),type=num.typecode(),shape=num.shape);time()-t1
    Out[73]:0.0039050579071044922
Er, sorry, there is in fact a more efficient way to convert from a Numeric object to a numarray object that doesn't require any data copy at all. This is:

    In [212]:num=Numeric.arange(100*1000, typecode="i");num=num.resize((100,1000))
    In [213]:num[0,:5]
    Out[213]:array([0, 1, 2, 3, 4],'i')
    In [214]:t1=time();na2=numarray.array(numarray.memory.writeable_buffer(num),type=num.typecode(),shape=num.shape);time()-t1
    Out[214]:0.0001010894775390625  # takes just 100 us!
    In [215]:na2[0,4] = 1  # modify a cell
    In [216]:num[0,:5]
    Out[216]:array([0, 1, 2, 3, 1],'i')
    In [217]:na2[0,:5]
    Out[217]:array([0, 1, 2, 3, 1])  # na2 has been modified as well, so the
                                     # data area is shared between num and na2

In fact, its speed is independent of the array size (as it should be for a non-data-copying procedure):

    # Create a Numeric object 10x larger
    In [218]:num=Numeric.arange(1000*1000, typecode="i");num=num.resize((1000,1000))
    In [219]:t1=time();na2=numarray.array(numarray.memory.writeable_buffer(num),type=num.typecode(),shape=num.shape);time()-t1
    Out[219]:0.00010204315185546875  # 100 us again!

This is because numarray has chosen to use a buffer object internally, and because the Numeric object can be wrapped by a buffer object without any actual data copy. That leads me to think that, if the bytes object (which seems to be implemented by Numeric3) could wrap the buffer object where numarray objects hold their data, the conversion between Numeric3 <--> numarray (or, in general, between packages that deal with bytes objects and packages that deal with buffer objects) could be done at constant cost (that is, independent of the data size). If this cannot be done (I mean, getting a safe bytes object from a buffer object and vice-versa), well, it would be a pity. Do you think that would be possible at all? Cheers, --
qo< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data ""
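The shared-data behaviour shown in the session above can be reproduced in miniature without Numeric or numarray. This sketch assumes a modern Python 3 interpreter and uses array.array and memoryview purely as stand-ins for the two packages; it is not the numarray.memory API.

```python
import array

num = array.array("i", range(10))   # stand-in for the Numeric array
na2 = memoryview(num)               # wraps the same memory, no copy made

na2[4] = 1                          # modify a cell through the wrapper

first_five = list(num[:5])          # the original sees the change
same_size = na2.nbytes == len(num) * num.itemsize
```

Creating the wrapper only records a pointer and a length, which is why the cost in the timings above does not grow with the array size.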
--- Travis Oliphant <oliphant@ee.byu.edu> wrote:
Thank you for your detailed explanations. This is starting to make more sense to me. It is obvious that you understand what we are trying to do, and I pretty much agree with you in how you think it should be done. I think you do a great job of explaining things.
I agree we should come up with a set of names for the interface to arrayobjects. I'm even convinced that offset should be an optional part of the interface (implied 0 if it's not there).
Very cool! You just made my day. I wish I had time to do a good writeup, but I need to catch a flight in a couple hours, and I won't be back behind my computer until Wednesday night. Here is an initial stab:

    __array_shape__
        Required, a sequence (typically tuple) of non-negative int/longs

    __array_storage__
        Required, a buffer or possibly sequence object (list)

        (Required unless the object supports PyBufferProcs directly? I don't have a strong opinion on that one...)

        A slightly different name to indicate it could be a buffer or sequence object (like a list). Typically buffer.

    __array_itemtype__
        Suggested, but Optional if __array_itemsize__ is present.

        This attribute probably warrants some discussion...

        A struct module format string or one of the additional ones that needs to be added. Need to discuss "long double" and "Object". (Capital 'O' for Object, Capital 'D' for long double, Capital 'X' for bit?)

        If not present or the empty string '', indicates that the array elements can only be treated as blobs and the real data representation must be gotten from some other means.

        I think doubling the typecode as a convention to denote complex numbers makes some sense (for instance 'ff' is complex float).

        The struct module convention for denoting native, portable big endian, and portable little endian is concise and documented.

    __array_itemsize__
        Optional if __array_itemtype__ is present and the value can be calculated from struct.calcsize(__array_itemtype__)

    __array_strides__
        Optional if the array data is in a contiguous C layout. Required otherwise. Same length as __array_shape__. Indicates how much to multiply subscripts by to get to the desired position in the storage.

        A sequence (typically tuple) of ints/longs. These are in byte offsets (not element_size offsets) for most arrays. Special exceptions made for:
            Tightly packed (8 bits to a byte) bitmask arrays, where the offsets are bit indexes
            PyObject arrays (lists) where the offsets are indexes

        They should be byte offsets to handle non-aligned data or data with odd packing.

        Fortran arrays might be common enough to warrant special casing. We could discuss whether a __array_fortran__ attribute indicates that the array is in contiguous Fortran layout

    __array_offset__
        Optional and defaults to zero. An int/long indicating the offset to treat as the zeroth element

    __array_complicated__
        Optional and defaults to zero/false. This is a kluge to indicate that while yes the data is an array, the storage layout can not be easily described by the shape/strides/offset combination alone.

        This could warrant some discussion.

    __array_fortran__
        Optional and defaults to zero/false. If you want to represent Fortran arrays without creating a strides for them, this would be necessary. I'd vote to leave it out and stick with strides...

These are all just suggestions. Is something important missing? Predicates like iscontiguous(a) and isfortran(a) can all be easily determined from the above. The ndims or rank is simply len(a.__array_shape__). I wish I had more time to respond to some of the other things in your message, but I'm gone until Wednesday night... Cheers, -Scott
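The arithmetic implied by the shape/strides/offset description can be sketched in a few lines. All function names here are hypothetical helpers written for illustration, and the sketch covers only the ordinary byte-offset case, not the bit-indexed or list-indexed exceptions.

```python
def c_strides(shape, itemsize):
    """Byte strides for a contiguous C (row-major) layout."""
    strides, step = [], itemsize
    for dim in reversed(shape):
        strides.insert(0, step)     # the last axis varies fastest
        step *= dim
    return tuple(strides)

def fortran_strides(shape, itemsize):
    """Byte strides for a contiguous Fortran (column-major) layout."""
    strides, step = [], itemsize
    for dim in shape:
        strides.append(step)        # the first axis varies fastest
        step *= dim
    return tuple(strides)

def byte_offset(index, strides, offset=0):
    """Position in the storage of the element at the given subscripts."""
    return offset + sum(i * s for i, s in zip(index, strides))

def iscontiguous(shape, strides, itemsize):
    return tuple(strides) == c_strides(shape, itemsize)

def isfortran(shape, strides, itemsize):
    return tuple(strides) == fortran_strides(shape, itemsize)

rgba = c_strides((512, 512, 4), 1)  # 512x512 image of 1-byte RGBA elements
```

Because the Fortran layout is just another strides tuple, both predicates reduce to a tuple comparison, which supports leaving a separate __array_fortran__ flag out of the interface.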
On Mar 28, 2005, at 4:30 AM, Travis Oliphant wrote:
Scott,
Thank you for your detailed explanations. This is starting to make more sense to me. It is obvious that you understand what we are trying to do, and I pretty much agree with you in how you think it should be done. I think you do a great job of explaining things. I agree we should come up with a set of names for the interface to arrayobjects. I'm even convinced that offset should be an optional part of the interface (implied 0 if it's not there).
Just to add my two cents, I don't think I ever thought it was necessary to bundle the metadata with the memory object for the reasons Scott outlined. It isn't needed functionally, and there are cases where the same memory may be used in different contexts (as is done with our record arrays). Numarray, when it uses the buffer object, always gets a fresh pointer for the buffer object for every data access. But Scott is right that that pointer is good so long as there isn't a chance for something else to change it. In practice, I don't think that ever happens with the buffers that numarray happens to use, but it's still a flaw of the current buffer object that there is no way to ensure it won't change. I'm not sure how the support for large data sets should be handled. I generally think that it will be very awkward to handle these until Python does as well. Speaking of which... I had been in occasional contact with Martin von Loewis about his work to update Python to handle 64-bit addressing. We weren't planning to handle this in numarray (nor Numeric3, right Travis or do I have that wrong?) until Python did. A few months ago Martin said he was mostly done. I had a chance to talk to him at Pycon about where that work stood. Unfortunately, it is not turning out to be as easy as he hoped. This is too bad. I have a feeling that this work is going to stall without help on our (numpy community) part to help make the changes or drum beating to make it a higher priority. At the moment the Numeric3 effort should be the most important focus, but I think that after that, this should become a high priority. Perry
Just to add my two cents, I don't think I ever thought it was necessary to bundle the metadata with the memory object for the reasons Scott outlined. It isn't needed functionally, and there are cases where the same memory may be used in different contexts (as is done with our record arrays).
I'm glad we've worked that one out.
Numarray, when it uses the buffer object, always gets a fresh pointer for the buffer object for every data access. But Scott is right that that pointer is good so long as there isn't a chance for something else to change it. In practice, I don't think that ever happens with the buffers that numarray happens to use, but it's still a flaw of the current buffer object that there is no way to ensure it won't change.
One could see it as a "flaw" in the buffer object, but I prefer to see it as a problem with objects that use the PyBufferProcs protocol. It is at worst a "limitation" of the buffer interface that should be advertised (in my mind the problem lies with the objects that make use of the buffer protocol and also reallocate memory willy-nilly, since Python does not allow for this). To me, an analogous situation occurs when an extension module writes into memory it does not own and causes a seg-fault. I suppose a casual observer could say this is a Python flaw but clearly the problem is with the extension object. It certainly does not mean at all that something like a buffer object should never exist or that the buffer protocol should not be used. I get the feeling sometimes that some people on python-dev who are unfamiliar with Numeric and numarray feel that way.
I'm not sure how the support for large data sets should be handled. I generally think that it will be very awkward to handle these until Python does as well. Speaking of which...
I had been in occasional contact with Martin von Loewis about his work to update Python to handle 64-bit addressing. We weren't planning to handle this in numarray (nor Numeric3, right Travis or do I have that wrong?) until Python did. A few months ago Martin said he was mostly done. I had a chance to talk to him at Pycon about where that work stood. Unfortunately, it is not turning out to be as easy as he hoped. This is too bad. I have a feeling that this work is going to stall without help on our (numpy community) part to help make the changes or drum beating to make it a higher priority. At the moment the Numeric3 effort should be the most important focus, but I think that after that, this should become a high priority.
I would be interested to hear what the problems are. Why can't you just change the protocol, replacing all int's with Py_intptr_t? Is backward compatibility the problem? This seems like it's on the extension code level (and then only on 64-bit systems), and so would be easier to force through the change in Python 2.5. Numeric3 will suffer limitations whenever the sequence protocol is used. We can work around it as much as possible (by not using the sequence protocol whenever possible), but the limitation lies firmly in the Python sequence protocol. -Travis
I wish I had time to do a good writeup, but I need to catch a flight in a couple hours, and I won't be back behind my computer until Wednesday night. Here is an initial stab:
__array_shape__ Required, a sequence (typically tuple) of non-negative int/longs
great. I agree.
__array_storage__ Required, a buffer or possibly sequence object (list)
(Required unless the object support PyBufferProcs directly? I don't have a strong opinion on that one...)
A slightly different name to indicate it could be a buffer or sequence object (like a list). Typically buffer.
I prefer __array_data__ (it's a common name for Numeric and numarray, It can be interpreted as a sequence object if desired).
__array_itemtype__ Suggested, but Optional if __array_itemsize__ is present.
I say this one defaults to "V" for void * if not present. And __array_itemsize__ is necessary if it is "S" (string), "U" unicode, or "V". I also like __array_typestr__ or __array_typechar__ better as a name.
A struct module format string or one of the additional ones that needs to be added. Need to discuss "long double" and "Object". (Capital 'O' for Object, Capital 'D' for long double, Capital 'X' for bit?)
Don't like 'D' for long double. Complex floats is already using it. I'm not sure I like the idea of moving to two character typecodes at this point because it indicates more internal changes to Numeric3 (otherwise we have two typecharacter standards which is not a good thing). What is wrong with 'g' and 'G' for long double and complex long double respectively?
If not present or the empty string '', indicates that the array elements can only be treated as blobs and the real data representation must be gotten from some other means.
Again, a void * type handles this well.
The struct module convention for denoting native, portable big endian, and portable little endian is concise and documented.
So, you think we should put the byte-order in the typecharacter interface. Don't know.... could be persuaded.
__array_itemsize__ Optional if __array_itemtype__ is present and the value can be calculated from struct.calcsize(__array_itemtype__)
I think it is only optional if typechar is not 'S', 'U', or 'V'.
__array_strides__ Optional if the array data is in a contiguous C layout. Required otherwise. Same length as __array_shape__. Indicates how much to multiply subscripts by to get to the desired position in the storage.
A sequence (typically tuple) of ints/longs. These are in byte offsets (not element_size offsets) for most arrays. Special exceptions made for: Tightly packed (8 bits to a byte) bitmask arrays, where the offsets are bit indexes
PyObject arrays (lists) where the offsets are indexes
They should be byte offsets to handle non-aligned data or data with odd packing.
Fortran arrays might be common enough to warrant special casing. We could discuss whether a __array_fortran__ attribute indicates that the array is in contiguous Fortran layout
I don't think it is necessary in the interface.
__array_offset__ Optional and defaults to zero. An int/long indicating the offset to treat as the zeroth element
__array_complicated__ Optional and defaults to zero/false. This is a kluge to indicate that while yes the data is an array, the storage layout can not be easily described by the shape/strides/offset combination alone.
This could warrant some discussion.
I don't see the utility here I guess, If it can't be described by a shape/strides combination then how can it participate in the protocol?
__array_fortran__ Optional and defaults to zero/false. If you want to represent Fortran arrays without creating a strides for them, this would be necessary. I'd vote to leave it out and stick with strides...
Me too. We should make the interface as minimal as possible, initially. My proposal:

    __array_data__ (optional object that exposes the PyBuffer protocol or a sequence object, if not present, the object itself is used).
    __array_shape__ (required tuple of int/longs that gives the shape of the array)
    __array_strides__ (optional provides how to step through the memory in bytes (or bits if a bit-array), default is C-contiguous)
    __array_typestr__ (optional struct-like string showing the type --- optional endianness indicator + Numeric3 typechars, default is 'V')
    __array_itemsize__ (required if above is 'S', 'U', or 'V')
    __array_offset__ (optional offset to start of buffer, defaults to 0)

So, you could define an array interface with only two additional attributes if your object exposed the buffer or sequence protocol. We should figure out a way to work around the 32-bit limitations of the sequence and buffer protocols as well. -Travis
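The defaulting rules in this proposal can be written down as a small consumer-side helper. This is only a sketch: describe, Blob and Doubles are hypothetical names, and it assumes that any typestr other than 'S', 'U' or 'V' is something struct.calcsize understands, which would not hold for every Numeric3 typechar.

```python
import struct

def describe(obj):
    """Resolve the proposed attributes, filling in the stated defaults."""
    data = getattr(obj, "__array_data__", obj)       # default: the object itself
    shape = tuple(obj.__array_shape__)               # required
    typestr = getattr(obj, "__array_typestr__", "V") # default: void
    offset = getattr(obj, "__array_offset__", 0)

    if typestr[-1:] in ("S", "U", "V"):
        itemsize = obj.__array_itemsize__            # required for these types
    else:
        itemsize = struct.calcsize(typestr)          # assumption: struct-compatible

    strides = getattr(obj, "__array_strides__", None)
    if strides is None:                              # default: C-contiguous
        strides, step = [], itemsize
        for dim in reversed(shape):
            strides.insert(0, step)
            step *= dim
    return {"data": data, "shape": shape, "typestr": typestr,
            "itemsize": itemsize, "strides": tuple(strides), "offset": offset}

class Blob(bytes):
    """A buffer object that adds only a shape and an itemsize."""
    __array_shape__ = (2, 2)
    __array_itemsize__ = 4

class Doubles(object):
    """Not a buffer itself; hands out its data through an attribute."""
    __array_shape__ = (3,)
    __array_typestr__ = "<d"
    __array_data__ = struct.pack("<3d", 1.0, 2.0, 3.0)

blob = Blob(b"\x00" * 16)
blob_info = describe(blob)
dbl_info = describe(Doubles())
```

Blob shows the minimal case for an object that already is a buffer, while Doubles shows an object that only contains one.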
Scott Gilbert wrote:
__array_itemtype__ Suggested, but Optional if __array_itemsize__ is present.
This attribute probably warrants some discussion...
A struct module format string or one of the additional ones that needs to be added. Need to discuss "long double" and "Object". (Capital 'O' for Object, Capital 'D' for long double, Capital 'X' for bit?)
If not present or the empty string '', indicates that the array elements can only be treated as blobs and the real data representation must be gotten from some other means.
I think doubling the typecode as a convention to denote complex numbers makes some sense (for instance 'ff' is complex float).
The struct module convention for denoting native, portable big endian, and portable little endian is concise and documented.
After more thought, I think here we need to also allow the "c-type" independent way of describing an array (i.e. numarray-introduced 'c4' for a complex-valued 4 byte itemsize array). So, perhaps __array_ctypestr__ and __array_typestr__ should be two ways to get the information (or overload the __array_typestr__ interface and require consumers to accept either style). -Travis
On Monday 28 March 2005 23:54, Perry Greenfield wrote:
Numarray, when it uses the buffer object, always gets a fresh pointer for the buffer object for every data access. But Scott is right that that pointer is good so long as there isn't a chance for something else to change it. In practice, I don't think that ever happens with the buffers that numarray happens to use, but it's still a flaw of the current buffer object that there is no way to ensure it won't change.
However, having to update the pointer for the buffer object for every data access does impact performance quite a lot. This issue was brought up on this list some months ago (see [1]). I, for one, have given up calling NA_updateDataPtr() during table reads in PyTables and this sped up the reading process by 70%, which is no joke. And this speed-up could theoretically be achieved in every piece of code that reads like:

    for i in range(n):
        a = numarrayobject[i]

that is, whenever a single element in an array is accessed. If the bytes object suggested by Scott makes the call to NA_updateDataPtr() unnecessary then this is an added advantage of bytes over buffer.

[1] http://sourceforge.net/mailarchive/message.php?msg_id=8848962

Cheers, --
qo< Francesc Altet http://www.carabos.com/ V V Cárabos Coop. V. Enjoy Data ""
On Mar 28, 2005, at 6:25 PM, Travis Oliphant wrote:
One could see it as a "flaw" in the buffer object, but I prefer to see it as a problem with objects that use the PyBufferProcs protocol. It is at worst a "limitation" of the buffer interface that should be advertised (in my mind the problem lies with the objects that make use of the buffer protocol and also reallocate memory willy-nilly, since Python does not allow for this). To me, an analogous situation occurs when an extension module writes into memory it does not own and causes a seg-fault. I suppose a casual observer could say this is a Python flaw but clearly the problem is with the extension object.
It certainly does not mean at all that something like a buffer object should never exist or that the buffer protocol should not be used. I get the feeling sometimes that some people on python-dev who are unfamiliar with Numeric and numarray feel that way.
Certainly there needs to be something like this (that's why we used it for numarray after all).
I'm not sure how the support for large data sets should be handled. I generally think that it will be very awkward to handle these until Python does as well. Speaking of which...
I had been in occasional contact with Martin von Loewis about his work to update Python to handle 64-bit addressing. We weren't planning to handle this in numarray (nor Numeric3, right Travis or do I have that wrong?) until Python did. A few months ago Martin said he was mostly done. I had a chance to talk to him at Pycon about where that work stood. Unfortunately, it is not turning out to be as easy as he hoped. This is too bad. I have a feeling that this work is going to stall without help on our (numpy community) part to help make the changes or drum beating to make it a higher priority. At the moment the Numeric3 effort should be the most important focus, but I think that after that, this should become a high priority.
I would be interested to hear what the problems are. Why can't you just change the protocol, replacing all int's with Py_intptr_t? Is backward compatibility the problem? This seems like it's on the extension code level (and then only on 64-bit systems), and so would be easier to force through the change in Python 2.5.
As Martin explained it, he said there is a lot of code that uses int declarations. If you are saying that it would be easy just to replace all int declarations in Python, I doubt it is that simple since there are calls to many other libraries that must use ints. So it means that there are thousands (so Martin says) of declarations that one must change by hand. It has to be changed for strings, lists, tuples and everything that uses them (Guido was open to doing this but everything had to be updated at once, not just strings or certain objects, and he is certainly right about that). Martin also said that we would need a system with enough memory to test all of these. Lists in particular would need a system with 16GB of memory to test lists that use more than the current limit (because of the size of list objects). I'm not sure I agree with that. It would be nice to have that kind of test, but I think it would be reasonable to have tested on the largest memory systems available at the time for our testing. If there are latent list sequence bugs that surface when 16 GB systems become available, then the bugs can be dealt with at that time (IMHO). (Anybody out there have a system with that much memory available for test purposes :-). Of course, this change will change the C API for Python too as far as sequence use goes (or is there some way around that? A compatibility API and a new one that supports extended indices?) It would be nice if there were some way of handling that gracefully without requiring all extensions to have to change to match this. I imagine that this is going to be the biggest objection to making any changes unless the old API is supported for a while. Perhaps someone has thought this all out already. I haven't thought about it at all. Perry
On Mar 28, 2005, at 6:59 PM, Travis Oliphant wrote:
The struct module convention for denoting native, portable big endian, and portable little endian is concise and documented.
So, you think we should put the byte-order in the typecharacter interface. Don't know.... could be persuaded.
I think we need to think about what the typecharacter is supposed to represent. Is it the value as the user will see it or to indicate what the internal representation is? These are two different things. Then again, I'm not sure how this info is exposed to the user; if it is appropriately handled by intermediate code it may not matter. For example, if this corresponds to what the user will see for the type, I think it is bad. Most of the time they don't care what the internal representation is, they just want to know if it is Int16 or whatever; with the two combined, they have to test for both variants. Perry
__array_storage__
How about __array_data__? -- Magnus Lie Hetland Fall seven times, stand up eight http://hetland.org [Japanese proverb]
Travis Oliphant <oliphant@ee.byu.edu>: [snip]
My proposal:
__array_data__ (optional object that exposes the PyBuffer protocol or a sequence object; if not present, the object itself is used)
__array_shape__ (required tuple of int/longs that gives the shape of the array)
__array_strides__ (optional; provides how to step through the memory in bytes (or bits if a bit-array), default is C-contiguous)
__array_typestr__ (optional struct-like string showing the type --- optional endianness indicator + Numeric3 typechars, default is 'V')
__array_itemsize__ (required if above is 'S', 'U', or 'V')
__array_offset__ (optional offset to start of buffer, defaults to 0)
So, you could define an array interface with only two additional attributes if your object exposed the buffer or sequence protocol.
Wohoo! Niiice :) (Okay, a bit "me too"-ish, but I just wanted to contribute some enthusiasm ;) -- Magnus Lie Hetland Fall seven times, stand up eight http://hetland.org [Japanese proverb]
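As a rough illustration of how the proposed attributes might fit together, here is a minimal sketch of an exporter and a consumer. The attribute names come from the proposal quoted above; the Grid class, the helper functions, and the use of struct format codes as the type string are assumptions made purely for this example, not part of any real library.

```python
import struct


class Grid:
    """Hypothetical exporter: only shape, typestr and data are provided."""

    def __init__(self, rows, cols):
        self.__array_shape__ = (rows, cols)
        # Assumption: a struct-style code, little-endian 4-byte signed int.
        self.__array_typestr__ = "<i"
        self.__array_data__ = bytearray(rows * cols * 4)


def c_contiguous_strides(shape, itemsize):
    """Default strides in bytes: the last index varies fastest."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def get_item(obj, index):
    """Hypothetical consumer: read one element using only the attributes."""
    typestr = obj.__array_typestr__
    itemsize = struct.calcsize(typestr)
    shape = obj.__array_shape__
    strides = getattr(obj, "__array_strides__", None)
    if strides is None:
        strides = c_contiguous_strides(shape, itemsize)
    offset = getattr(obj, "__array_offset__", 0)
    data = getattr(obj, "__array_data__", obj)  # fall back to the object itself
    position = offset + sum(i * s for i, s in zip(index, strides))
    return struct.unpack_from(typestr, data, position)[0]


grid = Grid(2, 3)
struct.pack_into("<i", grid.__array_data__, 4 * (1 * 3 + 2), 42)
print(get_item(grid, (1, 2)))  # 42
```

The optional attributes are looked up with getattr and a default, which is what lets an exporter get away with defining only a couple of them.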
A Dimarts 29 Març 2005 15:23, Francesc Altet va escriure:
This issue was brought up on this list some months ago (see [1]). I, for one, have given up calling NA_updateDataPtr() during table reads in PyTables, and this sped up the reading process by 70%, which is no joke. And this speed-up could theoretically be achieved in every piece of code that reads like:

    for i in range(n):
        a = numarrayobject[i]

that is, whenever a single element in an array is accessed.
Well, the statement above is not exactly true. The overhead introduced by NA_updateDataPtr (and other functions related to the buffer object) is mainly important when you call the __getitem__ method from *extensions* and less important (but still significant!) when you are in pure Python.

This evening I wanted to evaluate how much acceleration there would be if it were not necessary to call NA_updateDataPtr and companions (i.e. getting rid of the buffer object), found some interesting results, and ended up writing quite a long report that took this sunny spring evening away from me :(

Despite its rather serious format, please don't look at it as a serious demonstration of anything. It was made basically because I need maximum performance on __getitem__ operations and was curious about what Numeric/numarray/Numeric3 can offer in that regard. I'm publishing it here because it could be of help to somebody.

Cheers,
--
Francesc Altet     http://www.carabos.com/
Cárabos Coop. V.   Enjoy Data
A note on __getitem__ performance on Numeric/numarray on Python extensions
(with a small follow-up on Numeric3)
==========================================================================

Francesc Altet
2005-03-29

Abstract
========

Numeric [1] and numarray [2] are Python packages that provide very convenient containers to deal with large amounts of data in memory in an efficient way. The fact that they have quite different implementations leads naturally to areas where one package is better suited than the other, and vice versa. In fact, we are lucky to have such a duality, because competition is essential in every (sane) software ecosystem.

The best way of determining which package is better adapted to a certain task is benchmarking. In this report, I have made use of psyco [3] and oprofile [4] in order to decide which is the best candidate for accessing the data in the containers from C extensions. In the appendix, some attention has been dedicated as well to Numeric3, a newborn contender to Numeric and numarray.

Motivation
==========

I need peak performance when accessing data belonging to Numeric/numarray objects in my extensions, so I decided to do some profiling on the following code, which is representative of my own needs:

    niter = 5
    N = 1000*1000

    def matrix_loop(object):
        for j in xrange(niter):
            for i in xrange(N):
                p = object[i]

This basically exercises the __getitem__ special method in Numeric/numarray objects.

The benchmark
=============

In order to get some comparisons done, I've made a small script (getitem-numarrayVSNumeric.py) that checks the speed for both kinds of objects: Numeric and numarray. Also, in order to reduce the Python overhead, I've used psyco [3] so that the results get as close as possible to what these tests would show inside a Python extension (made in C). Moreover, I've used oprofile [4] to get an idea of where the CPU time is spent in this loop.
First of all, I've made a calibration test to measure the time of the empty loop, that is:

    def null_loop():
        for j in xrange(niter):
            for i in xrange(N):
                pass

This time is almost negligible when running with psyco (and the same happens inside a C extension), but it takes a *significant* time if psyco is not active. Once this time has been measured, it is subtracted from the loops that actually exercise __getitem__.

First (naive) timings
=====================

Now, let's see some of the timings that I've done. My platform is a Pentium4 @ 2 GHz laptop, using Debian GNU/Linux with kernel 2.6.9 and gcc 3.3.5. First of all, I'll list the results without psyco:

    $ python2.3 bench/getitem-numarrayVSNumeric.py
    Psyco not active
    Numeric version: 23.8
    numarray version: 1.2.3
    Calibration loop: 0.11173081398
    Time for numarray(getitem)/iter: 3.82528972626e-07
    Time for Numeric(getitem)/iter: 2.51150989532e-07
    getitem in Numeric is 1.52310358537 times faster

We can see that the time per iteration for numarray is 380 ns while for Numeric it is 250 ns, which accounts for a 1.5x speed-up of Numeric vs numarray.

Using psyco to reduce Python overhead
=====================================

However, even though we have subtracted the time for the calibration loop, there may remain other places where time is wasted in Python space. Psyco is a good way to optimize loops and make them go almost as fast as in C. Now, the figures using psyco:

    $ python2.3 bench/getitem-numarrayVSNumeric.py
    Psyco active
    Numeric version: 23.8
    numarray version: 1.2.3
    Calibration loop: 0.0015878200531
    Time for numarray(getitem)/iter: 2.4246096611e-07
    Time for Numeric(getitem)/iter: 1.19336557388e-07
    getitem in Numeric is 2.0317409134 times faster

We can see that the time for the calibration loop has improved by a factor of about 70x. Not too bad for a silly loop. Also, the time per iteration has dropped to 242 ns for numarray and to 119 ns for Numeric. This accounts for a 2x speed-up.
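The calibrate-then-subtract method described above can be sketched in present-day Python as follows. This is not the original getitem-numarrayVSNumeric.py script (which is not shown in the report); the function names and the use of time.perf_counter are assumptions for illustration, and a plain list stands in for the array objects.

```python
import time


def null_loop(niter, n):
    """Empty loop used only to measure the cost of the looping itself."""
    for _ in range(niter):
        for _ in range(n):
            pass


def getitem_loop(obj, niter, n):
    """Same loop, but exercising obj.__getitem__ on every iteration."""
    for _ in range(niter):
        for i in range(n):
            p = obj[i]


def timed(func, *args):
    """Wall-clock seconds taken by a single call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def time_per_getitem(obj, niter=5, n=100_000):
    """Seconds per __getitem__ call, with the empty-loop time subtracted."""
    calibration = timed(null_loop, niter, n)
    total = timed(getitem_loop, obj, niter, n)
    return (total - calibration) / (niter * n)


per_call = time_per_getitem(list(range(100_000)))
print("Time for list(getitem)/iter:", per_call)
```

Single runs like this are noisy, so the subtraction can even come out slightly negative for very cheap operations; repeating the measurement and taking the minimum is a common refinement.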
The first conclusion is that numarray is considerably slower than Numeric when accessing its data. Besides, when using psyco, part of the Python overhead evaporates, making the gap between the Numeric and numarray loops grow.

Introducing oprofile: getting a broad view of what's going on
=============================================================

In order to measure the exact difference in the __getitem__ method without the Python overhead (in an extension, for example), I've used oprofile against the psyco version of the benchmark. Here is the result for the run with psyco, profiled with oprofile:

    # opreport /usr/bin/python2.3
     samples|       %|
    ------------------
         586  34.1293 libnumarray.so
         454  26.4415 python2.3
         331  19.2778 _numpy.so
         206  11.9977 _ndarray.so
         102   5.9406 memory.so
          22   1.2813 libc-2.3.2.so
           9   0.5242 ld-2.3.2.so
           4   0.2330 multiarray.so
           2   0.1165 _sort.so
           1   0.0582 _psyco.so

The libnumarray.so, _ndarray.so, memory.so and _sort.so shared libraries all belong to the numarray package. The _numpy.so and multiarray.so libraries belong to Numeric. The time spent in Python space is very small (just 26%, in great part thanks to psyco acceleration). The libc-2.3.2.so and ld-2.3.2.so libraries belong to the C runtime, and it is not possible to decide whether this time has been used by numarray, Numeric or Python itself, but as the time consumed is very small, we can safely ignore it.

So, if we sum the samples taken when the CPU was in C space (the shared libs) in numarray, and compare against the time in C space in Numeric, we get 894 against 331, which means that Numeric is 2.7x faster than numarray for __getitem__. Of course, this is more than the 1.5x and 2x factors that we got earlier, because those included the time spent in Python space. However, the 2.7x factor is probably more accurate when one wants to exercise __getitem__ in C extensions.
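The 894-versus-331 comparison can be checked directly from the opreport samples. In the sketch below the sample counts are copied from the table above; the grouping into variables is only for illustration. Note that 894 corresponds to the three largest numarray libraries; adding the 2 samples of _sort.so gives 896 and the same ratio to one decimal place.

```python
# Sample counts copied from the opreport output in the report.
samples = {
    "libnumarray.so": 586,
    "_ndarray.so": 206,
    "memory.so": 102,
    "_sort.so": 2,
    "_numpy.so": 331,
}

# The report's figure: the three largest numarray shared libraries.
numarray_main = samples["libnumarray.so"] + samples["_ndarray.so"] + samples["memory.so"]
numeric = samples["_numpy.so"]

ratio = numarray_main / numeric
ratio_with_sort = (numarray_main + samples["_sort.so"]) / numeric

print(numarray_main, numeric, round(ratio, 1))  # 894 331 2.7
```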

Most CPU intensive functions using oprofile
============================================

If we want to look at the most time-consuming functions in numarray:

    # opstack -t 1 /usr/bin/python2.3 | sort -nr | head -10
    454  26.6432 python2.3       (no symbols)
    331  19.4249 _numpy.so       (no symbols)
    145   8.5094 libnumarray.so  NA_getPythonScalar
    115   6.7488 libnumarray.so  NA_getByteOffset
    101   5.9272 libnumarray.so  isBufferWriteable
     98   5.7512 _ndarray.so     _ndarray_subscript
     91   5.3404 _ndarray.so     _simpleIndexingCore
     73   4.2840 libnumarray.so  NA_updateDataPtr
     64   3.7559 memory.so       memory_getbuf
     60   3.5211 libnumarray.so  getReadBufferDataPtr

The _numpy.so library was stripped of debugging info, so we can't see where the time was spent in Numeric. However, we can estimate the cost of getting a fresh pointer to the data buffer for every data access in numarray: isBufferWriteable + NA_updateDataPtr + memory_getbuf + getReadBufferDataPtr gives a total of 298 samples, which is almost as much as all the time spent by the Numeric shared library (331). So we can conclude that having a buffer object in our array object can be a serious drawback if we want to get maximum performance when accessing the data.

Another point worth looking at is NA_getByteOffset, which takes 115 samples by itself. This is perhaps a little too much.

Conclusions
===========

To sum up, we can expect the __getitem__ method in Numeric to be 1.5x faster than in numarray in pure Python code, 2x faster when using psyco, and 2.7x faster when used in C extensions.

One factor that (partially) explains why numarray is slower in this area is that it is based on the buffer interface to keep its data. This feature, while very convenient for certain tasks (like sharing data with other Python packages or extensions), has a limitation that makes an extension crash if the memory buffer is reallocated. Other solutions (like the "bytes" object [5]) have been proposed to overcome this limitation (and others) of the buffer interface.
Numeric3 might choose this to avoid the kind of contention problems created by the buffer interface.

Finally, we have seen how using oprofile can be of invaluable help for determining where the hot spots are, not only in our extensions, but also in other shared libraries in our system. If the shared libraries also have debugging info in them, then it is possible to track down even the most expensive routines in our application.

Appendix
========

Even though it is in the very early stages of existence, I was curious about how Numeric3 would perform in comparison with Numeric. By slightly changing getitem-numarrayVSNumeric.py, I've come up with getitem-NumericVSNumeric3.py, which does the comparison I wanted. When running without psyco, I got:

    $ python2.3 bench/getitem-NumericVSNumeric3.py
    Psyco not active
    Numeric version: 23.8
    Numeric3 version: Very early alpha release...!
    Calibration loop: 0.107951593399
    Time for Numeric3(getitem)/iter: 1.18472018242e-06
    Time for Numeric(getitem)/iter: 2.45458602905e-07
    getitem in Numeric is 4.82655799551 times faster

Oops, Numeric3 is almost 5 times slower than Numeric. So it really seems to be still very alpha (you know, premature optimization is the root of all evil). Never mind, this is just an exercise. So, let's continue with the psyco version:

    $ python2.3 bench/getitem-NumericVSNumeric3.py
    Psyco active
    Numeric version: 23.8
    Numeric3 version: Very early alpha release...!
    Calibration loop: 0.00171356201172
    Time for Numeric3(getitem)/iter: 1.04013824463e-06
    Time for Numeric(getitem)/iter: 1.19578647614e-07
    getitem in Numeric is 8.69836099828 times faster

The gap has increased to 8.7x, as expected. Let's have a look at the most time-consuming shared libs by using oprofile:

    # opreport /usr/bin/python2.3
     samples|       %|
    ------------------
        1841  33.7365 multiarray.so
        1701  31.1710 libc-2.3.2.so
        1586  29.0636 python2.3
         318   5.8274 _numpy.so
           6   0.1100 ld-2.3.2.so
           3   0.0550 multiarray.so
           2   0.0367 _psyco.so

God! Two libraries alone are getting more than half of the CPU: multiarray.so and libc-2.3.2.so. As we already know that Numeric3 __getitem__ takes much more time than its counterpart in Numeric, we can conclude that Numeric3 comes with its own multiarray.so, and that it is responsible for taking one third (33.7%) of the time. Moreover, multiarray.so must be responsible for calling the libc routines so much, because in our previous benchmarks the libc calls never took more than 5% of the time, and here they take more than 30%.

To conclude, let's see which are the most time-consuming routines in Numeric3 for this exercise:

    # opstack -t 1 /usr/bin/python2.3 | sort -nr | head -20
    1586  30.1750 python2.3      (no symbols)
     669  12.7283 libc-2.3.2.so  __GI___strcasecmp
     618  11.7580 multiarray.so  PyArray_MapIterNew
     374   7.1157 multiarray.so  array_subscript
     318   6.0502 _numpy.so      (no symbols)
     260   4.9467 libc-2.3.2.so  __realloc
     190   3.6149 libc-2.3.2.so  _int_malloc
     172   3.2725 multiarray.so  PyArray_New
     152   2.8919 libc-2.3.2.so  __strncasecmp
     123   2.3402 libc-2.3.2.so  malloc_consolidate
     121   2.3021 libc-2.3.2.so  __memalign_internal
     118   2.2451 multiarray.so  array_dealloc
     102   1.9406 libc-2.3.2.so  _int_realloc
      93   1.7694 multiarray.so  fancy_indexing_check
      86   1.6362 multiarray.so  arraymapiter_dealloc
      79   1.5030 multiarray.so  PyArray_Scalar
      76   1.4460 multiarray.so  LONG_copyswapn
      62   1.1796 multiarray.so  PyArray_UpdateFlags
      57   1.0845 multiarray.so  PyArray_DescrFromType

While we can see that a lot of time is spent inside the multiarray.so of Numeric3, it also catches our attention that a lot of time is spent in __GI___strcasecmp, a libc function. This is very strange, because our arrays are made of integers and calling strcasecmp on each iteration seems quite unnecessary. In order to know who is calling strcasecmp (i.e. to get the call tree), oprofile needs a specially patched version of the Linux kernel. But this is material for another story.

References
==========

[1] http://numpy.sourceforge.net/
[2] http://stsdas.stsci.edu/numarray/
[3] http://psyco.sourceforge.net/
[4] http://oprofile.sourceforge.net/
[5] http://www.python.org/peps/pep-0296.html
There are two distinct issues with regards to large arrays. 1) How do you support > 2Gb memory mapped arrays on 32 bit systems and other large-object arrays only a part of which are in memory at any given time (there is an equivalent problem for > 8 Eb (exabytes) on 64 bit systems, an Exabyte is 2^60 bytes or a giga-giga-byte). 2) Supporting the sequence protocol for in-memory objects on 64-bit systems. Part 2 can be fixed using the recommendations Martin is making and which will likely happen (though it could definitely be done faster). Handling part 1 is more difficult. One idea is to define some kind of "super object" that mediates between the large file and the in-memory portion. In other words, the ndarray is an in-memory object, while the super object handles interfacing it with a larger structure. Thoughts? -Travis
On Mar 29, 2005, at 9:11 PM, Travis Oliphant wrote:
There are two distinct issues with regards to large arrays.
1) How do you support > 2Gb memory mapped arrays on 32 bit systems and other large-object arrays only a part of which are in memory at any given time (there is an equivalent problem for > 8 Eb (exabytes) on 64 bit systems, an Exabyte is 2^60 bytes or a giga-giga-byte).
2) Supporting the sequence protocol for in-memory objects on 64-bit systems.
Part 2 can be fixed using the recommendations Martin is making and which will likely happen (though it could definitely be done faster). Handling part 1 is more difficult.
One idea is to define some kind of "super object" that mediates between the large file and the in-memory portion. In other words, the ndarray is an in-memory object, while the super object handles interfacing it with a larger structure.
Thoughts?
Maybe I'm missing something but isn't it possible to mmap part of a large file? In that case one just limits the memory maps to what can be handled on a 32 bit system leaving it up to the user software to determine which part of the file to mmap. Did you have something more automatic in mind? As for other large-object arrays I'm not sure what other examples there are other than memory mapping. Do you have any? Perry
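Mapping only part of a large file, as suggested above, might look like the following minimal sketch with Python's standard mmap module. The map_window helper and its return convention are assumptions made for this example; the one hard requirement it works around is that the mapping offset must be a multiple of mmap.ALLOCATIONGRANULARITY.

```python
import mmap


def map_window(path, offset, length):
    """Map `length` bytes of the file at `path`, starting at byte `offset`.

    Returns (mapping, delta): the requested bytes are
    mapping[delta:delta + length]. The mapping is read-only, and the
    caller is responsible for closing it.
    """
    granularity = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // granularity) * granularity  # round down to a valid offset
    delta = offset - aligned
    with open(path, "rb") as f:
        mapping = mmap.mmap(
            f.fileno(), delta + length, access=mmap.ACCESS_READ, offset=aligned
        )
    return mapping, delta
```

With this approach the user code decides which window of the file to map, so only `delta + length` bytes of address space are consumed at a time, however large the file is.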
A Dimarts 29 Març 2005 01:59, Travis Oliphant va escriure:
My proposal:
__array_data__ (optional object that exposes the PyBuffer protocol or a sequence object; if not present, the object itself is used)
__array_shape__ (required tuple of int/longs that gives the shape of the array)
__array_strides__ (optional; provides how to step through the memory in bytes (or bits if a bit-array), default is C-contiguous)
__array_typestr__ (optional struct-like string showing the type --- optional endianness indicator + Numeric3 typechars, default is 'V')
__array_itemsize__ (required if above is 'S', 'U', or 'V')
__array_offset__ (optional offset to start of buffer, defaults to 0)
Considering that heterogeneous data is to be supported as well, and there is some tradition of assigning names to the different fields, I wonder if it would not be good to add something like:

__array_names__ (optional comma-separated names for record fields)

Cheers,
--
Francesc Altet     http://www.carabos.com/
Cárabos Coop. V.   Enjoy Data
A Dimarts 29 Març 2005 01:59, Travis Oliphant va escriure:
My proposal:
__array_data__ (optional object that exposes the PyBuffer protocol or a sequence object; if not present, the object itself is used)
__array_shape__ (required tuple of int/longs that gives the shape of the array)
__array_strides__ (optional; provides how to step through the memory in bytes (or bits if a bit-array), default is C-contiguous)
__array_typestr__ (optional struct-like string showing the type --- optional endianness indicator + Numeric3 typechars, default is 'V')
__array_itemsize__ (required if above is 'S', 'U', or 'V')
__array_offset__ (optional offset to start of buffer, defaults to 0)
Considering that heterogeneous data is to be supported as well, and there is some tradition of assigning names to the different fields, I wonder if it would not be good to add something like:
__array_names__ (optional comma-separated names for record fields)
I'm O.K. with that.

After more thought, I think using the struct-like typecharacters is not a good idea for the array protocol. I think that the character codes used by the numarray record array (kind_character + byte_width) are better. Commas can separate heterogeneous data. The problem is that if the data buffer originally came from a different machine or was saved with a different compiler (e.g. a mmap'ed file), then the struct-like typecodes only tell you the C type that machine thought the data was. They do not tell you how to interpret the data on this machine.

So, I think we should use the __array_typestr__ method to pass type information using the kind_character + byte_width method. I'm also going to use this type information for pickles, so that arrays pickled on one machine type will be able to be interpreted on another with ease.

Bool -- "b%d" % sizeof(bool)
Signed Integer -- "i%d" % sizeof(<some int>)
Unsigned Integer -- "u%d" % sizeof(<some uint>)
Float -- "f%d" % sizeof(<some float>)
Complex -- "c%d" % sizeof(<some complex>)
Object -- "O%d" % sizeof(PyObject *) --- this would only be useful on shared memory
String -- "S%d" % itemsize
Unicode -- "U%d" % itemsize
Void -- "V%d" % itemsize

I also think that rather than attach < or > to the start of the string it would be easier to have another protocol for endianness. Perhaps something like:

__array_endian__ (optional Python integer with the value 1 in it). If it is not 1, then a byteswap is necessary.

-Travis
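A small parser makes the kind_character + byte_width scheme concrete. This is only a sketch: the parse_typestr and packed_itemsize helpers are hypothetical names, and summing the widths assumes the fields are packed with no padding, which the message does not specify.

```python
# Kind characters taken from the table in the message above.
KINDS = {
    "b": "bool",
    "i": "signed integer",
    "u": "unsigned integer",
    "f": "float",
    "c": "complex",
    "O": "object",
    "S": "string",
    "U": "unicode",
    "V": "void",
}


def parse_typestr(typestr):
    """Split e.g. 'S10,i1,f8' into [('S', 10), ('i', 1), ('f', 8)]."""
    fields = []
    for code in typestr.split(","):
        code = code.strip()
        kind, width = code[0], int(code[1:])
        if kind not in KINDS:
            raise ValueError("unknown kind character: %r" % kind)
        fields.append((kind, width))
    return fields


def packed_itemsize(typestr):
    """Total bytes per item, assuming fields are packed without padding."""
    return sum(width for _, width in parse_typestr(typestr))


print(parse_typestr("S10,i1,f8"))    # [('S', 10), ('i', 1), ('f', 8)]
print(packed_itemsize("S10,i1,f8"))  # 19
```

Because the byte width is part of every code, the item size can be computed from the type string alone, independently of the machine or compiler that produced the data.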
After more thought, I think using the struct-like typecharacters is not a good idea for the array protocol. I think that the character codes used by the numarray record array: kind_character + byte_width is better. Commas can separate heterogeneous data. The problem is that if the data buffer originally came from a different machine or saved with a different compiler (e.g. a mmap'ed file), then the struct-like typecodes only tell you the c-type that machine thought the data was. It does not tell you how to interpret the data on this machine. So, I think we should use the __array_typestr__ method to pass type information using the kind_character + byte_width method. I'm also going to use this type information for pickles, so that arrays pickled on one machine type will be able to be interpreted on another with ease.
Bool -- "b%d" % sizeof(bool)
Signed Integer -- "i%d" % sizeof(<some int>)
Unsigned Integer -- "u%d" % sizeof(<some uint>)
Float -- "f%d" % sizeof(<some float>)
Complex -- "c%d" % sizeof(<some complex>)
Object -- "O%d" % sizeof(PyObject *) --- this would only be useful on shared memory
String -- "S%d" % itemsize
Unicode -- "U%d" % itemsize
Void -- "V%d" % itemsize
Of course with this protocol for the typestr, the array_itemsize is redundant and can disappear. Another reason to like it.
I also think that rather than attach < or > to the start of the string it would be easier to have another protocol for endianness. Perhaps something like:
__array_endian__ (optional Python integer with the value 1 in it). If it is not 1, then a byteswap must be necessary.
I'm mixed on this, I could be persuaded either way. -Travis
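The message does not spell out how the integer 1 would be stored or checked, so the following is just one possible reading of the __array_endian__ idea: the producer writes the integer 1 in its own byte order, and a consumer that does not read back 1 knows the data needs a byteswap. The function names and the 4-byte width are assumptions for this sketch.

```python
import struct


def export_endian_marker(producer_byteorder):
    """Producer side: the integer 1 as 4 raw bytes in the producer's order."""
    fmt = "<i" if producer_byteorder == "little" else ">i"
    return struct.pack(fmt, 1)


def needs_byteswap(marker, consumer_byteorder):
    """Consumer side: True if the marker does not read back as 1."""
    fmt = "<i" if consumer_byteorder == "little" else ">i"
    return struct.unpack(fmt, marker)[0] != 1


marker = export_endian_marker("big")
print(needs_byteswap(marker, "little"))  # True
print(needs_byteswap(marker, "big"))     # False
```

A big-endian 1 read on a little-endian machine comes out as 16777216, which is why any value other than 1 signals a mismatch.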
Francesc Altet <faltet@carabos.com> writes:
A Dimarts 29 Març 2005 01:59, Travis Oliphant va escriure:
My proposal:
__array_data__ (optional object that exposes the PyBuffer protocol or a sequence object; if not present, the object itself is used)
__array_shape__ (required tuple of int/longs that gives the shape of the array)
__array_strides__ (optional; provides how to step through the memory in bytes (or bits if a bit-array), default is C-contiguous)
__array_typestr__ (optional struct-like string showing the type --- optional endianness indicator + Numeric3 typechars, default is 'V')
__array_itemsize__ (required if above is 'S', 'U', or 'V')
__array_offset__ (optional offset to start of buffer, defaults to 0)
Considering that heterogeneous data is to be supported as well, and there is some tradition of assigning names to the different fields, I wonder if it would not be good to add something like:
__array_names__ (optional comma-separated names for record fields)
A sequence (list or tuple) of strings would be preferable. That removes all worrying about using commas in the names.

--
David M. Cooke    http://arbutus.physics.mcmaster.ca/dmc/
cookedm@physics.mcmaster.ca
David M. Cooke wrote:
Francesc Altet <faltet@carabos.com> writes:
A Dimarts 29 Març 2005 01:59, Travis Oliphant va escriure:
My proposal:
__array_data__ (optional object that exposes the PyBuffer protocol or a sequence object; if not present, the object itself is used)
__array_shape__ (required tuple of int/longs that gives the shape of the array)
__array_strides__ (optional; provides how to step through the memory in bytes (or bits if a bit-array), default is C-contiguous)
__array_typestr__ (optional struct-like string showing the type --- optional endianness indicator + Numeric3 typechars, default is 'V')
__array_itemsize__ (required if above is 'S', 'U', or 'V')
__array_offset__ (optional offset to start of buffer, defaults to 0)
Considering that heterogeneous data is to be supported as well, and there is some tradition of assigning names to the different fields, I wonder if it would not be good to add something like:
__array_names__ (optional comma-separated names for record fields)
A sequence (list or tuple) of strings would be preferable. That removes all worrying about using commas in the names.
As I understand it, record arrays can be heterogeneous. If so, wouldn't it make sense for this to be a sequence of tuples? For example:

    [('Name', charStringType), ('Age', _nt.Int8), ...]

where _nt is defined by something like:

    import numarray.numerictypes as _nt

Colin W.
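A sequence of (name, type) tuples also makes it straightforward to compute where each field lives inside a record. The sketch below is hypothetical: it uses kind + byte-width strings such as 'S10' in place of the numarray type objects in the example above, and it assumes fields are packed back to back with no padding.

```python
def record_layout(fields):
    """Compute byte offsets for a sequence of (name, typestr) pairs.

    typestr is a kind character followed by a byte width, e.g. 'S10'.
    Returns ({name: (offset, typestr)}, total_itemsize), assuming the
    fields are packed with no padding between them.
    """
    layout = {}
    offset = 0
    for name, typestr in fields:
        if name in layout:
            raise ValueError("duplicate field name: %r" % name)
        width = int(typestr[1:])
        layout[name] = (offset, typestr)
        offset += width
    return layout, offset


fields = [("Name", "S10"), ("Age", "i1"), ("Weight, kg", "f8")]
layout, itemsize = record_layout(fields)
print(layout["Age"], itemsize)  # (10, 'i1') 19
```

The third field name contains a comma on purpose: with a sequence of tuples that is harmless, whereas a comma-separated string of names could not represent it.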
I don't know if you have followed the array interface discussion. It is defined at http://numeric.scipy.org I have implemented consumer and exporter interfaces for Numeric and an exporter interface for numarray. The consumer interface needs a little help but shouldn't take too long for someone who understands numarray better. Now Numeric arrays can share data with numarray (no data copy). scipy.base arrays will also implement the array interface. I think the array interface is a good direction to go. -Travis