![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
I'm attaching my latest extended buffer-protocol PEP, which tries to get the array interface into Python. Basically, it is a translation of the numpy header files into something as simple as possible that can still be used to describe a complicated block of memory to another user. My purpose is to get feedback and criticism from this community before presenting it to the larger Python community. -Travis

PEP: <unassigned>
Title: Extending the buffer protocol to include the array interface
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant <oliphant@ee.byu.edu>
Status: Draft
Type: Standards Track
Created: 28-Aug-2006
Python-Version: 2.6

Abstract

This PEP proposes extending the tp_as_buffer structure to include function pointers that incorporate information about the intended shape and data-format of the provided buffer. In essence this will place an array interface directly into Python.

Rationale

Several extensions to Python utilize the buffer protocol to share the location of a data-buffer that is really an N-dimensional array. However, there is no standard way to exchange the additional N-dimensional array information so that the data-buffer is interpreted correctly. The NumPy project introduced an array interface (http://numpy.scipy.org/array_interface.shtml) through a set of attributes on the object itself. While this approach works, it requires attribute lookups, which can be expensive when sharing many small arrays.

One of the key reasons that users often request that something like NumPy be placed into the standard library is so that it can be used as a standard for other packages that deal with arrays. This PEP provides a mechanism for extending the buffer protocol (which already allows data sharing) to add the additional information needed to understand the data.
This should be of benefit to all third-party modules that want to share memory through the buffer protocol, such as GUI toolkits, PIL, PyGame, CVXOPT, PyVoxel, PyMedia, audio libraries, video libraries, etc.

Proposal

Add bf_getarrview and bf_relarrview function pointers to the buffer protocol to allow objects to share a view on a memory pointer, including information about accessing it as an N-dimensional array. Add the TP_HAS_ARRAY_BUFFER flag to types that define this extended buffer protocol. A few additional C-API calls should perhaps also be added to Python to facilitate creating new PyArrayViewObjects.

Specification

static PyObject* bf_getarrayview(PyObject *obj)

This function must return a new reference to a PyArrayViewObject which contains the details of the array information exposed by the object. If failure occurs, then NULL is returned and an exception set.

static int bf_relarrayview(PyObject *obj)

If not NULL, then this will be called when the object returned by bf_getarrview is destroyed, so that the underlying object can be aware when acquired "views" are released. The object that defines bf_getarrview should not re-allocate memory (re-size itself) while views are extant. A 0 is returned on success; a -1 is returned (and an error condition set) on failure.

The PyArrayViewObject has the structure:

```c
typedef struct {
    PyObject_HEAD
    void *data;           /* pointer to the beginning of data */
    int nd;               /* the number of dimensions */
    Py_ssize_t *shape;    /* c-array of size nd giving shape */
    Py_ssize_t *strides;  /* SEE BELOW */
    PyObject *base;       /* the object this is a "view" of */
    PyObject *format;     /* SEE BELOW */
    int flags;            /* SEE BELOW */
} PyArrayViewObject;
```

strides -- a c-array of size nd providing the striding information: the number of bytes to skip to get to the next element in that dimension.

format -- a Python data-format object (PyDataFormatObject) which contains information about how each item in the array should be interpreted.

flags -- an integer of flags.
PYARR_WRITEABLE is the only flag that must be set appropriately by types. The other flags -- PYARR_ALIGNED, PYARR_C_CONTIGUOUS, PYARR_F_CONTIGUOUS, and PYARR_NOTSWAPPED -- can all be determined from the rest of the PyArrayViewObject using the UpdateFlags C-API.

The PyDataFormatObject has the structure:

```c
typedef struct {
    PyObject_HEAD
    PySimpleformat primitive;  /* basic primitive type */
    int flags;                 /* byte-order, isaligned */
    int itemsize;              /* SEE BELOW */
    int alignment;             /* SEE BELOW */
    PyObject *extended;        /* SEE BELOW */
} PyDataFormatObject;

enum PySimpleformat {
    PY_BIT='1',     PY_BOOL='?',        PY_BYTE='b',       PY_SHORT='h',
    PY_INT='i',     PY_LONG='l',        PY_LONGLONG='q',   PY_UBYTE='B',
    PY_USHORT='H',  PY_UINT='I',        PY_ULONG='L',      PY_ULONGLONG='Q',
    PY_FLOAT='f',   PY_DOUBLE='d',      PY_LONGDOUBLE='g', PY_CFLOAT='F',
    PY_CDOUBLE='D', PY_CLONGDOUBLE='G', PY_OBJECT='O',     PY_CHAR='c',
    PY_UCS2='u',    PY_UCS4='w',        PY_FUNCPTR='X',    PY_VOIDPTR='V'
};
```

Each of these simple formats has a special character code which can be used to identify the primitive in a nested Python list.

flags -- flags for the data-format object. Specified masks are PY_NATIVEORDER, PY_BIGENDIAN, PY_LITTLEENDIAN, and PY_IGNORE.

itemsize -- the total size represented by this data-format in bytes, unless the primitive is PY_BIT, in which case it is the size in bits. For data-formats that are simple 1-d arrays of the underlying primitive, this total size can represent more than one primitive (with extended still NULL).

alignment -- for the primitive types this is offsetof(struct {char c; type v;}, v).

extended -- NULL if this is a primitive data-type or no additional information is available. If primitive is PY_FUNCPTR, then this can be a tuple with >=1 element: (args, {dim0, dim1, dim2, ...}).

args -- a list (of at least length 2) of data-format objects specifying the input argument formats, with the last argument specifying the output argument data-format (use None for void inputs and/or outputs).
For other primitives, this can be a tuple with >=2 elements: (names, fields, {dim0, dim1, dim2, ...}). Use None for both names and fields if they should be ignored.

names -- an ordered list of string or unicode objects giving the names of the fields for a structured data-format.

fields -- a Python dictionary whose keys are given by the list in names. Each entry in the dictionary is a 3-tuple containing (data-format-object, offset, meta-information), where meta-information is Py_None if there is no meta-information. Offset is given in bytes from the start of the record, or in bits if PY_BIT is the primitive.

Any additional entries in the extended tuple (dim0, dim1, etc.) are interpreted as integers which specify that this data-format is an array of the given shape of the fundamental data-format specified by the remainder of the data-format object. The dimensions are specified so that the last index is always assumed to vary the fastest (C-order).

The constructor of a PyArrayViewObject allocates the memory for shape and strides, and the destructor frees that memory. The constructor of a PyDataFormatObject allocates the objects it needs for fields, names, and shape.
C-API

```c
void PyArrayView_UpdateFlags(PyObject *view, int flags)
    /* update the flags on the array view object provided */

PyDataFormatObject *Py_NewSimpleFormat(PySimpleformat primitive)
    /* return a new primitive data-format object */

PyDataFormatObject *Py_DataFormatFromCType(PyObject *ctype)
    /* return a new data-format object from a ctype */

int Py_GetPrimitiveSize(PySimpleformat primitive)
    /* return the size (in bytes) of the provided primitive */

PyDataFormatObject *Py_AlignDataFormat(PyObject *format)
    /* take a data-format object and construct an aligned data-format
       object where all fields are aligned on appropriate boundaries
       for the compiler */
```

Discussion

The information provided in the array view object is patterned after the way a multi-dimensional array is defined in NumPy -- including the data-format object, which allows a variety of descriptions of memory depending on the need.

Reference Implementation

Supplied when the PEP is accepted.

Copyright

This document is placed in the public domain.
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis, First, thanks for doing this -- Python really needs it!
Ah, I do like reducing that overhead -- I know I use arrays a lot for small data sets too, so that overhead can be significant. I'm not well qualified to review the tech details, but to make sure I have this right:
So if I have some C code that wants to use any array passed in, I can just call bf_getarrayview(obj), and if it doesn't return NULL, I have a valid array that I can query to see if it fits what I'm expecting. Have I got that right? If so, this would be great. By the way, how compatible is this with the existing buffer protocol? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Christopher Barker wrote:
Yes, you could call this, but you would call it from the type object, like this: obj->ob_type->tp_as_buffer->bf_getarrayview(obj). Or more likely (and I should add this to the C-API) you would call PyArrayView_FromObject(obj), which does this under the covers.
It's basically orthogonal. In other words, if you defined the array view protocol, you would not need the buffer protocol at all. But you could easily define both. -Travis
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
yes, that's what I'm looking for -- please do add that to the C-API
OK, so if one of these were passed into something expecting the buffer protocol, then it wouldn't work -- but you could make an object conform to both protocols at once, like numpy does now, I suppose. Very nice. Another question -- is this new approach in response to feedback from Guido and/or other Python devs? This sure seems like a good way to go -- though it seems, from the last discussion I followed on python-dev, that most of the devs just didn't get how useful this would be! -Chris
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
The new approach is a response to the devs and to Sasha, who had some relevant comments. Yes, I agree that many devs don't "get" how useful it would be, because they have not written scientific or graphics-intensive applications. However, Guido is the one who encouraged me at SciPy 2006 to push this, so I think he is generally favorable to the idea. The Python devs will definitely push back. The strongest opposition seems to be from people that don't 'get' it and so don't want "dead interfaces" in Python. They would need to be convinced of how often such an interface would actually get used. I've tried to do that in the rationale, but what would help most is many people actually posting to python-dev in support of the basic idea (you don't have to support the specific implementation --- most are going to be uncomfortable knowing enough to take a stand). There is a great need for people to stand up and say: "We need something like this in Python..." -Travis
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
good sign.
well, none of us want dead interfaces in Python.
let us know when there is a relevant thread to chime in on. However, what we really need is not people like me saying "I need this", but rather people that develop significant existing extension packages saying they'll actually use this in their packages. People like:

- wxPython -- Robin Dunn
- PIL -- Fredrik Lundh
- PyOpenGL -- who?
- PyObjC -- would it be useful there? (Ronald Oussoren)
- matplotlib (but maybe it's already married to numpy...)
- PyGtk?

Who else? I know Robin Dunn is interested in using it in wxPython -- but probably only if someone contributes the code. I hope to do that some day, but I'm only barely qualified to do so. Fredrik accepted your submission of code to use the array interface in PIL, but he seemed skeptical of the idea. Perhaps lobbying (or even just surveying) some of these folks would be useful. -Chris
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Christopher Barker wrote:
It's a good start, but there is also PyMedia, PyVoxel, any video-library interface writers, any audio-library interface writers -- anybody who wants to wrap or write code that does some kind of manipulation on a chunk of data of a specific data-format. There are so many people who would use it that I don't feel qualified to speak for them all. -Travis
![](https://secure.gravatar.com/avatar/5c7407de6b47afcd3b3e2164ff5bcd45.jpg?s=120&d=mm&r=g)
On Friday 05 January 2007 01:36, Travis Oliphant wrote:
Yeah, I think this is the case for PyTables. However, the PyTables case should be similar to matplotlib's: it needs so many features of NumPy that it is barely conceivable it could live with just an implementation of the array interface. In any case, I think that if the PEP succeeds, it would represent an extraordinary leap towards efficient data interchange between applications that don't need (or are reluctant to include) NumPy for their normal operation. Cheers, --
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
right -- I didn't intend that to be a comprehensive list.
I think this is key -- we all know that there are a lot of people that *could* use it, and we might even say *should* use it. The question that I think the core Python devs want answered is whether they *will* use it. That's why I suggest that rather than having a bunch of numpy users make comments to python-dev, we really need authors of packages like the above to make comments to python-dev, saying "I could use this, and I *will* use this if it's put into the standard lib".

I do think there is one issue that does need to be addressed. The current buffer protocol already allows modules to share data without copying -- but it doesn't provide any description of that data. This proposal would provide more description of that data, but still not describe it completely -- that's just not possible. So how helpful is the additional description? I think very, but others are not convinced. Some examples, from my use. I use wxPython a fair bit, so I'll use that as an example.

Example 1: One can currently pass a buffer into a wx.Image constructor to create an image from existing data. For instance, you can pass in a WxHx3 numpy array of unsigned bytes to create a WxH RGB image. At the moment, all the wx.Image constructor checks is whether you've passed in the correct number of bytes. With this proposal, it could check that you passed in a WxHx3 array of bytes. How helpful would that be? It's still up to the programmer to make sure those bytes actually represent what you want. You may catch a few errors, like accidentally passing in a 3xWxH array. You also wouldn't have to pass in the size of the image you wanted -- which would be kind of nice. One could also write the Image constructor so that it could take a few different shapes of data, and do the right thing with each of them. How compelling are these advantages?
Example 2: When you need to draw something like a whole lot of pixels or a long polyline, you can currently pass into wxPython either a list of (x,y) tuples, or any sequence of (x,y) sequences. An Nx2 numpy array appears as the latter, but it ends up being slower than the list of tuples, because wxPython has some code to optimize accessing lists of tuples. Internally, wxWidgets has drawing methods that accept an Nx2 c-array of ints. With the proposed protocol, wxPython could recognize that such an array was passed in, and save a LOT of sequence unpacking, type checking, converting, etc. It could also take multiple data types -- floats, ints, etc. -- and do the right thing with each of those. This, to me, is more compelling than the Image example.

By the way, Robin Dunn has said that he doesn't want to include a numpy dependency in wxPython, but would probably accept code that did the above if it didn't add any dependencies.

Francesc Altet wrote:
That's the question -- is it an extraordinary leap over what you can now do with the existing buffer protocol? Matthew Brett wrote:
Is there already, or could there be, some sort of consortium of these that agree on the features in the PEP?
There isn't now, and that's my question -- what is the best way to involve the developers of some of the many packages that we envision using this?

- random polling by the numpy devs and users?
- more organized polling by numpy devs (or Travis)?
- a note encouraging them to pipe in on the discussion here?
- a note encouraging them to pipe in on the discussion at python-dev?

I think the PEP has far more chance of success if it's seen as a request from a variety of package developers, not just the numpy crowd (which, after all, already has numpy). -Chris
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
Christopher Barker wrote: [SNIP] projects on board would help a lot; it might also reveal some deficiencies in the proposal that we don't see yet. I've only given the PEP a quick read-through at this point, but here are a couple of comments:

1. It seems very numpy-centric. That's not necessarily bad, but I think it would help to have some outsiders look it over -- perhaps they would see things that they need that it doesn't address. Conversely, there may be universal opinion that some parts of it aren't needed, and we can strip the proposal down somewhat.

2. It seems pretty complicated. In particular, the PyDataFormatObject seems pretty complicated. This part in particular seems like it might be a hard sell, so I expect it is going to need considerably more motivation. For example:

   1. Why do we need Py_ARRAYOF? Can't we get the same effect just by using longer shape and strides arrays?
   2. Is there any type besides Py_STRUCTURE that can have names and fields? If so, which, and what do they mean? If not, you should just say that.
   3. And on this topic, why a tuple of ([names,..], {field})? Why not simply a list of (name, dfobject, offset, meta), for example? And what's the meta information if it's not Py_None? Just a string? Anything at all?

I'll try to give it a more thorough reading over the weekend. -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
It would help quite a bit. Are there any suggestions of who to recruit to review the proposal? We should not forget that the NumPy world is quite diverse as well.
I've only given the PEP a quick read-through at this point, but here are a couple of comments:
Thank you for taking the time to read through it. I know it takes precious effort to do all this, which is why it's been so slow in coming from my end. It is important to get a lot of discussion on something like this. A lot of what is in the PEP stems from discussion that's happened over the past 10 years, but admittedly some of it doesn't (the extended data-format descriptions, for example).
Yes, this is true. I took the struct module, NumPy, and ctypes as a guide for "what is needed" to be described in terms of memory.
Yes, the PyDataFormatObject is complicated --- but I don't think unnecessarily so. I've already stripped a lot away from what's in NumPy to reduce it. The question really is: how are you going to describe what an arbitrary chunk of memory represents? One could restrict it to primitive types, replace the PyDataFormatObject with the enumerated type, and just give up on describing more complicated structures. But my contention is: why? Numarray, NumPy, and ctypes have already laid a tremendous amount of groundwork in how we can represent complicated data-structures. They clearly exist, so why shouldn't we have some mechanism to describe them? Once you decide to handle complicated types, you need to replace the simple enumerated type with something that is "self-recursive" (i.e. so you can have fields of arbitrary data-types). This lends itself to some kind of structure design like the PyDataFormatObject. The only difference between what I've proposed and the ctypes approach is that ctypes overloads Python type objects (in other words, the PyDataFormatObject equivalent in ctypes is at its core a PyTypeObject, while here it is built on PyObject).
1. Why do we need Py_ARRAYOF? Can't we get the same effect just using longer shape and strides arrays?
Yes, this is true for a single data-format in isolation (and in fact exactly what you get when you instantiate in NumPy a data-type that is an array of another primitive data-type). However, how do you describe a structure whose second field is an array of a primitive type? This is where the ARRAYOF qualifier is needed. In NumPy it's actually not done this way; a separate subarray field in the data-type object is used. After studying ctypes, however, I think this approach is better.
Yes, you can add fields to a multi-byte primitive if you want. This would be similar to thinking of the data-format as a C-like union. Perhaps the data field has meaning as a 4-byte integer, but the most-significant and least-significant bytes should also be addressable individually.
The list of names is useful for having an ordered list so you can traverse the structure in field order. It is technically not necessary but it makes it a lot easier to parse a data-format object in offset order (it is used a bit in NumPy, for example). The meta information is a place holder for field tags and future growth (kind of like column headers in a spreadsheet). It started as a place to put a "longer" name or to pass along information about a field (like units) through. -Travis
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/6/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Before I can answer that, I need to ask you a question. How do you see this extension to the buffer protocol? Do you see it as a supplement to the earlier array protocol, or do you see it as a replacement? The reason I ask is that the two projects I use regularly -- wxPython and PIL -- generally operate on relatively large data chunks, and it's not clear that they would see much benefit from this mechanism versus the array protocol. I imagine that between us Chris Barker and I could hack together something for wxPython (not that I've asked him about it). And code would probably go a long way toward convincing people what a great idea this is. However, all else being equal, it'd be a lot easier to do this for the array protocol, since there's no extra infrastructure involved. [SNIP]
OK. Needed for recursive data structures -- check.
Hmm. I think I understand this somewhat better now, but I can't decide if it's cool or overkill. Is this supporting a feature that ctypes has?
Right, I got that. Between names and fields you are simulating an ordered dict. What I still don't understand is why you chose to simulate this ordered dict using a list plus a dictionary, rather than a list of tuples. This may well just be a matter of taste. However, for the small sizes I'd expect of these lists, I would expect a list of tuples to perform better than the dictionary solution. The meta information is a place holder for field tags and future growth
FWIW, the array protocol PEP seems more relevant to what I do, since I'm not so concerned with the overhead: I'm sending big chunks of data back and forth. -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Timothy Hochberg wrote:
This is a replacement for the previously described array protocol PEP. This is how I'm trying to get the array protocol into Python. In that vein, it has two purposes.

One is to make a better buffer protocol that includes a conception of an N-dimensional array in Python itself. If we can include this in Python, then we get a lot of mileage out of all the people that write extension modules for Python that should really be making their memory available as an N-dimensional array (every time I turn around there is a new wrapping of some library that is *not* using NumPy as the underlying extension). With the existence of ctypes it just starts to get worse, as nobody thinks about exposing things as arrays anymore, and so NumPy users don't get the ease of use we would have if the N-dimensional array concept were a part of Python itself.

For example, I just found the FreeImage project, which wraps a nice library using ctypes. But it doesn't have a way to expose these images as numpy arrays. Now, it would probably take me only a few hours to make the connection between FreeImage and NumPy, but I'd like to see the day when it happens without me (or some other NumPy expert) having to do all the work. If ctypes objects exposed the extended buffer protocol for appropriate types, then I wouldn't have to do anything: the wrapped structures would be exposable as arrays, and all of a sudden I say a = array(freeimobj) and I can do math on the array in Python. Or, if I'm an extension-module writer, I don't need to have NumPy (or rely on it) in order to do some computation on freeimobj in C itself. Sure, you can do it now (if the array protocol is followed --- but not many people have adopted it yet --- some have argued that it's "not in Python itself"). So, I guess, the big reason I'm pushing this is largely marketing. The buffer protocol is the "right" place to put the array protocol.

The second reason is to ensure that the buffer protocol itself doesn't "disappear" in Python 3000.
Not all the Python devs seem to really see the value of it. But, it can sometimes be unclear as to what the attitudes are.
I don't know. It's basically a situation where it's easier to support it than not, and so it's there.
Ah. I misunderstood. You are right that if I had considered needing an ordered list of names up front, this kind of thing makes more sense. I think the reason for the choice of dictionary is that I was thinking of field access as attribute look-up, which is just dictionary look-up. So, conceptually, that was easier for me. But tuples are probably less overhead (especially for small numbers of fields), at the expense of having to search for the field name on field access. But I'm trusting that dictionaries (especially small ones) are pretty optimized in Python (I haven't tested that assertion in this particular case, though).
This proposal is trying to get the array protocol *into* Python. So, this is the array protocol PEP. Anyone supportive of the array protocol should be interested in and thinking about this PEP. -Travis
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Timothy Hochberg wrote:
But is this mechanism any harder? It doesn't look like it to me. In fact, as I have written a tiny bit of Numeric extension code, this looks familiar and pretty easy to work with.
I imagine that between us Chris Barker and I could hack together something for wxPython (not that I've asked him about it).
I'm not sure when I'll find the time, but I do want to do this.
Is it that much infrastructure? It looks like this would, at the least, require an extra include file. If this flies, then will it be delivered with Python 2.6? Until then (and for older Pythons), would various extension writers all need to add this extra file to their source? And might we get a mess, with different versions floating around out there trying to interact?
That's the biggest issue, but I think a lot of us use a lot of small arrays as well -- and while I don't know if it's a performance hit worth worrying about, it's always bugged me that it is faster to convert to a Python list and then pass it in to wxPython than it is to just pass in the array directly. -Chris -- Christopher Barker, Ph.D. Oceanographer NOAA/OR&R/HAZMAT (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/9/07, Christopher Barker <Chris.Barker@noaa.gov> wrote:
Let me preface my remarks by saying that I was initially assuming this was meant as a supplement to the earlier array protocol proposal, not a replacement, as Travis subsequently explained.

But is this mechanism any harder? It doesn't look like it to me. In fact, as I have written a tiny bit of Numeric extension code, this looks familiar and pretty easy to work with.
I expect that the old proposal is easier to implement right now. We could implement the old array protocol in wxPython and have fairly seamless integration with numpy without any dependencies. To implement the new protocol, we'd need the C-API. Since that's not in Python at the moment, we'd have to include implementations of the various functions in wxPython. I suppose that wouldn't really be that bad, assuming they are already implemented somewhere. It's a bit of a chicken-and-egg problem, though.
I'm not sure. It may not be that bad. I'd guess you'd need both an include file and a source file for the implementations of the functions. It looks like this would, at the least, require an extra include file.
Possibly. It shouldn't be a big deal if the API is frozen. But I expect the best way to get this to work would be to implement it for as many projects as possible as trial patches before trying to get it into those projects officially. That way we can get some experience, tweak the API if necessary, then freeze it and release it officially. Like I said, I'll help with wxPython. I'm tempted to try PIL as well, but I've never looked at the code there, nor even tried to compile it, so I don't know how far I'd get.
You should have seen it in the old days before I sped it up. Actually, I think you probably did. Anyway, wxPython seems like low-hanging fruit in the sense that we could probably get it done without too much trouble. It's possible that Robin may not accept the patch until the relevant code goes into Python, but just having a patch available would be a useful template for other projects and would show the performance gains this approach could lead to. At least I sure hope so. Travis: does the code implementing the C API exist already, or is that something that still needs to be written? -tim
![](https://secure.gravatar.com/avatar/c7976f03fcae7e1199d28d1c20e34647.jpg?s=120&d=mm&r=g)
A few comments regarding what I think justifies some sort of standard as part of Python (understanding that there are various ways it could be done, so I'm not commenting on the specifics here directly). I don't think there is any harm in making the standard numpy-centric. In fact, I think the selling point is that this standard means that any extension can expose repetitive data to those that want to manipulate it *in Python* in a simple and standard way. While it's possible that ultimately there are those that will pass these sorts of things from one extension into another, I don't see that as a common use for a while. What it does mean is that if you want a simple Python-only way of seeing and modifying such data, all you need to do is install numpy -- you don't have to write a C extension. If the extension writers use descriptive field names, they can make the arrays they expose somewhat self-documenting, so that turning one into an array and doing a little introspection may tell you all you need to know about the data.

Rather than try to sell this as some neutral interface, I would make the numpy dependence explicit (not that it excludes extension-to-extension direct use). It may be that the developers of various extensions are not the ones most interested in this capability (after all, they've built it to do what they want), but I wouldn't be surprised if many of the users of those extensions would like it, so they can do things the extension doesn't allow them to do. So one approach is to see what the respective user communities think about such capabilities. If they find out what this can do for them, they may pressure the developers for such support. People in the numpy community can also volunteer to implement the standard (though with this approach it's a bit of a chicken-and-egg thing, as someone has mentioned: you can't do it if it isn't in Python yet).
I do agree that the most persuasive approach would be to have at least some of the 3rd party extensions support this explicitly on the python-dev list. Perry
![](https://secure.gravatar.com/avatar/2d1562d092a4d90284163439d5596556.jpg?s=120&d=mm&r=g)
On Thursday 04 January 2007 19:36, Travis Oliphant wrote:
Two more places to look for projects that may be interested: SQL wrappers, such as Psycopg2, and the Python DB API 2.0 community QuantLib (see the message below from the enthought-dev mailing list.) On Saturday 03 February 2007 00:23, Prabhu Ramachandran wrote:
![](https://secure.gravatar.com/avatar/b24e93182e89a519546baa7bafe054ed.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
It would help me to understand the proposal if it could be explained in terms of the methods of the existing buffer class/type: ['__add__', '__class__', '__cmp__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__getattribute__', '__getitem__', '__getslice__', '__hash__', '__init__', '__len__', '__mul__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__'] Numpy extends numarray's type/dtype object. This proposal appears to revert to the old letter codes. I have had very limited experience with C. Colin W.
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Colin J. Williams wrote:
It extends what is done in the array and struct modules of Python. The old letter codes are useful at the C level. They are 'hidden' behind an enumeration, however, and so should not be a big deal. But the letter codes are still useful in other contexts.
I have had very limited experience with C.
Then this proposal will not be meaningful for you. This is a proposal to extend something on the C-level. There is nothing on the Python level suggested by this proposal. -Travis
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
I'm wondering if having the buffer object specify the view is the right choice. I think the best choice is to separate the design into:

buffer: provides an interface to memory
array: provides a view of memory as an array of whatever dimensions

1. buffer may or may not map to contiguous memory.
2. multiple views of the same memory can be shared. These different views could represent different slicings.
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
Several extensions to Python utilize the buffer protocol to share the location of a data-buffer that is really an N-dimensional array. However, there is no standard way to exchange the additional N-dimensional array information so that the data-buffer is interpreted correctly. I am questioning whether this is the best concept. It says that the data-buffer will carry the information about its interpretation as an N-dimensional array. I'm thinking that a buffer is just an interface to memory, and that the interpretation as an array of n dimensions, for example, is best left to the application. I might want to at one time view the data as n-dimensional, but at another time as 1-dimensional, for example.
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On 1/5/07, Stefan van der Walt <stefan@sun.ac.za> wrote:
I think Neal is suggesting some object that basically does nothing but hold a pointer (or pointers) to memory. This memory can be used in various ways, one of which is to use it to construct another type of object that provides a view with indices and such, i.e., an array. That way the memory isn't tied to arrays and could conceivably be used in other ways. The idea is analogous to the data/model/view paradigm. It is a bit cleaner than just ignoring the array parts. Chuck
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Neal Becker wrote:
The simple data-buffer interpretation is still there. You can still use the simple "chunk-of-memory"-only interpretation of the buffer. All we are doing is adding a way for applications to ask if the object can be interpreted as a strided N-dimensional array of a particular data-format. So, this proposal does nothing to jeopardize the buffer-as-an-interface-to-memory-only model. I'm only using a table of function pointers which is already there (tp_as_buffer) rather than requesting an additional table of function pointers on the type object (tp_as_array). I see the array view idea as fitting very nicely with the buffer protocol. -Travis
![](https://secure.gravatar.com/avatar/764323a14e554c97ab74177e0bce51d4.jpg?s=120&d=mm&r=g)
Neal Becker wrote:
Sure, but you need a standard way to communicate that extra information between different parts of your code and also between different third party libraries. That is what this PEP intends to provide. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/4/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
still be used to describe a complicated block of memory to another user.
Thinking of the scope "seamless data exchange between modules", my concern with this PEP is that it might be too focused on "block of memory" rather than "access to data". Data that can be interpreted as an n-dimensional array doesn't necessarily have to be represented directly as a block of memory. Example 1: we have a very large amount of data with a compressed internal representation. Example 2: we might want to generate data "on the fly" as it's needed. Example 3: if module creators have to deal with different byte alignments, contiguousness etc., it'll lead to lots of code duplication and a lot of unnecessary work. Is it possible to add a data-access API to this PEP? Direct memory access could be available through this API with a function that returns the memory address (or NULL if not available). We could have a default implementation for basic types, with the option for module creators to override it. The problem with this, if we stick to the buffer protocol, is that it breaks the concept "buffer is memory", if that ever was a valid one. This is of minor concern for me though.
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Torgil Svensson wrote:
Could you give an example of what you mean? I have no problem with such a concept. I'm mainly interested in getting the NumPy memory model into Python some-how. I know it's not the "only" way to think about memory, but it is a widely-used and useful way. -Travis
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/11/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Sure. I'm not objecting to the memory model; what I mean is that data access between modules has a wider scope than just a memory model. Maybe I'm completely out of scope here; I thought this was worth considering for the inter-module data-sharing scope. Say we want to access a huge array with 1 million text strings from another module that has a compressed representation in memory. Here's a pseudo-code example with most of the details completely made up:

```c
buffer = AnotherModule_GetBigArrayAsBuffer()
aview = buffer->bf_getarrayview()
indexes = NewList()
for (i = 0; i < aview->shape[0]; ++i)
    for (j = 0; j < aview->shape[1]; ++j) {
        item = aview->get_from_index(i, j)
        /* item represents the data described by the PyDataFormatObject */
        if (is_interesting_item(item))
            ListAdd(indexes, NewList(i, j))
    }
indexarr = Numpy_ArrayFromLists(indexes)
```

Here, we don't have to care about any data-layout issues; the called module could even produce data on the fly. If I want direct memory access we could use a function that returns data, strides and flags.
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On 1/11/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
This is where separating the memory block from the API starts to show advantages. OTOH, we should try to keep this all as simple and basic as possible. Trying to design for every potential use will lead to over-design; it is a fine line to walk. <snip> Chuck
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/11/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
I agree. I'm trying to look after a use case of my own here where I have a huge array (that won't fit in memory) with data that is very easy to compress (it easily fits in memory). OTOH, I have as yet no need to share this between modules, but a simple data-access API opens up a variety of options. In my mindset, I can slice and dice my huge array and the implementation behind the data-access API will choose between having the views represented internally as intervals or lists of indexes. So I'm +1 for having all information concerning nd-array access on a logical level (shapes) in one API and letting the memory-layout details (strides, FORTRAN, C etc.) live in another API; a module that wants to try to skip the API overhead (numpy) can always do something like:

```c
memory_interface = array_interface->get_memory_layout();
if (memory_interface) {
    /* ... use memory_interface->strides ... etc. */
} else {
    /* ... use array_interface->get_item_from_index() ... etc. */
}
```

I'm guessing that most of the modules trying to access an array will choose to go through numpy for fast operations. Another use of an API is to do things like give an "RGB" view of an image, regardless of whatever weird image format lies below, without having to convert the whole image in memory and lose precision or memory. If we want the whole in-memory RGB copy we could just take the RGB view, pass it to numpy and force numpy to do a copy. The module can then, in either case, operate on the image through numpy or return a numpy object to the user. (numpy is of course integrated in Python by then.)
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Torgil Svensson wrote:
I think this is a good idea generally. I think the PIL would be much more open to this kind of API because the memory model of the PIL is different than ours. On the other hand, I think it would be a shame not to provide a basic N-d array memory model like NumPy has, because it is used so often.
I had originally thought to separate these out in to multiple calls anyway. Perhaps we could propose the same thing. Have a full struct interface as one option and a multiple-call interface like you propose as another.
array_interface->get_block_from_slice() ? Such a thing would be very useful for all kinds of large data-sets, from images, and videos, to scientific data-sets.
Getting this array_interface into Python goes a long way into making that happen, I think. -Travis
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
I believe we are converging, and this is pretty much the same design as I advocated. It is similar to boost::ublas. Storage is one concept. Interpretation of the storage is another concept. Numpy is a combination of a storage and interpretation. Storage could be dense or sparse. Allocated in various ways. Sparse can be implemented in different ways. Interpretation can be 1-d, 2-d. Zero-based, non-zero based. Also there is question of ownership (slices).
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Neal Becker wrote:
I believe we are converging, and this is pretty much the same design as I advocated. It is similar to boost::ublas.
I'm grateful to hear that. It is nice when ideas come from several different corners.
How do we extend the buffer interface then? Do we have one API that allows sharing of storage and another that handles sharing of interpretation? How much detail should be in the interface regarding storage details? Is there a possibility of having at least a few storage models "shareable" so that memory can be shared by others that view the data in the same way? -Travis
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
How about:

1. A memory concept, of which buffer is an example.
2. A view concept.
3. A variety of common concrete types composing 1+2.

So then, how do we use buffer in this scheme? I'm thinking that buffer isn't really the best thing to build on - but within this scheme buffer is a kind of memory (assuming it provides/could_be_made_to_provide the required interface). The view is not part of buffer, (as was proposed) but a separate piece. Still, I agree that we want a commonly used array object that includes both the memory and the view. I propose that we build it out of these more generic pieces, but also provide commonly used compositions of these pieces. I think this satisfies the desire for a self-describing array component, while allowing more flexibility and serving a wider usage.
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/12/07, Travis Oliphant <oliphant@ee.byu.edu > wrote:
I'm concerned about the direction that this PEP seems to be going. The original proposal was borderline too complicated IMO, and now it seems headed in the direction of more complexity. Also, it seems that there are three different goals getting conflated here. None are bad, but they don't, and probably shouldn't, all be addressed by the same PEP.

1. Allowing producers and consumers of blocks of data to share blocks efficiently. This is half of what the original PEP proposed.
2. Describing complex data types at the C level. This is the other half of the PEP[1].
3. Things that act like arrays, but have different storage methods. The details of this still seem pretty vague, but to the extent that I can figure them out, it doesn't seem useful or necessary to tie this into the rest of the array interface PEP. For example, "array_interface->get_block_from_slice()" has been mentioned. Why that instead of "PyObject_AsExtendedBuffer(PyObject_GetItem(index), ....)"[2]?

I'll stop here, till I see some more details of what people have in mind, but at this point, I think that alternative memory models are a different problem that should be addressed separately. Sadly, I'm leaving town shortly and I'm running out of time, so I'll have to leave my objections in this somewhat vague state. Oh, the way that F. Lundh plans to expose PIL's data a chunk at a time is mentioned in this python-dev summary: http://www.python.org/dev/summary/2006-11-01_2006-11-15/ It doesn't seem necessary to have special support for this; all that is necessary is for the object returned by acquire_view to support the extended array protocol.

[1] Remind me again why we can't simply use ctypes for this? It's already in the core. I'm sure it's less efficient, but you shouldn't need to parse the data structure information very often. I suspect that something that leveraged ctypes would meet less resistance.
[2] Which reminds me. 
I never saw in the PEP what the actual call in the buffer protocol was supposed to look like. Is it something like: PyObject_AsExtendedBuffer(PyObject * obj, void **buffer, Py_ssize_t *buffer_len, funcptr *bf_getarrayview, funcptr *bf_relarrayview) ? -- //=][=\\ tim.hochberg@ieee.org
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/12/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
Looks like an array, acts like an array, smells like an array = is an array
What is an "Extended buffer" ? Connecting that to array information doesn't feel intuitive.
I agree
[1] Remind me again why we can't simply use ctypes for this?
1. ctypes is designed for "C types", not "array layout" 2. managing/creating complex formats in ctypes deviates considerably from the clean, intuitive and simple (compared to dtypes) => ugly code 3. Can ctypes handle anonymous lambda function pointers?
the core. I'm sure it's less efficient, but you shouldn't need to parse the data structure information very often.
I believe that'll be more common than you think; for example, dynamically creating/combining/slicing recarrays with various data. //Torgil
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/12/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
I was unclear here. I didn't mean like that it would be infrequent in "once a month" sense. I meant that you would only need to look at the data structure information once per set of data that you are accessing and that you would typically extract many chunks of data from each set, so the amortized cost of parsing the data structure would be small. Trying to get out the door.... -- //=][=\\ tim.hochberg@ieee.org
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Well at least people are talking about what they would like to see. But, I think we should rein in the discussion.
I'm leaning this way too.
Yes, I agree.
Two reasons: 1) ctypes wasn't designed for this purpose specifically and leaves out certain things 2) ctypes uses many Python types instead of just a single Python type (the PyDataFormatObject).
No, not like that. The bf_getarrayview function pointers hang off of the as_buffer method table which is pointed to by the type object. You could always access the API using those function pointers, but it is more traditional to use an API call which adds some checking to make sure the function pointer is there before calling it. I don't know if I go into a lot of detail there, but I should probably add more. PEPs are rather "expensive" for me in terms of how much immediate benefit the change to Python is to me personally versus the time spent writing them. The benefit here is much more long term in establishing a useful data-model that could be used by a lot of applications in Python to exchange data (and help ameliorate the proliferation of objects in Python that are essentially, and should be, NumPy arrays). -Travis
![](https://secure.gravatar.com/avatar/5c9fb379c4e97b58960d74dcbfc5dee5.jpg?s=120&d=mm&r=g)
Talking about the difference between the memory access model and the array API, maybe I am talking nonsense (I know next to nothing about these problems), but couldn't an efficient tree data structure be implemented on the memory buffer object? I am pretty sure a simple read-only tree could; as for a tree that is edited, I am not so sure. Anyhow, read-only trees are used a lot by some people. A lab next to mine uses them to describe results from their experiments. They store events in tree-like structures (I have been told they copied that from CERN). They can then walk through the tree in a very efficient way, and do statistical analysis on their collection of events. I am not sure if this can fit anywhere in the PEP, but it would sure enlarge its scope. Please enlighten me. Gaël
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On 1/12/07, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
Trees are nice, but they are not efficient for array-type data. Traversing a tree usually requires some sort of stack (recursion), and a tree is not well structured for addressing data using indices. They just aren't appropriate for arrays; arrays are better represented by some sort of lattice.
Probably from ROOT?
There is probably a tree module somewhere for Python. Chuck
![](https://secure.gravatar.com/avatar/5c9fb379c4e97b58960d74dcbfc5dee5.jpg?s=120&d=mm&r=g)
On Fri, Jan 12, 2007 at 12:44:15AM -0700, Charles R Harris wrote:
Yes, indeed. I was just wondering if the PEP could be used for a performant implementation of trees. Basically that is mapping a tree to an array, which is possible. As far as performance goes, I think this is not performant at all when modifying the tree, but I do not know whether it is possible to traverse the tree efficiently when it is mapped to an array.
Probably from ROOT?
Yes. It seems like nice software for such things. The problem with it is that you have to learn C++, and experience shows that not everybody in an experimental lab is willing to do so. Gaël
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Gael Varoquaux wrote:
Yes, indeed. I was just wondering if the PEP could be used for a performant implementation of trees.
That would be a whole new PEP, and one we're not the least bit ready for.
Basicaly that is mapping a tree to an array, which is possible.
Possible, but probably not very useful for dense data -- maybe for sparse arrays? The idea of an array API, rather than (actually, in addition to) an array data structure is fabulous! It could be used for sparse arrays, for instance. I do think it's a topic for another PEP, and probably not even a PEP until we have at least some working code - maybe a sparse array and/or PIL image?
I think a slicing API is critical -- at least at the Python level, though at the C level it would sure be nice, and probably could allow for some good optimizations for getting a "block" of data out of some odd data structure. "Simple is better than complex." "Although practicality beats purity." "Now is better than never." Tells me that we should just focus on the array data structure for the PEP now. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis, First, thanks for doing this -- Python really needs it!
Ah, I do like reducing that overhead -- I know I use arrays a lot for small data sets too, so that overhead can be significant. I'm not well qualified to review the tech details, but to make sure I have this right:
So if I have some C code that wants to use any array passed in, I can just call: bf_getarrayview(obj) and if it doesn't return NULL, I have a valid array that I can query to see if it fits what I'm expecting. Have I got that right? If so, this would be great. By the way, how compatible is this with the existing buffer protocol? -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Christopher Barker wrote:
Yes, you could call this (but you would call it from the type object, like this: obj->ob_type->tp_as_buffer->bf_getarrayview(obj)). Or more likely (and I should add this to the C-API) you would call PyArrayView_FromObject(obj), which does this under the covers.
It's basically orthogonal. In other-words, if you defined the array view protocol you would not need the buffer protocol at all. But you could easily define both. -Travis
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
yes, that's what I'm looking for -- please do add that to the C-API
OK, so if one of these were passed into something expecting the buffer protocol, it wouldn't work, but you could make an object conform to both protocols at once -- like numpy does now, I suppose -- very nice. Another question: is this new approach in response to feedback from Guido and/or other Python devs? This sure seems like a good way to go -- though it seems from the last discussion I followed on python-dev, most of the devs just didn't get how useful this would be! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
The new approach is a response to the devs and to Sasha, who had some relevant comments. Yes, I agree that many devs don't "get" how useful it would be because they have not written scientific or graphics-intensive applications. However, Guido is the one who encouraged me at SciPy 2006 to push this, so I think he is generally favorable to the idea. The Python devs will definitely push back. The strongest opposition seems to be from people that don't 'get' it and so don't want "dead interfaces" in Python. They would need to be convinced of how often such an interface would actually get used. I've tried to do that in the rationale, but what would help is many people actually posting to python-dev to support the basic idea (you don't have to support the specific implementation --- most are going to be uncomfortable knowing enough to take a stand). However, there is a great need for people to stand up and say: "We need something like this in Python..." -Travis
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
good sign.
well, none of us want dead interfaces in Python.
let us know when there is a relevant thread to chime in on. However, what we really need is not people like me saying "I need this", but rather people that develop significant existing extension packages saying they'll actually use this in their package. People like:

wxPython -- Robin Dunn
PIL -- Fredrik Lundh
PyOpenGL -- Who?
PyObjC -- would it be useful there? (Ronald Oussoren)
MatplotLib (but maybe it's already married to numpy...)
PyGtk ?

Who else? I know Robin Dunn is interested in using it in wxPython -- but probably only if someone contributes the code. I hope to do that some day, but I'm only barely qualified to do so. Fredrik accepted your submission of code to use the array interface in PIL, but he seemed skeptical of the idea. Perhaps lobbying (or even just surveying) some of these folks would be useful. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Christopher Barker wrote:
It's a good start, but there is also PyMedia, PyVoxel, any video-library interface writers, any audio-library interface writers. Anybody who wants to wrap/write code that does some kind of manipulation on a chunk of data of a specific data-format. There are so many people who would use it that I don't feel qualified to speak for them all. -Travis
![](https://secure.gravatar.com/avatar/5c7407de6b47afcd3b3e2164ff5bcd45.jpg?s=120&d=mm&r=g)
A Divendres 05 Gener 2007 01:36, Travis Oliphant escrigué:
Yeah. I think this is the case for PyTables. However, the PyTables case should be similar to matplotlib: it needs so many features of NumPy that it is hardly conceivable it could live with just an implementation of the array interface. In any case, I think that if the PEP succeeds, it would represent an extraordinary leap towards efficient data interchange in applications that don't need (or are reluctant to include) NumPy for their normal operation. Cheers, --
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
right -- I didn't intend that to be a comprehensive list.
I think this is key -- we all know that there are a lot of people that *could* use it, and we might even say *should* use it. The question that I think the core Python devs want answered is: *will* they use it? That's why I suggest that rather than having a bunch of numpy users make comments to python-dev, we really need authors of packages like the above to make comments to python-dev, saying "I could use this, and I *will* use this if it's put into the standard lib". I do think there is one issue that does need to be addressed. The current buffer protocol already allows modules to share data without copying -- but it doesn't provide any description of that data. This proposal would provide more description of that data, but still not describe it completely -- that's just not possible. So how helpful is the additional description? I think a lot, but others are not convinced. Some examples, from my use. I use wxPython a fair bit, so I'll use that as an example. Example 1: One can now pass a buffer into a wxImage constructor to create an image from existing data. For instance, you can pass in a WxHx3 numpy array of unsigned bytes to create a WxH RGB image. At the moment, all the wx.Image constructor checks is whether you've passed in the correct number of bytes. With this proposal, it could check to see if you passed in a WxHx3 array of bytes. How helpful would that be? It's still up to the programmer to make sure those bytes actually represent what you want. You may catch a few errors, like accidentally passing in a 3xWxH array. You also wouldn't have to pass in the size of the image you wanted -- which would be kind of nice. One could also write the Image constructor so that it could take a few different shapes of data, and do the right thing with each of them. How compelling are these advantages? 
Example 2: When you need to draw something like a whole lot of pixels or a long polyline, you can now pass into wxPython either a list of (x,y) tuples or any sequence of (x,y) sequences. An Nx2 numpy array appears as the latter, but it ends up being slower than the list of tuples, because wxPython has some code to optimize accessing lists of tuples. Internally, wxWidgets has drawing methods that accept an Nx2 C array of ints. With the proposed protocol, wxPython could recognize that such an array was passed in, and save a LOT of sequence unpacking, type checking, converting, etc. It could also take multiple data types -- floats, ints, etc. -- and do the right thing with each of those. This to me is more compelling than the Image example. By the way, Robin Dunn has said that he doesn't want to include a numpy dependency in wxPython, but would probably accept code that did the above if it didn't add any dependencies. Francesc Altet wrote:
That's the question -- is it an extraordinary leap over what you can now do with the existing buffer protocol? Matthew Brett wrote:
Is there already, or could there be, some sort of consortium of these that agree on the features in the PEP?
There isn't now, and that's my question -- what is the best way to involve the developers of some of the many packages that we envision using this?

- random polling by the numpy devs and users?
- more organized polling by numpy devs (or Travis)?
- just a note encouraging them to pipe in on the discussion here?
- a note encouraging them to pipe in on the discussion at python-dev?

I think the PEP has far more chance of success if it's seen as a request from a variety of package developers, not just the numpy crowd (which, after all, already has numpy). -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
Christopher Barker wrote: [SNIP] projects on board would help a lot; it might also reveal some deficiencies in the proposal that we don't see yet. I've only given the PEP a quick read-through at this point, but here are a couple of comments:

1. It seems very numpy-centric. That's not necessarily bad, but I think it would help to have some outsiders look it over -- perhaps they would see things that they need that it doesn't address. Conversely, there may be universal opinion that some parts of it aren't needed, and we can strip the proposal down somewhat.
2. It seems pretty complicated. In particular, the PyDataFormatObject seems pretty complicated. This part in particular seems like it might be a hard sell, so I expect this is going to need considerably more motivation. For example:
   1. Why do we need Py_ARRAYOF? Can't we get the same effect just using longer shape and strides arrays?
   2. Is there any type besides Py_STRUCTURE that can have names and fields? If so, which, and what do they mean? If not, you should just say that.
   3. And on this topic, why a tuple of ([names,..], {field})? Why not simply a list of (name, dfobject, offset, meta), for example? And what's the meta information if it's not PyNone? Just a string? Anything at all?

I'll try to give it a more thorough reading over the weekend. -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Tim Hochberg wrote:
It would help quite a bit. Are there any suggestions of who to recruit to review the proposal? We should not forget that the NumPy world is quite diverse as well.
I've only given the PEP a quick read through at this point, but here a couple of comments:
Thank you for taking the time to read through it. I know it takes precious effort to do all this, which is why it's been so slow in coming from my end. It is important to get a lot of discussion on something like this. A lot of what is in the PEP does stem from discussion that's happened over the past 10 years, but admittedly some of it doesn't (extended data-format descriptions, for example).
Yes, this is true. I took the struct module, NumPy, and c-types as a guide for "what is needed" to be described in terms of memory.
Yes, the PyDataFormatObject is complicated --- but I don't think unnecessarily so. I've already stripped away a lot of what's in NumPy to reduce it. The question really is: how are you going to describe what an arbitrary chunk of memory represents? One could restrict it to primitive types, replace the PyDataFormatObject with the enumerated types, and just give up on describing more complicated structures. But my contention is: why? Numarray, NumPy, and ctypes have already laid a tremendous amount of groundwork in how we can represent complicated data-structures. They clearly exist, so why shouldn't we have some mechanism to describe them? Once you decide to handle complicated types, you need to replace the simple enumerated type with something that is "self-recursive" (i.e., so you can have fields of arbitrary data-types). This lends itself to some kind of structure design like the PyDataFormatObject. The only difference between what I've proposed and the ctypes approach is that ctypes overloads Python type objects. (In other words, the equivalent of the PyDataFormatObject in ctypes is at its core a PyTypeObject, while here it is built on PyObject.)
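The "self-recursive" requirement Travis describes can be sketched in a few lines of Python. Everything here is illustrative -- the class and attribute names are loosely modeled on the PyDataFormatObject in the PEP but are not part of the proposal:

```python
# Hypothetical sketch of a self-recursive data-format object.  A field's
# description is itself a DataFormat, which is what lets structures nest
# to arbitrary depth (the key point of Travis's argument above).

class DataFormat:
    def __init__(self, kind, itemsize, names=None, fields=None):
        self.kind = kind          # e.g. 'int', 'float', 'structure'
        self.itemsize = itemsize  # total size in bytes
        self.names = names or []  # field order, for offset-order traversal
        # fields maps name -> (DataFormat, offset, meta); the DataFormat
        # value is what makes the description self-recursive.
        self.fields = fields or {}

int32 = DataFormat('int', 4)
float64 = DataFormat('float', 8)

# A structure whose second field is itself a structure:
point = DataFormat('structure', 16,
                   names=['x', 'y'],
                   fields={'x': (float64, 0, None),
                           'y': (float64, 8, None)})
record = DataFormat('structure', 20,
                    names=['id', 'pos'],
                    fields={'id': (int32, 0, None),
                            'pos': (point, 4, None)})

# Traverse in field order using the names list:
offsets = [record.fields[n][1] for n in record.names]
print(offsets)  # [0, 4]
```

A flat enumeration of primitive types could not express the `pos` field here; the recursion is what buys the extra descriptive power.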
1. Why do we need Py_ARRAYOF? Can't we get the same effect just using longer shape and strides arrays?
Yes, this is true for a single data-format in isolation (and in fact exactly what you get when you instantiate, in NumPy, a data-type that is an array of another primitive data-type). However, how do you describe a structure whose second field is an array of a primitive type? This is where the ARRAYOF qualifier is needed. In NumPy it's actually not done this way; a separate subarray field in the data-type object is used instead. After studying ctypes, however, I think this approach is better.
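The stdlib `struct` module can illustrate why the array qualifier can't be folded into the outer shape and strides -- the array lives *inside* one item of the outer type (a hedged sketch; the struct layout is made up for the example):

```python
import struct

# A C struct like:
#     struct rec { int32_t id; double samples[4]; };
# packed without padding.  The "4d" inside the format string plays the
# role of the ARRAYOF qualifier: the 4-element array is part of a single
# item of the outer structure, so enlarging the outer shape/strides
# cannot describe it.
fmt = '=i4d'
print(struct.calcsize(fmt))   # 36 = 4 + 4*8

packed = struct.pack(fmt, 7, 1.0, 2.0, 3.0, 4.0)
unpacked = struct.unpack(fmt, packed)
print(unpacked)  # (7, 1.0, 2.0, 3.0, 4.0)
```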
Yes, you can add fields to a multi-byte primitive if you want. This would be similar to thinking about the data-format as a C-like union. Perhaps the data-field has meaning as a 4-byte integer but the most-significant and least-significant bytes should also be addressable individually.
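The union-like overlay Travis describes -- a 4-byte integer whose most- and least-significant bytes are individually addressable -- can be shown with plain Python byte operations (a sketch only; no PEP machinery involved):

```python
# One 4-byte big-endian integer, viewed both as a whole and as its
# individual bytes -- the same memory, two overlapping "fields".
value = 0x12345678
raw = value.to_bytes(4, 'big')

msb, lsb = raw[0], raw[-1]
print(hex(msb), hex(lsb))  # 0x12 0x78

# The same four bytes reinterpreted as the full integer again:
assert int.from_bytes(raw, 'big') == value
```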
The list of names is useful for having an ordered list so you can traverse the structure in field order. It is technically not necessary, but it makes it a lot easier to parse a data-format object in offset order (it is used a bit in NumPy, for example). The meta information is a placeholder for field tags and future growth (kind of like column headers in a spreadsheet). It started as a place to put a "longer" name, or to pass along information about a field (like units). -Travis
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/6/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Before I can answer that, I need to ask you a question. How do you see this extension to the buffer protocol? Do you see it as a supplement to the earlier array protocol, or do you see it as a replacement? The reason I ask is that the two projects I use regularly, wxPython and PIL, generally operate on relatively large data chunks, and it's not clear that they would see much benefit from this mechanism versus the array protocol. I imagine that between us Chris Barker and I could hack together something for wxPython (not that I've asked him about it). And code would probably go a long way to convincing people what a great idea this is. However, all else being equal, it'd be a lot easier to do this for the array protocol, since there's no extra infrastructure involved. [SNIP]
OK. Needed for recursive data structures, check.
Hmm. I think I understand this somewhat better now, but I can't decide if it's cool or overkill. Is this a supporting a feature that ctypes has?
Right, I got that. Between names and fields you are simulating an ordered dict. What I still don't understand is why you chose to simulate this ordered dict using a list plus a dictionary, rather than a list of tuples. This may well just be a matter of taste. However, for the small sizes I'd expect of these lists, I would expect a list of tuples to perform better than the dictionary solution.
FWIW, the array protocol PEP seems more relevant to what I do, since I'm sending big chunks of data back and forth and am not much concerned with the per-access overhead. -tim
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Timothy Hochberg wrote:
This is a replacement for the previously described array protocol PEP. This is how I'm trying to get the array protocol into Python. In that vein, it has two purposes. One is to make a better buffer protocol that includes a conception of an N-dimensional array in Python itself. If we can include this in Python, then we get a lot of mileage out of all the people who write extension modules for Python that should really be making their memory available as an N-dimensional array (every time I turn around there is a new wrapping of some library that is *not* using NumPy as the underlying extension). With the existence of ctypes it just gets worse, as nobody thinks about exposing things as arrays anymore, and so NumPy users don't get the ease of use we would have if the N-dimensional array concept were a part of Python itself.

For example, I just found the FreeImage project, which wraps a nice library using ctypes. But it doesn't have a way to expose these images as numpy arrays. Now, it would probably take me only a few hours to make the connection between FreeImage and NumPy, but I'd like to see the day when it happens without me (or some other NumPy expert) having to do all the work. If ctypes objects exposed the extended buffer protocol for appropriate types, then I wouldn't have to do anything, because the wrapped structures would be exposable as arrays, and all of a sudden I say a = array(freeimobj) and I can do math on the array in Python. Or, if I'm an extension module writer, I don't need to have NumPy (or rely on it) in order to do some computation on freeimobj in C itself. Sure, you can do it now (if the array protocol is followed --- but not many people have adopted it yet --- some have argued that it's "not in Python itself"). So, I guess, the big reason I'm pushing this is largely marketing: the buffer protocol is the "right" place to put the array protocol. The second reason is to ensure that the buffer protocol itself doesn't "disappear" in Python 3000.
Not all the Python devs seem to really see the value of it. But, it can sometimes be unclear as to what the attitudes are.
I don't know. It's basically a situation where it's easier to support it than to not and so it's there.
Ah, I misunderstood. You are right that if I had considered needing an ordered list of names up front, this kind of thing would make more sense. I think the reason for the choice of a dictionary is that I was thinking of field access as attribute look-up, which is just dictionary look-up, so conceptually that was easier for me. But tuples probably have less overhead (especially for small numbers of fields), at the expense of having to search for the field name on each access. I'm trusting that dictionaries (especially small ones) are pretty well optimized in Python (though I haven't tested that assertion in this particular case).
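The two field layouts under discussion can be put side by side in plain Python (the field contents are made up; neither layout is normative):

```python
# (a) list of names + dict, as in the PEP draft: O(1) lookup by name.
#     Both describe: 'id' (int32) at offset 0, 'val' (float64) at offset 4.
names = ['id', 'val']
fields = {'id': ('i4', 0, None), 'val': ('f8', 4, None)}

# (b) flat list of tuples, as Tim suggests: simpler structure, but name
#     lookup is a linear search.
field_list = [('id', 'i4', 0, None), ('val', 'f8', 4, None)]

# Both support traversal in offset order:
order_a = [fields[n][1] for n in names]
order_b = [offset for _, _, offset, _ in field_list]
assert order_a == order_b == [0, 4]

# Name lookup, the operation whose cost differs between the two:
assert fields['val'][0] == 'f8'
assert next(t for t in field_list if t[0] == 'val')[1] == 'f8'
```

For the handful of fields typical of a record type, either layout is cheap; the dict wins only when structures carry many fields and are accessed by name often.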
This proposal is trying to get the array protocol *into* Python. So, this is the array protocol PEP. Anyone supportive of the array protocol should be interested in and thinking about this PEP. -Travis
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Timothy Hochberg wrote:
But is this mechanism any harder? It doesn't look like it to me. In fact, as I have written a tiny bit of Numeric extension code, this looks familiar and pretty easy to work with.
I imagine that between us Chris Barker and I could hack together something for wxPython (not that I've asked him about it).
I'm not sure when I'll find the time, but I do want to do this.
Is it that much infrastructure? It looks like this would, at the least, require an extra include file. If this flies, will that be delivered with Python 2.8? Until then (and for older Pythons), would various extension writers all need to add this extra file to their source? And might we get a mess, with different versions floating around out there trying to interact?
That's the biggest issue, but I think a lot of us use a lot of small arrays as well -- and while I don't know if it's a performance hit worth worrying about, it's always bugged me that it is faster to convert an array to a Python list and pass that to wxPython than it is to just pass in the array directly. -Chris -- Christopher Barker, Ph.D. Oceanographer, NOAA/OR&R/HAZMAT
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/9/07, Christopher Barker <Chris.Barker@noaa.gov> wrote:
Let me preface my remarks by saying that I was initially assuming this was meant as a supplement to the earlier array protocol proposal, not a replacement, as Travis subsequently explained.

But is this mechanism any harder? It doesn't look like it to me. In fact, as I have written a tiny bit of Numeric extension code, this looks familiar and pretty easy to work with.
I expect that the old proposal is easier to implement right now. We could implement the old array protocol in wxPython and have fairly seamless integration with numpy without any dependencies. To implement the new protocol, we'd need the C-API. Since that's not in Python at the moment, we'd have to include implementations of the various functions in wxPython. I suppose that wouldn't really be that bad, assuming they are already implemented somewhere. It's a bit of a chicken-and-egg problem, though.
I'm not sure. It may not be that bad. I'd guess you'd need both an include file and a source file for the implementations of the functions.
Possibly. It shouldn't be a big deal if the API is frozen. But I expect the best thing to get this to work would be to implement this for as many projects as possible as trial patches before trying to get this into those projects officially. That way we can get some experience, tweak the API if necessary, then freeze it and release it officially. Like I said, I'll help with wxPython. I'm tempted to try with PIL as well, but I've never looked at the code there, not even tried to compile it, so I don't know how far I'd get.
You should have seen it in the old days before I sped it up. Actually, I think you probably did. Anyway, it seems like wxPython is low-hanging fruit in the sense that we could probably get it done without too much trouble. It's possible that Robin may not accept the patch until the relevant code goes into Python, but just having a patch available would be a useful template for other projects, and would show the performance gains this approach would lead to. At least I sure hope so. Travis: does the code implementing the C API exist already, or is that something that still needs to be written? -tim
![](https://secure.gravatar.com/avatar/c7976f03fcae7e1199d28d1c20e34647.jpg?s=120&d=mm&r=g)
A few comments regarding what I think justifies some sort of standard being part of Python (understanding that there are various ways it could be done, so I'm not commenting on the specifics here directly). I don't think there is any harm in making the standard numpy-centric. In fact, I think the selling point is that this standard means that any extension can expose repetitive data to those who want to manipulate it *in Python* in a simple and standard way. While it's possible that ultimately there are those who will pass these sorts of things from one extension into another, I don't see that as a common use for a while. What it does mean is that if you want a simple Python-only way of seeing and modifying such data, all you need to do is install numpy. You don't have to write a C extension. If extension writers use descriptive field names, they can make the arrays they expose somewhat self-documenting, so that turning one into an array and doing a little introspection may tell you all you need to know about the data. Rather than try to sell this as some neutral interface, I would make the numpy dependence explicit (not that it excludes extension-to-extension direct use). It may be that the developers of various extensions are not the ones most interested in this capability (after all, they've built it to do what they want), but I wouldn't be surprised if many users of those extensions would like it so they can do things the extension doesn't allow. So one approach is to see what the respective user communities think about such capabilities. If they find out what this can do for them, they may pressure the developers for such support. People in the numpy community can also volunteer to implement the standard (but with this approach it's a bit of a chicken-and-egg thing, as someone has mentioned: you can't do it if it isn't in Python yet).
I do agree that the most persuasive approach would be to have at least some of the 3rd party extensions support this explicitly on the python-dev list. Perry
![](https://secure.gravatar.com/avatar/2d1562d092a4d90284163439d5596556.jpg?s=120&d=mm&r=g)
On Thursday 04 January 2007 19:36, Travis Oliphant wrote:
Two more places to look for projects that may be interested: SQL wrappers, such as Psycopg2, and the Python DB API 2.0 community QuantLib (see the message below from the enthought-dev mailing list.) On Saturday 03 February 2007 00:23, Prabhu Ramachandran wrote:
![](https://secure.gravatar.com/avatar/b24e93182e89a519546baa7bafe054ed.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
It would help me to understand the proposal if it could be explained in terms of the methods of the existing buffer class/type: ['__add__', '__class__', '__cmp__', '__delattr__', '__delitem__', '__delslice__', '__doc__', '__getattribute__', '__getitem__', '__getslice__', '__hash__', '__init__', '__len__', '__mul__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__setitem__', '__setslice__', '__str__'] Numpy extends numarray's type/dtype object. This proposal appears to revert to the old letter codes. I have had very limited experience with C. Colin W.
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Colin J. Williams wrote:
It extends what is done in the array and struct modules of Python. The old letter codes are useful at the C level. They are 'hidden' behind an enumeration, however, and so should not be a big deal. But the letter codes are still useful in other contexts.
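Travis's point -- letter codes surviving underneath a friendlier surface -- is visible in the stdlib today; a small sketch (nothing here is part of the PEP):

```python
import array
import struct

# The struct/array-module letter codes Travis refers to.  With the '='
# prefix, struct uses standard (platform-independent) sizes, which is
# the kind of fixed mapping an enumeration can hide at the C level.
for code in 'bhilfd':
    print(code, struct.calcsize('=' + code))

# The same codes still drive the stdlib array type:
a = array.array('d', [1.0, 2.0, 3.0])
assert a.itemsize == struct.calcsize('=d')
```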
I have had very limited experience with C.
Then this proposal will not be meaningful for you. This is a proposal to extend something on the C-level. There is nothing on the Python level suggested by this proposal. -Travis
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
I'm wondering if having the buffer object specify the view is the right choice. I think the best choice is to separate the design into:

- buffer: provides an interface to memory
- array: provides a view of memory as an array of whatever dimensions

1. buffer may or may not map to contiguous memory.
2. multiple views of the same memory can be shared. These different views could represent different slicings.
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
Several extensions to Python utilize the buffer protocol to share the location of a data-buffer that is really an N-dimensional array. However, there is no standard way to exchange the additional N-dimensional array information so that the data-buffer is interpreted correctly. I am questioning whether this is the best concept. It says that the data-buffer will carry the information about its interpretation as an N-dimensional array. I'm thinking that a buffer is just an interface to memory, and that the interpretation as an array of n dimensions, for example, is best left to the application. I might want at one time to view the data as n-dimensional, but at another time as 1-dimensional, for example.
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On 1/5/07, Stefan van der Walt <stefan@sun.ac.za> wrote:
I think Neal is suggesting some object that basically does nothing but hold a pointer (or pointers) to memory. This memory can be used in various ways, one of which is to construct another type of object that provides a view with indices and such, i.e., an array. That way the memory isn't tied to arrays and could conceivably be used in other ways. The idea is analogous to the data/model/view paradigm. It is a bit cleaner than just ignoring the array parts. Chuck
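The storage/view separation Chuck describes can be sketched in a few dozen lines of Python. All the names here are illustrative -- this is a toy model of the idea, not anything from the PEP:

```python
import struct

class Memory:
    """Raw storage: nothing but a block of bytes."""
    def __init__(self, nbytes):
        self.data = bytearray(nbytes)

class ArrayView:
    """Separate object: a strided 2-D float64 interpretation of a Memory."""
    def __init__(self, mem, shape, strides, offset=0):
        self.mem, self.shape = mem, shape
        self.strides, self.offset = strides, offset

    def _pos(self, i, j):
        return self.offset + i * self.strides[0] + j * self.strides[1]

    def __getitem__(self, idx):
        return struct.unpack_from('=d', self.mem.data, self._pos(*idx))[0]

    def __setitem__(self, idx, value):
        struct.pack_into('=d', self.mem.data, self._pos(*idx), value)

mem = Memory(2 * 3 * 8)                       # storage for 6 doubles
view = ArrayView(mem, shape=(2, 3), strides=(24, 8))
view[1, 2] = 42.0

# A second view of the *same* memory, transposed just by swapping strides:
transposed = ArrayView(mem, shape=(3, 2), strides=(8, 24))
assert transposed[2, 1] == 42.0
```

The point of the exercise: the two views share one storage object, and neither the storage nor the views needed to know about each other in advance.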
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Neal Becker wrote:
The simple data-buffer interpretation is still there. You can still use the simple "chunk-of-memory"-only interpretation of the buffer. All we are doing is adding a way for applications to ask whether the object can be interpreted as a strided N-dimensional array of a particular data-format. So this proposal does nothing to jeopardize the buffer-as-an-interface-to-memory-only model. I'm only using a table of function pointers that is already there (tp_as_buffer), rather than requesting an additional table of function pointers on the type object (tp_as_array). I see the array view idea as fitting very nicely with the buffer protocol. -Travis
![](https://secure.gravatar.com/avatar/764323a14e554c97ab74177e0bce51d4.jpg?s=120&d=mm&r=g)
Neal Becker wrote:
Sure, but you need a standard way to communicate that extra information between different parts of your code and also between different third party libraries. That is what this PEP intends to provide. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/4/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
still be used to describe a complicated block of memory to another user.
Thinking of the scope "seamless data exchange between modules", my concern with this PEP is that it might be too focused on "block of memory" rather than "access to data". Data that can be interpreted as an n-dimensional array doesn't necessarily have to be represented directly as a block of memory.

Example 1: We have a very large amount of data with a compressed internal representation.
Example 2: We might want to generate data "on the fly" as it's needed.
Example 3: If module creators have to deal with different byte alignments, contiguousness etc., it'll lead to lots of code duplication and unnecessarily much work.

Is it possible to add a data-access API to this PEP? Direct memory access could be available through this API with a function that returns the memory address (or NULL if not available). We could have a default implementation for basic types, with the option for module creators to override it. The problem with this, if we stick to the buffer protocol, is that it breaks the concept "buffer is memory", if that ever was valid. This is of minor concern for me, though.
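Torgil's Example 1 -- a compressed internal representation exposed through an access function rather than a raw pointer -- can be sketched like this (all names are illustrative, and a real implementation would decompress lazily rather than per call):

```python
import struct
import zlib

# One million-ish items is overkill for a sketch; 1000 int32s suffice.
values = list(range(1000))
compressed = zlib.compress(struct.pack('=1000i', *values))

class CompressedArray:
    """Array-like object with no directly addressable memory block."""
    shape = (1000,)

    def __init__(self, blob):
        self._blob = blob

    def get_from_index(self, i):
        # Decompress on demand; a real version would cache or chunk this.
        raw = zlib.decompress(self._blob)
        return struct.unpack_from('=i', raw, i * 4)[0]

    def get_memory_layout(self):
        # The NULL case from Torgil's proposal: no direct memory access.
        return None

arr = CompressedArray(compressed)
assert arr.get_from_index(500) == 500
assert arr.get_memory_layout() is None
```

A consumer that only understands raw memory would see the `None` and fall back to item-by-item access; a memory-backed producer would return its address, strides, and flags instead.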
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Torgil Svensson wrote:
Could you give an example of what you mean? I have no problem with such a concept. I'm mainly interested in getting the NumPy memory model into Python somehow. I know it's not the "only" way to think about memory, but it is a widely used and useful way. -Travis
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/11/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Sure. I'm not objecting to the memory model; what I mean is that data access between modules has a wider scope than just a memory model. Maybe I'm completely out of scope here; I thought this was worth considering for the inter-module data-sharing scope. Say we want to access a huge array with 1 million text strings from another module that has a compressed representation in memory. Here's a pseudo-code example, with most of the details completely made up:

```c
buffer = AnotherModule_GetBigArrayAsBuffer();
aview = buffer->bf_getarrayview();
indexes = NewList();
for (i = 0; i < aview->shape[0]; ++i)
    for (j = 0; j < aview->shape[1]; ++j) {
        item = aview->get_from_index(i, j);
        /* item represents the data described by the PyDataFormatObject */
        if (is_interesting_item(item))
            ListAdd(indexes, NewList(i, j));
    }
indexarr = Numpy_ArrayFromLists(indexes);
```

Here, we don't have to care about any data-layout issues; the called module could even produce data on the fly. If I want direct memory access, we could use a function that returns data, strides, and flags.
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On 1/11/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
This is where separating the memory block from the API starts to show advantages. OTOH, we should try to keep this all as simple and basic as possible. Trying to design for every potential use will lead to over-design; it is a fine line to walk. <snip> Chuck
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/11/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
I agree. I'm trying to look after a use case of my own here, where I have a huge array (that won't fit in memory) with data that is very easy to compress (and easily fits in memory compressed). OTOH, I have as yet no need to share this between modules, but a simple data-access API opens up a variety of options. In my mindset, I can slice and dice my huge array, and the implementation behind the data-access API will choose between having the views represented internally as intervals or as lists of indexes. So I'm +1 for having all information concerning nd-array access on a logical level (shapes) in one API, and letting the memory-layout details (strides, FORTRAN, C etc.) live in another API; a module that wants to skip the API overhead (numpy) can always do something like:

```c
memory_interface = array_interface->get_memory_layout();
if (memory_interface) {
    /* ... use memory_interface->strides ... etc. */
} else {
    /* ... use array_interface->get_item_from_index() ... etc. */
}
```

I'm guessing that most modules trying to access an array will choose to go through numpy for fast operations. Another use of such an API is to do things like give an "RGB" view of an image, regardless of whatever weird image format lies below, without having to convert the whole image in memory and lose precision or memory. If we want a whole in-memory RGB copy, we could just take the RGB view, pass it to numpy, and force numpy to do a copy. The module can then, in either case, operate on the image through numpy or return a numpy object to the user. (numpy is, of course, integrated in Python by then.)
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Torgil Svensson wrote:
I think this is a good idea generally. I think the PIL would be much more open to this kind of API, because the memory model of the PIL is different from ours. On the other hand, I think it would be a shame not to provide a basic N-d array memory model like NumPy's, because it is used so often.
I had originally thought to separate these out into multiple calls anyway. Perhaps we could propose the same thing: have a full struct interface as one option, and a multiple-call interface like you propose as another.
array_interface->get_block_from_slice() ? Such a thing would be very useful for all kinds of large data-sets, from images, and videos, to scientific data-sets.
Getting this array_interface into Python goes a long way into making that happen, I think. -Travis
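The `get_block_from_slice()` idea floated above can be sketched in a few lines of Python. The method name comes from the thread; the class and everything else here are made up for illustration:

```python
class ChunkedSource:
    """A producer too large to expose as one memory block; it hands out
    one contiguous block per request instead."""
    def __init__(self, n):
        self._n = n

    def get_block_from_slice(self, start, stop):
        # Generate (or load from disk/network) only the requested block.
        return [i * i for i in range(start, min(stop, self._n))]

src = ChunkedSource(10**6)

# Consume the first 1000 items in 256-item blocks:
total = 0
for start in range(0, 1000, 256):
    block = src.get_block_from_slice(start, min(start + 256, 1000))
    total += sum(block)

assert total == sum(i * i for i in range(1000))
```

The consumer never needs more than one block in memory at a time, which is the property that makes this attractive for images, video, and large scientific data sets.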
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
I believe we are converging, and this is pretty much the same design as I advocated. It is similar to boost::ublas. Storage is one concept; interpretation of the storage is another. Numpy is a combination of a storage and an interpretation. Storage could be dense or sparse, and allocated in various ways. Sparse can be implemented in different ways. Interpretation can be 1-d or 2-d, zero-based or non-zero-based. There is also the question of ownership (slices).
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Neal Becker wrote:
I believe we are converging, and this is pretty much the same design as I advocated. It is similar to boost::ublas.
I'm grateful to hear that. It is nice when ideas come from several different corners.
How do we extend the buffer interface, then? Do we have one API that allows sharing of storage and another that handles sharing of interpretation? How much detail should be in the interface regarding storage? Is there a possibility of having at least a few storage models "shareable", so that memory can be shared by others that view the data in the same way? -Travis
![](https://secure.gravatar.com/avatar/0b7d465c9e16b93623fd6926775b91eb.jpg?s=120&d=mm&r=g)
Travis Oliphant wrote:
How about:

1. A memory concept, of which buffer is an example.
2. A view concept.
3. A variety of common concrete types composing 1+2.

So then, how do we use buffer in this scheme? I'm thinking that buffer isn't really the best thing to build on -- but within this scheme, buffer is a kind of memory (assuming it provides, or could be made to provide, the required interface). The view is not part of buffer (as was proposed) but a separate piece. Still, I agree that we want a commonly used array object that includes both the memory and the view. I propose that we build it out of these more generic pieces, but also provide commonly used compositions of them. I think this satisfies the desire for a self-describing array component, while allowing more flexibility and serving wider usage.
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/12/07, Travis Oliphant <oliphant@ee.byu.edu > wrote:
I'm concerned about the direction this PEP seems to be going. The original proposal was borderline too complicated, IMO, and now it seems headed toward more complexity. Also, it seems that three different goals are getting conflated here. None are bad, but they don't, and probably shouldn't, all be addressed by the same PEP:

1. Allowing producers and consumers of blocks of data to share blocks efficiently. This is half of what the original PEP proposed.
2. Describing complex data types at the C level. This is the other half of the PEP [1].
3. Things that act like arrays but have different storage methods. The details of this still seem pretty vague, but to the extent that I can figure them out, it doesn't seem useful or necessary to tie this into the rest of the array-interface PEP. For example, "array_interface->get_block_from_slice()" has been mentioned. Why that instead of "PyObject_AsExtendedBuffer(PyObject_GetItem(index), ....)" [2]?

I'll stop here till I see some more details of what people have in mind, but at this point I think that alternative memory models are a different problem that should be addressed separately. Sadly, I'm leaving town shortly and running out of time, so I'll have to leave my objections in this somewhat vague state.

Oh, the way that F. Lundh plans to expose PIL's data a chunk at a time is mentioned in this python-dev summary: http://www.python.org/dev/summary/2006-11-01_2006-11-15/ It doesn't seem necessary to have special support for this; all that is necessary is for the object returned by acquire_view to support the extended array protocol.

[1] Remind me again why we can't simply use ctypes for this? It's already in the core. I'm sure it's less efficient, but you shouldn't need to parse the data-structure information very often. I suspect that something that leveraged ctypes would meet less resistance.

[2] Which reminds me.
I never saw in the PEP what the actual call in the buffer protocol was supposed to look like. Is it something like:

```c
PyObject_AsExtendedBuffer(PyObject *obj, void **buffer,
                          Py_ssize_t *buffer_len,
                          funcptr *bf_getarrayview,
                          funcptr *bf_relarrayview)
```

? -- //=][=\\ tim.hochberg@ieee.org
![](https://secure.gravatar.com/avatar/3f692386259303b90d1967ad662a9eb0.jpg?s=120&d=mm&r=g)
On 1/12/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
Looks like an array, acts like an array, smells like an array = is an array.
What is an "Extended buffer" ? Connecting that to array information doesn't feel intuitive.
I agree
[1] Remind me again why we can't simply use ctypes for this?
1. ctypes is designed for "C types", not array layout.
2. Managing/creating complex formats in ctypes is considerably less clean, intuitive, and simple than with dtypes => ugly code.
3. Can ctypes handle anonymous lambda function pointers?
the core. I'm sure it's less efficient, but you shouldn't need to parse the data structure information very often.
I believe that'll be more common than you think; for example dynamically creating/combining/slicing recarrays with various data. //Torgil
![](https://secure.gravatar.com/avatar/5b2449484c19f8e037c5d9c71e429508.jpg?s=120&d=mm&r=g)
On 1/12/07, Torgil Svensson <torgil.svensson@gmail.com> wrote:
I was unclear here. I didn't mean like that it would be infrequent in "once a month" sense. I meant that you would only need to look at the data structure information once per set of data that you are accessing and that you would typically extract many chunks of data from each set, so the amortized cost of parsing the data structure would be small. Trying to get out the door.... -- //=][=\\ tim.hochberg@ieee.org
![](https://secure.gravatar.com/avatar/4d021a1d1319f36ad861ebef0eb5ba44.jpg?s=120&d=mm&r=g)
Well at least people are talking about what they would like to see. But, I think we should reign in the discussion.
I'm leaning this way too.
Yes, I agree.
Two reasons: 1) ctypes wasn't designed for this purpose specifically, and leaves out certain things; 2) ctypes uses many Python types instead of just a single Python type (the PyDataFormatObject).
No, not like that. The bf_getarrayview function pointer hangs off the as_buffer method table, which is pointed to by the type object. You could always access the API through those function pointers directly, but it is more traditional to use an API call that adds some checking to make sure the function pointer is there before calling it. I don't know if I go into a lot of detail there, but I should probably add more. PEPs are rather "expensive" for me in terms of how much immediate benefit the change to Python brings me personally versus the time spent writing them. The benefit here is much more long-term: establishing a useful data model that could be used by a lot of applications in Python to exchange data (and help ameliorate the proliferation of objects in Python that are essentially, and should be, NumPy arrays). -Travis
![](https://secure.gravatar.com/avatar/5c9fb379c4e97b58960d74dcbfc5dee5.jpg?s=120&d=mm&r=g)
Talking about the difference between the memory access model and the array API: maybe I am talking bullshit (I know next to nothing about these problems), but couldn't an efficient tree data structure be implemented on the memory buffer object? I am pretty sure a simple read-only tree could; as for a tree that is edited, I am not so sure. Anyhow, read-only trees are used a lot by some people. A lab next to mine uses them to describe results from their experiments. They store events in tree-like structures (I have been told they copied that from CERN). They can then walk through the tree in a very efficient way and do statistical analysis on their collection of events. I am not sure if this can fit anywhere in the PEP, but it would sure enlarge its scope. Please enlighten me. Gaël
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
On 1/12/07, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
Trees are nice, but they are not efficient for array-type data. Traversing a tree usually requires some sort of stack (recursion), and a tree is not well structured for addressing data using indices. They just aren't appropriate for arrays; arrays are better represented by some sort of lattice.

> Anyhow, read-only trees are used a lot by some people. A lab next to mine [...]

Probably from ROOT?

> I am not sure if this can fit anywhere in the PEP, but it would sure enlarge its scope.

There is probably a tree module somewhere for Python. Chuck
![](https://secure.gravatar.com/avatar/5c9fb379c4e97b58960d74dcbfc5dee5.jpg?s=120&d=mm&r=g)
On Fri, Jan 12, 2007 at 12:44:15AM -0700, Charles R Harris wrote:
Yes, indeed. I was just wondering if the PEP could be used for a performant implementation of trees. Basically that is mapping a tree to an array, which is possible. As far as performance goes, I think this is not performant at all when modifying the tree, but I do not know whether efficient traversal of the tree is possible once it is mapped to an array.
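For the read-only case Gaël describes, one classic tree-to-array mapping is the implicit "heap" layout: node `i`'s children sit at indices `2i+1` and `2i+2`, so traversal needs no pointers and no recursion, just index arithmetic over a flat buffer. A minimal sketch (the 7-element tree is made-up example data):

```c
/* A complete binary search tree stored in heap layout:
 * node i's children are at 2*i+1 (left) and 2*i+2 (right).
 *
 *           8
 *         /   \
 *        4     12
 *       / \   /  \
 *      2   6 10  14
 */
static const int tree[] = {8, 4, 12, 2, 6, 10, 14};
enum { TREE_N = 7 };

/* Iterative search: walk from the root by index arithmetic alone.
 * Falling off the end of the array means the key is absent. */
int tree_contains(int key)
{
    int i = 0;
    while (i < TREE_N) {
        if (tree[i] == key)
            return 1;
        i = (key < tree[i]) ? 2 * i + 1 : 2 * i + 2;
    }
    return 0;
}
```

As Gaël suspects, this layout is cheap to traverse but expensive to modify: inserting or deleting a node can force a wholesale rebuild of the array, which is why it suits read-only event data well.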
Probably from ROOT?
Yes. It seems to be nice software for such things. The problem with it is that you have to learn C++, and experience shows that not everybody in an experimental lab is willing to do so. Gaël
![](https://secure.gravatar.com/avatar/5dde29b54a3f1b76b2541d0a4a9b232c.jpg?s=120&d=mm&r=g)
Gael Varoquaux wrote:
Yes, indeed. I was just wondering if the PEP could be used for a performant implementation of trees.
That would be a whole new PEP, and one we're not the least bit ready for.
Basicaly that is mapping a tree to an array, which is possible.
Possible, but probably not very useful for dense data -- maybe for sparse arrays? The idea of an array API, rather than (actually, in addition to) an array data structure, is fabulous! It could be used for sparse arrays, for instance. I do think it's a topic for another PEP, and probably not even a PEP until we have at least some working code -- maybe a sparse array and/or PIL image?
I think a slicing API is critical -- at least at the Python level, though at the C level it would sure be nice, and it probably could allow for some good optimizations for getting a "block" of data out of some odd data structure. "Simple is better than complex." "Although practicality beats purity." "Now is better than never." That tells me we should just focus on the array data structure for the PEP now. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
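On the optimization Chris mentions: for a strided view like the proposed PyArrayViewObject, getting a "block" of data out needs no copy at all -- a slice is just a shifted data pointer plus rescaled shape and strides. A sketch with simplified, hypothetical types (`View2D`, `slice_rows`), not the PEP's actual API:

```c
#include <stddef.h>

/* A minimal 2-d view in the spirit of the proposed PyArrayViewObject:
 * just a data pointer plus per-dimension shape and strides (in bytes). */
typedef struct View2D {
    char     *data;
    ptrdiff_t shape[2];
    ptrdiff_t strides[2];
} View2D;

/* Read element (i, j) through the strides -- works for any layout. */
static int view_get(const View2D *v, ptrdiff_t i, ptrdiff_t j)
{
    return *(const int *)(v->data + i * v->strides[0]
                                  + j * v->strides[1]);
}

/* Slice rows [start:stop:step] without touching the data: shift the
 * pointer and rescale shape[0]/strides[0].  Assumes step > 0. */
static View2D slice_rows(View2D v, ptrdiff_t start, ptrdiff_t stop,
                         ptrdiff_t step)
{
    View2D out = v;
    out.data       = v.data + start * v.strides[0];
    out.shape[0]   = (stop - start + step - 1) / step;
    out.strides[0] = v.strides[0] * step;
    return out;
}

/* A 4x3 C-contiguous int array to slice (example data). */
static int grid[4][3] = {{0, 1, 2}, {10, 11, 12}, {20, 21, 22}, {30, 31, 32}};
static View2D grid_view = {
    (char *)grid, {4, 3}, {3 * sizeof(int), sizeof(int)}
};
```

For example, `slice_rows(grid_view, 1, 4, 2)` yields a 2x3 view over rows 1 and 3 of the same memory -- which is why a C-level slicing API can hand blocks of an odd data structure to a consumer without any copying.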
participants (15)
-
Charles R Harris
-
Christopher Barker
-
Colin J. Williams
-
Francesc Altet
-
Gael Varoquaux
-
Matthew Brett
-
Michael McLay
-
Neal Becker
-
Perry Greenfield
-
Robert Kern
-
Stefan van der Walt
-
Tim Hochberg
-
Timothy Hochberg
-
Torgil Svensson
-
Travis Oliphant