I spent some time seeing what I could do in the way of speeding up wxPoint_LIST_helper by tweaking the numarray code. My first suspect was _universalIndexing, by way of _ndarray_item. However, due to some new-style machinations, _ndarray_item was never getting called; instead, _ndarray_subscript was being called. So I added a special case to _ndarray_subscript. This sped things up by 50% or so (I don't recall exactly). The code for that is at the end of this message; it's not guaranteed to be 100% correct; it's all experimental.

After futzing around some more I figured out a way to trick Python into using _ndarray_item. I added "type->tp_as_sequence->sq_item = _ndarray_item;" to _ndarray_new. I then optimized _ndarray_item (code at end). This halved the execution time of my arbitrary benchmark. This trick may have horrible, unforeseen consequences, so use at your own risk.

Finally, I commented out the __del__ method in numarraycore. This resulted in an additional speedup of 64%, for a total speedup of 240%. Still not close to 10x, but a large improvement. This is obviously not viable for real use, but it's enough of a speedup that I'll try to see if there's any way to move the shadow stuff back to tp_dealloc.

In summary:

    Version            Time    Rel Speedup   Abs Speedup
    Stock              0.398   ----          ----
    _ndarray_item mod  0.192   107%          107%
    del __del__        0.117   64%           240%

There were a couple of other things I tried that resulted in additional small speedups, but the tactics I used were too horrible to reproduce here. The main one of interest is that all of the calls to NA_updateDataPtr seem to burn some time. However, I don't have any idea what one could do about that.

That's all for now.
-tim

```c
static PyObject*
_ndarray_subscript(PyArrayObject* self, PyObject* key)
{
    PyObject *result;
#ifdef TAH
    if (PyInt_CheckExact(key)) {
        long ikey = PyInt_AsLong(key);
        long offset;
        if (NA_getByteOffset(self, 1, &ikey, &offset) < 0)
            return NULL;
        if (!NA_updateDataPtr(self))
            return NULL;
        return _simpleIndexingCore(self, offset, 1, Py_None);
    }
#endif
#if _PYTHON_CALLBACKS
    result = PyObject_CallMethod(
        (PyObject *) self, "_universalIndexing", "(OO)", key, Py_None);
#else
    result = _universalIndexing(self, key, Py_None);
#endif
    return result;
}

static PyObject *
_ndarray_item(PyArrayObject *self, int i)
{
#ifdef TAH
    long offset;
    if (NA_getByteOffset(self, 1, &i, &offset) < 0)
        return NULL;
    if (!NA_updateDataPtr(self))
        return NULL;
    return _simpleIndexingCore(self, offset, 1, Py_None);
#else
    PyObject *result;
    PyObject *key = PyInt_FromLong(i);
    if (!key)
        return NULL;
    result = _universalIndexing(self, key, Py_None);
    Py_DECREF(key);
    return result;
#endif
}
```
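The relative and absolute speedup percentages in the summary follow directly from the raw timings; a quick sanity check in plain Python (no numarray needed):

```python
# Timings from the summary table (seconds for the same benchmark).
stock, item_mod, no_del = 0.398, 0.192, 0.117

rel_item = (stock / item_mod - 1) * 100   # _ndarray_item mod vs. stock
rel_del  = (item_mod / no_del - 1) * 100  # del __del__ vs. previous step
abs_del  = (stock / no_del - 1) * 100     # del __del__ vs. stock

print(round(rel_item), round(rel_del), round(abs_del))  # 107 64 240
```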
On Wed, 2004-06-30 at 15:57, Tim Hochberg wrote:
I'm puzzled why you had to do this. You're using Python-2.3.x, right? There's conditionally compiled code which should be doing this statically. (At least I thought so.)
Right now the sq_item hack strikes me as somewhere between completely unnecessary and too scary for me! Maybe if python-dev blessed it. This optimization looks good to me.
FYI, the issue with tp_dealloc may have to do with which mode Python is compiled in, --with-pydebug, or not. One approach which seems like it ought to work (just thought of this!) is to add an extra reference in C to the NumArray instance __dict__ (from NumArray.__init__ and stashed via a new attribute in the PyArrayObject struct) and then DECREF it as the last part of the tp_dealloc.
Francesc Alted had the same comment about NA_updateDataPtr a while ago. I tried to optimize it then but didn't get anywhere. NA_updateDataPtr() should be called at most once per extension function (more is unnecessary but not harmful) but needs to be called at least once as a consequence of the way the buffer protocol doesn't give locked pointers.
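The "no locked pointers" point can be illustrated with ctypes in modern Python. This is only a sketch of the idea, not numarray's actual mechanism: `data_ptr` below is a hypothetical helper playing the role of NA_updateDataPtr, re-fetching the buffer address each time because a previously fetched pointer may go stale after a reallocation.

```python
import ctypes

def data_ptr(b):
    """Re-fetch the current base address of a bytearray's buffer
    (stand-in for NA_updateDataPtr: the pointer must be refreshed
    because the buffer protocol gives no locked pointers)."""
    view = (ctypes.c_char * len(b)).from_buffer(b)
    addr = ctypes.addressof(view)
    del view  # release the buffer export so the bytearray may move again
    return addr

buf = bytearray(16)
p1 = data_ptr(buf)
buf.extend(b"\x00" * (1 << 20))  # a resize may reallocate and move the data
p2 = data_ptr(buf)
# p1 and p2 may differ: a pointer fetched before the resize could be stale
```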
That's all for now.
-tim
Well, be picking out your beer. Todd
--
Todd Miller wrote:
By this do you mean the "#if PY_VERSION_HEX >= 0x02030000" that is wrapped around _ndarray_item? If so, I believe that it *is* getting compiled, it's just never getting called. What I think is happening is that the class NumArray inherits its sq_item from PyClassObject. In particular, I think it picks up instance_item from Objects/classobject.c. This appears to be fairly expensive and, I think, ends up calling tp_as_mapping->mp_subscript. Thus, _ndarray's sq_item slot never gets called. All of this is pretty iffy since I don't know this stuff very well and I didn't trace it all the way through. However, it explains what I've seen thus far. This is why I ended up using the horrible hack. I'm resetting NumArray's sq_item to point to _ndarray_item instead of instance_item. I believe that access at the Python level goes through mp_subscript, so it shouldn't be affected, and only objects at the C level should notice, and they should just get the faster sq_item. You will notice that there are an awful lot of "I think"s in the above paragraphs, though...
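The slot machinery being described is easier to see at the Python level: for new-style classes, subscripting looks up __getitem__ on the *type*, never on the instance. A small modern-Python illustration (not numarray code) of why slot-level tricks like the sq_item reset work at all:

```python
class A:
    def __getitem__(self, key):
        return ("class", key)

a = A()
# An instance attribute named __getitem__ is ignored by a[...]:
# subscripting dispatches through the type's slot (mp_subscript),
# not through a normal attribute lookup on the instance.
a.__getitem__ = lambda key: ("instance", key)

print(a[3])  # ('class', 3)
```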
Yes, very scary. And it occurs to me that it will break subclasses of NumArray if they override __getitem__. When these subclasses are accessed from C they will see _ndarray's sq_item instead of the overridden __getitem__. However, I think I also know how to fix it. But it does point out that it is very dangerous and there are probably dark corners of which I'm unaware. Asking on python-list or python-dev would probably be a good idea. The nonscary, but painful, fix would be to rewrite NumArray in C.
This optimization looks good to me.
Unfortunately, I don't think the optimization to sq_item will affect much since NumArray appears to override it with
That sounds promising. [SNIP]
Well, be picking out your beer.
I was only about half right, so I'm not sure I qualify... -tim
On Wed, 2004-06-30 at 19:00, Tim Hochberg wrote:
Ugh... Thanks for explaining this.
Non-scary to whom?
I looked at this some, and while INCREFing __dict__ may be the right idea, I forgot that there *is no* Python NumArray.__init__ anymore. So the INCREF needs to be done in C without doing any getattrs; this seems to mean calling a private _PyObject_GetDictPtr function to get a pointer to the __dict__ slot, which can be dereferenced to get the __dict__.
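For illustration, that private helper can even be reached from Python through ctypes. This is only a sketch of the idea; the real fix would live in numarray's C code, and _PyObject_GetDictPtr is a private CPython API whose behavior can change between versions:

```python
import ctypes

# _PyObject_GetDictPtr(obj) returns a PyObject** pointing at the slot
# that holds obj's __dict__ (NULL if the type has no such slot).
_get_dict_ptr = ctypes.pythonapi._PyObject_GetDictPtr
_get_dict_ptr.restype = ctypes.POINTER(ctypes.py_object)
_get_dict_ptr.argtypes = [ctypes.py_object]

class Array:          # stand-in for a NumArray instance
    pass

a = Array()
a.x = 1               # touch an attribute so __dict__ is materialized

slot = _get_dict_ptr(a)
print(bool(slot))     # non-NULL: the instance has a __dict__ slot
print(slot.contents.value is a.__dict__)
```

C code holding such a pointer could INCREF the dict early and DECREF it as the last step of tp_dealloc, which is the scheme Todd describes.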
We could always reduce your wages to a 12-pack... Todd
Todd Miller wrote:
Might there be a simpler way? Since you're putting an extra attribute on the PyArrayObject structure anyway, wouldn't it be possible to just stash _shadows there instead of the reference to the dictionary? It appears that the only time _shadows is accessed from Python is in __del__. If it were instead an attribute on ndarray, the dealloc problem would go away, since the responsibility for deallocing it would fall to ndarray. Since everything else accesses it from C, that shouldn't be much of a problem and should speed that stuff up as well. -tim
On Thu, 2004-07-01 at 14:51, Tim Hochberg wrote:
_shadows is already in the struct. The root problem (I recall) is not the loss of self->_shadows, it's the loss of self->__dict__ before self can be copied onto self->_shadows. The cause of the problem appeared to me to be the tear-down order of self: the NumArray part appeared to be torn down before the _numarray part, and the tp_dealloc needs to do a Python callback where a half-destructed object just won't do. To really know what the problem is, I need to stick tp_dealloc back in and see what breaks. I'm pretty sure the problem was a missing instance __dict__, but my memory is quite fallible. Todd
On Wednesday 30 June 2004 23:47, Todd Miller wrote:
FYI, I'm still refusing to call NA_updateDataPtr() in a specific part of my code that requires as much speed as possible. It works just fine from numarray 0.5 on (numarray 0.4 gave a segmentation fault on that). However, Todd already warned me about that and told me that this is unsafe. Nevertheless, I'm using the optimization for read-only purposes (i.e. the objects are not accessible to users) over numarray objects, and that *seems* to be safe (at least I did not have a single problem after numarray 0.5). I know that I'm walking on the cutting edge, but life is dangerous anyway ;). By the way, that optimization gives me a 70% improvement during element access to NumArray elements. It would be very nice if you can finally achieve additional performance with your recent bet :). Good luck!, -- Francesc Alted
participants (3)

- Francesc Alted
- Tim Hochberg
- Todd Miller