
I am trying to replace old code (a bilateral filter) that relies on ndimage.generic_filter with the neighborhood iterator. In the old code, generic_filter generates a contiguous copy of the neighborhood, so the (Cython) code could use a plain C loop to iterate over the neighbourhood copy. In the new version, PyArrayNeighborhoodIter_Next must be called to retrieve every neighbourhood item. Rough benchmark results for bilateral filtering on a 1000x1000 array:

Old code (ndimage.generic_filter): 16.5 sec
New code (neighborhood iteration): 60.5 sec
New code with PyArrayNeighborhoodIter_Next omitted: 1.5 sec

* The last benchmark is not "real", since the omitted call is a must; it just demonstrates the iterator overhead.
* I assume the main overhead in the old code is the Python function callback process. There are instructions in the manual on how to wrap C code for a faster callback, but I would rather use the neighbourhood iterator, as I consider it more generic.

If PyArrayNeighborhoodIter_Reset could (optionally) copy the relevant data (as generic_filter does), it would provide a major speedup in many cases.

Nadav
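For reference, the old approach can be sketched in Python (a minimal sketch assuming SciPy is available; the helper names, window size, and sigma values are illustrative, not the actual code of the old filter):

```python
import numpy as np
from scipy import ndimage

def _bilateral_cb(neigh, spatial_w, sigma_r):
    # generic_filter hands the callback a flattened, contiguous copy of
    # the neighborhood, so plain vectorized (or C-loop) access works.
    center = neigh[neigh.size // 2]
    range_w = np.exp(-((neigh - center) ** 2) / (2.0 * sigma_r ** 2))
    w = spatial_w * range_w
    return np.sum(w * neigh) / np.sum(w)

def bilateral(img, size=5, sigma_s=2.0, sigma_r=0.1):
    # Precompute the spatial Gaussian weights once for the whole image.
    ax = np.arange(size) - size // 2
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    spatial_w = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma_s ** 2)).ravel()
    return ndimage.generic_filter(
        img, _bilateral_cb, size=size, mode="mirror",
        extra_arguments=(spatial_w, sigma_r))
```

The key point is that generic_filter materializes each neighborhood as a contiguous 1-D copy before invoking the callback, which is exactly the behavior being requested from the neighborhood iterator.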

On Mon, Oct 24, 2011 at 6:57 AM, Nadav Horesh <nadavh@visionsense.com> wrote:
I am trying to replace old code (a bilateral filter) that relies on ndimage.generic_filter with the neighborhood iterator. In the old code, generic_filter generates a contiguous copy of the neighborhood, so the (Cython) code could use a plain C loop to iterate over the neighbourhood copy. In the new version, PyArrayNeighborhoodIter_Next must be called to retrieve every neighbourhood item. Rough benchmark results for bilateral filtering on a 1000x1000 array: old code (ndimage.generic_filter): 16.5 sec; new code (neighborhood iteration): 60.5 sec; new code with PyArrayNeighborhoodIter_Next omitted: 1.5 sec.
* The last benchmark is not "real", since the omitted call is a must; it just demonstrates the iterator overhead.
* I assume the main overhead in the old code is the Python function callback process. There are instructions in the manual on how to wrap C code for a faster callback, but I would rather use the neighbourhood iterator, as I consider it more generic.
I am afraid the cost is unavoidable: you are really trading CPU for memory. When using PyArrayNeighborhoodIter_Next, there is a loop with a conditional inside, and I don't think those can easily be avoided without losing genericity. Which mode are you using when creating the neighborhood iterator? There used to be a PyArrayNeighborhoodIter_Next2d; I don't know why I commented it out. You could try it to see if you can get faster.
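The loop-with-a-conditional cost can be illustrated with a pure-Python sketch of one common mirror (symmetric) boundary convention (illustrative only; not necessarily NumPy's exact reflection rule):

```python
def mirror_index(i, n):
    # Reflect an out-of-bounds index back into [0, n) at the array
    # edges. This branchy loop has to run for every neighborhood
    # access that may fall outside the array, which is the per-item
    # cost being discussed.
    while i < 0 or i >= n:
        if i < 0:
            i = -i - 1
        if i >= n:
            i = 2 * n - i - 1
    return i
```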
If PyArrayNeighborhoodIter_Reset could (optionally) copy the relevant data (as generic_filter does), it would provide a major speedup in many cases.
Optionally copying may be an option, but it would make more sense to do it at creation time than during reset, no? Something like a binary AND with the current mode flag.

cheers,

David

* Iterator mode: Mirror. Does the mode make a huge difference?
* I cannot find any reference to PyArrayNeighborhoodIter_Next2d; where can I find it?
* I think that making the copy on reset is required (maybe in addition to at creation time), since there is a reset for every change of the parent iterator, and only after this change can the neighborhood be determined.
* What do you think about the following idea?
  * A neighbourhood iterator constructor that also accepts a buffer to copy the neighbourhood into.
  * A reset function that refills the buffer after each parent iterator modification.

Nadav

On Mon, Oct 24, 2011 at 10:48 AM, Nadav Horesh <nadavh@visionsense.com> wrote:
* Iterator mode: Mirror. Does the mode make a huge difference?
It could, at least in principle. The underlying translate function is called often enough that a slight difference can be significant.
* I cannot find any reference to PyArrayNeighborhoodIter_Next2d; where can I find it?
I think it would look like:

    static NPY_INLINE int
    PyArrayNeighborhoodIter_Next2d(PyArrayNeighborhoodIterObject* iter)
    {
        _PyArrayNeighborhoodIter_IncrCoord2d(iter);
        iter->dataptr = iter->translate((PyArrayIterObject*)iter,
                                        iter->coordinates);
        return 0;
    }

The _PyArrayNeighborhoodIter_IncrCoord2d macro avoids one loop, which may be useful (or not). The big issue here is the translate method call, which cannot be inlined because of the "polymorphism" of the neighborhood iterator. The only way to avoid this would be to have many different iterators so that the underlying translate function is known. Copying the data makes the call to translate unnecessary (but adds the penalty of one more conditional on every PyArrayNeighborhoodIter_Next).
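The dispatch being described can be modeled in Python: the iterator stores a translate strategy chosen at creation time, so every access goes through an indirect call that a compiler cannot inline (the names below are illustrative, not the NumPy C API):

```python
import numpy as np

def translate_zero(arr, i, j):
    # Zero-padding mode: out-of-bounds reads yield 0. Every access
    # pays this bounds check plus the indirect call that selected it.
    h, w = arr.shape
    if 0 <= i < h and 0 <= j < w:
        return arr[i, j]
    return arr.dtype.type(0)

def make_iterator(arr, translate):
    # The "iterator" simply closes over the translate strategy chosen
    # at creation time; the callee is unknown at the call site, which
    # is what prevents inlining in the C version.
    def get(i, j):
        return translate(arr, i, j)
    return get
```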
* I think that making the copy on reset is required (maybe in addition to at creation time), since there is a reset for every change of the parent iterator, and only after this change can the neighborhood be determined.
You're right of course, I forgot about the parent iterator.
* What do you think about the following idea?
  * A neighbourhood iterator constructor that also accepts a buffer to copy the neighbourhood into.
  * A reset function that refills the buffer after each parent iterator modification.
The issue with giving the buffer is that one needs to be careful about the size and all. What's your use case for passing the buffer?

David
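In Python terms, the proposed buffer-and-refill API might look like the following sketch (the class and method names are hypothetical, and it uses np.pad, which is newer than the NumPy versions discussed in this thread):

```python
import numpy as np

class BufferedNeighborhood:
    # Sketch of the proposal: the constructor either allocates a
    # contiguous buffer or accepts one from the caller, and reset()
    # refills it for the current parent position. After reset(), a
    # plain C-style loop over `buf` needs no translate calls.
    def __init__(self, arr, size, buf=None):
        self.pad = size // 2
        # Pad once up front so reset() is a contiguous block copy.
        self.padded = np.pad(arr, self.pad, mode="symmetric")
        self.size = size
        self.buf = np.empty((size, size), arr.dtype) if buf is None else buf

    def reset(self, i, j):
        # (i, j) is the parent iterator's position in the original array.
        self.buf[...] = self.padded[i:i + self.size, j:j + self.size]
        return self.buf
```

Accepting an ndarray (rather than a raw pointer) would allow the constructor to sanity-check the buffer's size, dtype, and contiguity, as suggested later in the thread.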

* I'll try to implement the 2D iterator as far as my programming expertise goes. It might take a few days.
* There is a risk in providing a buffer pointer, and for my (and probably most) use cases it is better for the iterator constructor to provide it. I was thinking about the possibility of giving the iterator a shared-memory pointer, to open a door for multiprocessing. Maybe it is better instead to provide a contiguous ndarray object, to enable a sanity check.

Nadav

On Mon, Oct 24, 2011 at 1:23 PM, Nadav Horesh <nadavh@visionsense.com> wrote:
* I'll try to implement the 2D iterator as far as my programming expertise goes. It might take a few days.
I am pretty sure the code is in the git history, if you are patient enough to look for it. I can't remember why I removed it (maybe because it was not faster?).
* There is a risk in providing a buffer pointer, and for my (and probably most) use cases it is better for the iterator constructor to provide it. I was thinking about the possibility of giving the iterator a shared-memory pointer, to open a door for multiprocessing. Maybe it is better instead to provide a contiguous ndarray object, to enable a sanity check.
One could ask for an optional buffer (if NULL -> auto-allocation). But I would need a more detailed explanation of what you are trying to do to warrant changing the API here.

cheers,

David

My use case is a bilateral filter: a convolution-like filter used mainly in image processing, which may use relatively large convolution kernels (on the order of 50x50). I would like to run the inner loop (iteration over the neighbourhood) with direct indexing (in Cython code) rather than the slow iterator, in order to save time. A separate issue is Cython's new parallel loop, which raises the need for GIL-free numpy iterators (I might be wrong though). Anyway, it is not urgent for me.

Nadav
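The pad-once-then-index-directly idea can be sketched with modern NumPy (sliding_window_view requires NumPy >= 1.20, far newer than the versions in this thread; function name and parameters are illustrative):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def bilateral_windows(img, size=5, sigma_s=2.0, sigma_r=0.1):
    # Pad once with mirror boundaries, then expose every neighborhood
    # as a contiguous (size x size) view; the inner "loop" becomes
    # plain array indexing instead of per-item iterator calls.
    pad = size // 2
    padded = np.pad(img, pad, mode="symmetric")
    win = sliding_window_view(padded, (size, size))  # (H, W, size, size)

    ax = np.arange(size) - pad
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    spatial = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma_s ** 2))

    center = img[:, :, None, None]
    range_w = np.exp(-((win - center) ** 2) / (2.0 * sigma_r ** 2))
    w = spatial * range_w
    return (w * win).sum(axis=(-2, -1)) / w.sum(axis=(-2, -1))
```

Note that for 50x50 kernels this fully vectorized form allocates large temporaries (H x W x 2500 elements), which is the CPU-for-memory tradeoff David mentioned earlier.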

I found the 2d iterator definition active in numpy 1.6.1. I'll test it.

Nadav

Finally managed to use PyArrayNeighborhoodIter_Next2d with numpy 1.5.0 (in numpy 1.6 it doesn't get along with halffloat). Benchmark results (not the same computer and parameters I used in the previous benchmark):

1. ...Next2d (zero padding; it doesn't accept mirror padding): 10 sec
2. ...Next (zero padding): 53 sec
3. ...Next (mirror padding): 128 sec

Remarks:
1. I did not check the validity of the results.
2. Mirror padding is preferable for my specific case.

What does this mean for the potential acceleration of the neighbourhood iterator?

Nadav
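The gap between mirror-mode iteration and a copy-based approach can be reproduced qualitatively in pure Python with a simple mean filter (function names are illustrative; both variants compute identical results, but the first does the reflection branching on every element, like ...Next with mirror padding):

```python
import numpy as np

def mean_per_access(img, size):
    # Models ...Next with mirror padding: every neighborhood element
    # goes through reflection/bounds logic on access.
    pad = size // 2
    h, w = img.shape
    def mirror(k, n):
        while k < 0 or k >= n:
            if k < 0:
                k = -k - 1
            if k >= n:
                k = 2 * n - k - 1
        return k
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            s = 0.0
            for di in range(-pad, pad + 1):
                for dj in range(-pad, pad + 1):
                    s += img[mirror(i + di, h), mirror(j + dj, w)]
            out[i, j] = s / size ** 2
    return out

def mean_buffered(img, size):
    # Models the copy-based approach: pad once, then read each
    # neighborhood as a contiguous block with plain indexing.
    pad = size // 2
    padded = np.pad(img, pad, mode="symmetric")
    h, w = img.shape
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + size, j:j + size].mean()
    return out
```

Timing the two variants on a modest image shows the per-access version losing by a wide margin, for the same structural reason as the 128 sec vs 10 sec figures above.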
participants (2)
- David Cournapeau
- Nadav Horesh