Rewrite np.histogram in c?

Hi,

numpy.histogram is implemented in Python, and is a little sluggish. This has been discussed previously on the mailing list [1, 2]. It came up in a project that I maintain, where a new feature is bottlenecked by numpy.histogram, and one developer suggested a faster implementation in Cython [3].

Would it make sense to reimplement this function in C or Cython? Is moving functions like this from Python to C to improve performance within the scope of the development roadmap for NumPy? I started implementing this a little bit in C [4], but I figured I should check in here first.

-Robert

[1] http://scipy-user.10969.n7.nabble.com/numpy-histogram-is-slow-td17208.html
[2] http://numpy-discussion.10968.n7.nabble.com/Fast-histogram-td9359.html
[3] https://github.com/mdtraj/mdtraj/pull/734
[4] https://github.com/rmcgibbo/numpy/tree/histogram

On Sun, Mar 15, 2015 at 9:32 PM, Robert McGibbon <rmcgibbo@gmail.com> wrote:
Hi,
Numpy.histogram is implemented in python, and is a little sluggish. This has been discussed previously on the mailing list, [1, 2]. It came up in a project that I maintain, where a new feature is bottlenecked by numpy.histogram, and one developer suggested a faster implementation in cython [3].
Would it make sense to reimplement this function in c? or cython? Is moving functions like this from python to c to improve performance within the scope of the development roadmap for numpy? I started implementing this a little bit in c, [4] but I figured I should check in here first.
Where do you think the performance gains will come from? The PR in your project that claims a 10x speed-up uses a method that is only fit for equally spaced bins. I want to think that implementing that exact same algorithm in Python with NumPy would be comparably fast, say within 2x.

For the general case, NumPy is already doing most of the heavy lifting (the sorting and the searching) in C: simply replicating the same algorithmic approach entirely in C is unlikely to provide any major speed-up. And if the change is to the algorithm, then we should first try it out in Python.

That said, if you can speed things up 10x, I don't think there is going to be much opposition to moving it to C!

Jaime

--
(\__/)
( O.o)
( > <)
This is Rabbit. Copy Rabbit into your signature and help him with his plans for world domination.
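A minimal sketch of the equal-width approach discussed above, written in pure NumPy: each value's bin index is computed arithmetically and the indices are counted with `np.bincount`, so no sorting or searching is needed. The name `uniform_histogram` and its signature are assumptions made for illustration, not code from the thread or from NumPy, and values within rounding error of a bin edge may be counted differently than by `np.histogram`.

```python
import numpy as np


def uniform_histogram(a, bins, value_range):
    """Count the values of `a` in `bins` equal-width bins over `value_range`.

    Hypothetical helper for illustration only; it always works in float64.
    """
    lo, hi = value_range
    a = np.asarray(a, dtype=float).ravel()
    # Keep only values inside the closed interval [lo, hi].
    a = a[(a >= lo) & (a <= hi)]
    # Map each value directly to a bin index.
    idx = ((a - lo) * (bins / (hi - lo))).astype(np.intp)
    # Values equal to `hi` belong to the last bin, which is closed on the right.
    idx[idx == bins] = bins - 1
    return np.bincount(idx, minlength=bins)


data = [0.0, 0.5, 1.5, 9.5, 10.0, 11.0, -1.0]
counts = uniform_histogram(data, 10, (0.0, 10.0))
reference, _ = np.histogram(data, bins=10, range=(0.0, 10.0))
print(counts)
print(reference)
```

Whether this gets within 2x of a compiled version would have to be measured, for example with `timeit`, on realistic input sizes.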

It might make sense to dispatch to different C implements if the bins are equally spaced (as created by using an integer for the np.histogram bins argument), vs. non-equally-spaced bins.

In that case, getting the bigger speedup may be easier, at least for one common use case.

-Robert

On Sun, Mar 15, 2015 at 11:00 PM, Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
[...]
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

My apologies for the typo: 'implements' -> 'implementations'

-Robert

On Sun, Mar 15, 2015 at 11:06 PM, Robert McGibbon <rmcgibbo@gmail.com> wrote:
[...]

On Sun, Mar 15, 2015 at 11:06 PM, Robert McGibbon <rmcgibbo@gmail.com> wrote:
It might make sense to dispatch to difference c implements if the bins are equally spaced (as created by using an integer for the np.histogram bins argument), vs. non-equally-spaced bins.
Dispatching to a different method seems like a no-brainer indeed. The question is whether we really need to do this in C.

Maybe for some very specific case or cases it makes sense to have a super fast C path, e.g. no weights and bins is an integer. Even then, rather than rewriting the whole thing in C, it may be a better idea to leave the parsing of the inputs in Python, and have a C helper function wrapped and privately exposed, similarly to how `np.core.multiarray.interp` is used by `np.interp`.

But I would still first give it a try in Python...

Jaime
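The split described above, with input parsing in Python and a small private kernel doing the counting, might look roughly like the following. This is a sketch under assumptions: `histogram_dispatch` and `_count_uniform` are hypothetical names, the kernel is plain NumPy standing in for whatever might later be written in C, and every case outside the narrow fast path simply falls back to `np.histogram`.

```python
import numpy as np


def _count_uniform(a, n_bins, lo, hi):
    # Hypothetical private kernel: the only part that would be a candidate for C.
    a = a[(a >= lo) & (a <= hi)]
    idx = ((a - lo) * (n_bins / (hi - lo))).astype(np.intp)
    idx[idx == n_bins] = n_bins - 1  # the right edge falls in the last bin
    return np.bincount(idx, minlength=n_bins)


def histogram_dispatch(a, bins=10, weights=None):
    """Parse the inputs in Python, then choose a code path."""
    a = np.asarray(a).ravel()
    # Fast path: no weights, integer `bins`, and a non-degenerate data range.
    if weights is None and isinstance(bins, (int, np.integer)) and a.size > 0:
        lo, hi = float(a.min()), float(a.max())
        if hi > lo:
            edges = np.linspace(lo, hi, bins + 1)
            return _count_uniform(a.astype(float), bins, lo, hi), edges
    # General path: weights, explicit bin edges, or degenerate ranges.
    return np.histogram(a, bins=bins, weights=weights)


values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(histogram_dispatch(values, bins=4))          # fast path
print(histogram_dispatch(values, bins=[0, 2, 4]))  # general path
```

Keeping the argument handling in Python means the kernel only ever sees a flat float array and three scalars, which is what would keep a compiled version small.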

On Mon, 16 Mar 2015 06:56:58 -0700 Jaime Fernández del Río <jaime.frio@gmail.com> wrote:
Dispatching to a different method seems like a no brainer indeed. The question is whether we really need to do this in C.
I need to do both unweighted & weighted histograms and we got a factor of 5 using (simple) Cython: it is in the proceedings of EuroSciPy, last year. http://arxiv.org/pdf/1412.6367.pdf

We got much faster, but that's another story.

In fact, many people coming from IDL or MATLAB are surprised by the poor performance of NumPy's histogram.

Cheers
--
Jérôme Kieffer
tel +33 476 882 445

On Mon, Mar 16, 2015 at 9:28 AM, Jerome Kieffer <Jerome.Kieffer@esrf.fr> wrote:
I need to do both unweighted & weighted histograms and we got a factor 5 using (simple) cython: it is in the proceedings of Euroscipy, last year. http://arxiv.org/pdf/1412.6367.pdf
If I read your paper and code properly, you got 5x faster, mostly because you combined the weighted and unweighted histograms into a single search of the array, and because you used an algorithm that can only be applied to equal-sized bins, similarly to the 10x speed-up Robert was reporting.

I think that having a special path for equal-sized bins is a great idea: let's do it, PRs are always welcome! Similarly, getting the counts together with the weights seems like a very good idea.

I also think that writing it in Python is going to take us 80% of the way there: most of the improvements both of you have reported are not likely to be coming from the language chosen, but from the algorithm used. And if C proves to be sufficiently faster to warrant using it, it should be confined to the number crunching: I don't think there is any point in rewriting argument parsing in C.

Also, keep in mind `np.histogram` can now handle arrays of just about **any** dtype. Handling that complexity in C is not a walk in the park. Other functions like `np.bincount` and `np.digitize` cheat by only handling `double` typed arrays, a luxury that histogram probably can't afford at this point in time.

Jaime
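The idea of getting the counts together with the weighted sums can be sketched in NumPy by computing the bin indices once and reusing them for two `np.bincount` calls. The function name `counts_and_weighted_sums` is an assumption for illustration. Note that this sketch casts everything to `float64`, which is exactly the shortcut described above as something `np.histogram` itself probably cannot take.

```python
import numpy as np


def counts_and_weighted_sums(a, weights, bins, value_range):
    """Return (counts, weighted_sums) for equal-width bins.

    Hypothetical helper for illustration; inputs are cast to float64.
    """
    lo, hi = value_range
    a = np.asarray(a, dtype=float).ravel()
    weights = np.asarray(weights, dtype=float).ravel()
    keep = (a >= lo) & (a <= hi)
    # The bin index of every value is computed once and shared by both results.
    idx = ((a[keep] - lo) * (bins / (hi - lo))).astype(np.intp)
    idx[idx == bins] = bins - 1  # the right edge falls in the last bin
    counts = np.bincount(idx, minlength=bins)
    sums = np.bincount(idx, weights=weights[keep], minlength=bins)
    return counts, sums


values = [0.5, 0.5, 1.5, 3.0]
w = [1.0, 2.0, 3.0, 4.0]
counts, sums = counts_and_weighted_sums(values, w, 3, (0.0, 3.0))
print(counts, sums)
```

With `np.histogram` the same two results need two separate calls, one with `weights` and one without, so the bin lookup is done twice.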

Hi,

It sounds like putting together a PR makes sense then. I'll try hacking on this a bit.

-Robert

On Mar 16, 2015 11:20 AM, "Jaime Fernández del Río" <jaime.frio@gmail.com> wrote:
[...]

Hope this isn't too off-topic, but it would be very nice if np.histogram and np.histogram2d supported masked arrays. Is this out of scope for functions outside the numpy.ma package?

On Mon, Mar 16, 2015 at 2:35 PM, Robert McGibbon <rmcgibbo@gmail.com> wrote:
[...]

On Mon, Mar 23, 2015 at 2:59 PM, Daniel da Silva <var.mail.daniel@gmail.com> wrote:
Hope this isn't too off-topic: but it would be very nice if np.histogram and np.histogram2d supported masked arrays. Is this out of scope for outside the numpy.ma package?
Right now it looks like there's no histogram function at all for masked arrays - would be good to improve that situation.

If it's as easy as adding to np.histogram something like:

    if isinstance(a, np.ma.MaskedArray):
        a = a.data[~a.mask]

then it makes sense to add that I think.

Ralf

On 2015/03/23 7:36 AM, Ralf Gommers wrote:
Right now it looks like there's no histogram function at all for masked arrays - would be good to improve that situation.

If it's as easy as adding to np.histogram something like:

    if isinstance(a, np.ma.MaskedArray):
        a = a.data[~a.mask]

then it makes sense to add that I think.

Ralf

It looks like it requires a little more than that, but not much. For full support a new mask would need to be made from the logical_or of the "a" mask and the weights mask, and then used to compress both "a" and weights.

Eric
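The combined-mask handling described above can be sketched as a thin wrapper around `np.histogram`. The name `masked_histogram` is hypothetical and no such function is claimed to exist in NumPy. The sketch uses `np.ma.getmaskarray` rather than the `.mask` attribute so that plain arrays, and masked arrays with no mask set, are handled the same way.

```python
import numpy as np


def masked_histogram(a, bins=10, weights=None, **kwargs):
    """Histogram that skips entries masked in `a` or in `weights`.

    Hypothetical wrapper for illustration only.
    """
    mask = np.ma.getmaskarray(a).ravel()
    data = np.ma.getdata(a).ravel()
    if weights is not None:
        # An entry is dropped if it is masked in either array.
        mask = np.logical_or(mask, np.ma.getmaskarray(weights).ravel())
        weights = np.ma.getdata(weights).ravel()[~mask]
    return np.histogram(data[~mask], bins=bins, weights=weights, **kwargs)


a = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[False, True, False, False])
w = np.ma.array([1.0, 1.0, 5.0, 1.0], mask=[False, False, True, False])
print(masked_histogram(a, bins=[0, 2.5, 5]))
print(masked_histogram(a, bins=[0, 2.5, 5], weights=w))
```

The same body could equally live in a function under `numpy.ma` rather than inside `np.histogram`; the sketch takes no position on where it belongs.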

On Mar 23, 2015 6:59 AM, "Daniel da Silva" <var.mail.daniel@gmail.com> wrote:
Hope this isn't too off-topic: but it would be very nice if np.histogram and np.histogram2d supported masked arrays. Is this out of scope for outside the numpy.ma package?

Usually the way this kind of thing is handled is by adding an np.ma.histogram function.

-n
participants (7)
- Daniel da Silva
- Eric Firing
- Jaime Fernández del Río
- Jerome Kieffer
- Nathaniel Smith
- Ralf Gommers
- Robert McGibbon