[Numpy-discussion] ENH: faster histograms (PR #9627)

Sat Sep 9 12:34:28 EDT 2017

I have received and addressed many great suggestions and critiques from
@juliantaylor <https://github.com/juliantaylor> and @eric-wieser
<https://github.com/eric-wieser> for pull request #9627
<https://github.com/numpy/numpy/pull/9627>, which moves the np.histogram()
and np.histogramdd() functions into C. Speedups of 2x to 20x were realized
for large samples, depending on the percentage of sample points that lie
outside the histogramming range. For more details, see my report here
<https://gist.github.com/theodoregoetz/10d2351421689bf2660b4f2fca350e6e>.

I'd like to know how to proceed with this pull request, i.e., how I can
move the process along.

Additionally, I'd like to propose a new feature which I'm sure requires
some discussion:

The inspiration for this effort came from the fast-histogram
<https://pypi.python.org/pypi/fast-histogram> Python package, which is still
faster because it ignores ULP-level correctness. Towards the bottom of my
report, I suggest adding a new option to the histogramming functions to
skip the ULP corrections, which would put the numpy implementation on par
with fast-histogram's. Something like:

    np.histogram(sample, bins=10, range=(0, 10), fast=True)

which would perhaps raise an exception, or ignore the "fast" parameter, if
bins were given as a list of edges:

    np.histogram(sample, bins=[0,1,2,3], fast=True)  # not fast.
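
To illustrate what skipping the ULP corrections means, here is a minimal
pure-NumPy sketch of uniform binning by direct index computation. It is not
the C code in the pull request, nor fast-histogram's implementation; the
names fast_uniform_histogram and value_range are hypothetical and only serve
the example.

```python
import numpy as np

def fast_uniform_histogram(sample, bins, value_range):
    """Count samples in `bins` equal-width bins over value_range = (lo, hi).

    Hypothetical helper: the bin index is computed arithmetically and is
    never checked against the exact bin edges, so a sample lying within a
    few ULPs of an edge may land in the neighbouring bin.
    """
    lo, hi = value_range
    x = np.asarray(sample, dtype=float).ravel()
    # Drop samples outside the closed interval [lo, hi].
    x = x[(x >= lo) & (x <= hi)]
    # Direct index computation, with no edge correction afterwards.
    idx = ((x - lo) * (bins / (hi - lo))).astype(np.intp)
    # Samples equal to hi belong to the last bin, as in np.histogram.
    idx = np.minimum(idx, bins - 1)
    return np.bincount(idx, minlength=bins)

sample = np.array([0.5, 1.5, 1.5, 9.99, 10.0, -1.0, 11.0])
counts = fast_uniform_histogram(sample, bins=10, value_range=(0, 10))
```

For well-separated values like these, the counts agree with
np.histogram(sample, bins=10, range=(0, 10))[0]; differences can only show
up for samples that sit extremely close to a bin edge.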

I think I'd shy away from testing the bin-uniformity since it is very hard
to do without a specified tolerance. This can be done by the user with
something like this:

    np.all(np.abs(np.diff(np.diff(edges))) <=
           2**6 * np.finfo(edges.dtype).eps)

Or by comparison with the output of np.linspace().
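
A rough sketch of the np.linspace() comparison, under the assumption that a
tolerance of a fixed number of ULPs scaled by the largest edge magnitude is
acceptable. The helper name edges_look_uniform and the default of 64 ULPs
are arbitrary choices for illustration, not something proposed for numpy.

```python
import numpy as np

def edges_look_uniform(edges, ulps=64):
    """Return True if `edges` matches np.linspace over the same span.

    Hypothetical user-side check; `ulps` sets the tolerance in units of
    machine epsilon scaled by the largest absolute edge value.
    """
    edges = np.asarray(edges, dtype=float)
    # Evenly spaced reference edges with the same endpoints and length.
    expected = np.linspace(edges[0], edges[-1], len(edges))
    tol = ulps * np.finfo(edges.dtype).eps * np.max(np.abs(edges))
    return bool(np.all(np.abs(edges - expected) <= tol))

uniform = edges_look_uniform(np.linspace(0, 10, 11))
nonuniform = edges_look_uniform([0, 1, 2, 4])
```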
--
Johann.