ENH: faster histograms (PR #9627)

I have received and addressed many helpful suggestions and critiques from @juliantaylor <https://github.com/juliantaylor> and @eric-wieser <https://github.com/eric-wieser> on pull request #9627 <https://github.com/numpy/numpy/pull/9627>, which moves the np.histogram() and np.histogramdd() methods into C. Speedups of 2x to 20x were realized for large sample arrays, depending on the fraction of sample points lying outside the histogramming range. For more details, see my report here <https://gist.github.com/theodoregoetz/10d2351421689bf2660b4f2fca350e6e>. I'd now like to know how to proceed with this pull request, i.e., how I can move the process along.

Additionally, I'd like to propose a new feature which I'm sure requires some discussion. The inspiration for this effort came from the fast-histogram <https://pypi.python.org/pypi/fast-histogram> Python package, which is still faster because it ignores ULP-level correctness. Towards the bottom of my report, I suggest adding a new option to the histogramming methods that skips these ULP corrections, which would put the numpy implementation on par with fast-histogram. Something like:

    np.histogram(sample, bins=10, range=(0, 10), fast=True)

which would raise an exception, or perhaps just ignore the "fast" parameter, if bins were given as a list of edges:

    np.histogram(sample, bins=[0, 1, 2, 3], fast=True)  # not fast

(A rough sketch of what such a fast path computes is at the end of this message.) I think I'd shy away from testing bin uniformity, since that is very hard to do without a specified tolerance. The user can run such a check themselves with something like this:

    np.all(np.abs(np.diff(np.diff(edges))) <=
           2**6 * np.finfo(edges.dtype).eps)

or by comparing the edges against the output of np.linspace().
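For the np.linspace() route, a minimal version of that check (my suggestion only, not something in the PR) could be the following; note that np.allclose() still hides an implicit tolerance in its default rtol/atol, which is really the same problem in disguise:

    edges = np.asarray(edges)
    ref = np.linspace(edges[0], edges[-1], len(edges))
    uniform = np.allclose(edges, ref)  # tolerance implicit in allclose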
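For concreteness, here is a minimal pure-Python sketch of what I mean by the fast path. The name fast_histogram_1d is mine for illustration; the PR's actual implementation is in C, and this sketch ignores weights, NaN handling, and other details:

    import numpy as np

    def fast_histogram_1d(sample, nbins, lo, hi):
        # Assumes hi > lo and nbins >= 1. As in np.histogram, the
        # overall range is closed on both ends (the last bin
        # includes its right edge).
        x = np.asarray(sample, dtype=np.float64)
        x = x[(x >= lo) & (x <= hi)]
        # Map each value straight to a bin index. This is the step
        # that gives up ULP-level correctness: rounding in the
        # subtract/multiply can land a value one bin away from
        # where the exact edges would put it.
        idx = ((x - lo) * (nbins / (hi - lo))).astype(np.intp)
        idx[idx == nbins] = nbins - 1  # x == hi falls in the last bin
        return np.bincount(idx, minlength=nbins)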
--
Johann