[Numpy-discussion] Proposal - extend histograms api to allow uneven bins

Alexander Reeves lxndr.rvs at gmail.com
Mon Feb 10 21:07:40 EST 2020


Greetings,

I have a PR that warrants discussion according to @seberg. See
https://github.com/numpy/numpy/pull/14278.

It is an enhancement that fixes a bug. The original bug is that when the
'fd' (Freedman-Diaconis) estimator is used on a dataset with a small
inter-quartile range and large outliers, the estimated bin width is tiny
relative to the data range, so the current codebase asks for more bins than
memory allows. There are several related bug reports (see #11879, #10297,
#8203).
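
To make the failure mode concrete, here is a minimal sketch of the
Freedman-Diaconis arithmetic (bin width = 2 * IQR / n^(1/3)). It only computes
the number of bins the rule would ask for and never allocates them. The helper
name fd_bin_count and the sample data are hypothetical, chosen for
illustration; this is not the code path NumPy itself runs.

```python
import numpy as np

def fd_bin_count(data):
    """Estimate how many bins the Freedman-Diaconis rule would ask for."""
    data = np.asarray(data, dtype=float)
    q25, q75 = np.percentile(data, [25, 75])
    iqr = q75 - q25
    if iqr == 0:
        # Assumption for this sketch only: treat a degenerate IQR as one bin.
        return 1
    width = 2.0 * iqr / data.size ** (1.0 / 3.0)
    span = data.max() - data.min()
    return max(1, int(np.ceil(span / width)))

# 1000 tightly clustered values plus a single large outlier.
clustered = np.linspace(0.0, 1e-3, 1000)
data = np.append(clustered, 1e9)

n_bins = fd_bin_count(data)
print(f"bins requested with the outlier: {n_bins:.3e}")
print(f"bins requested without it: {fd_bin_count(clustered)}")
```

The outlier barely moves the inter-quartile range, so the bin width stays
tiny while the data range grows by many orders of magnitude.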

In terms of scope, I restricted my changes to conditions where
np.histogram(bins='auto') defaults to the 'fd' estimator. For the fix
itself, I enhanced the API: I used a suggestion from @eric-wieser to merge
empty histogram bins. In practice this solves the problem of excessive bin
counts.
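
The idea of merging empty bins can be sketched on a small, already computed
histogram. The function below is a hypothetical illustration (the name
merge_empty_bins and the approach are assumptions, not the implementation in
the PR): it collapses each run of consecutive empty bins into a single wider
empty bin, which yields uneven bin edges.

```python
import numpy as np

def merge_empty_bins(counts, edges):
    """Collapse each run of consecutive empty bins into one wider bin."""
    counts = np.asarray(counts)
    edges = np.asarray(edges)
    empty = counts == 0

    # Interior edge i separates bin i-1 from bin i. Drop it only when
    # both of those bins are empty, so the two bins fuse into one.
    keep = np.ones(edges.size, dtype=bool)
    keep[1:-1] = ~(empty[:-1] & empty[1:])
    new_edges = edges[keep]

    # Each kept edge (except the last) starts a new bin; sum the old
    # counts that fall into it.
    starts = np.flatnonzero(keep[:-1])
    new_counts = np.add.reduceat(counts, starts)
    return new_counts, new_edges

counts = np.array([3, 0, 0, 0, 2, 0, 1])
edges = np.arange(8.0)  # eight edges for seven unit-width bins
merged_counts, merged_edges = merge_empty_bins(counts, edges)
print(merged_counts)  # [3 0 2 0 1]
print(merged_edges)   # [0. 1. 4. 5. 6. 7.]
```

Note that this sketch operates on an existing histogram; avoiding the huge
allocation in the first place would need the merge to happen before the
counts array is created.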

However @seberg is concerned that extending the API in this way may not be
the way to go. For example, if you use "auto" once, and then re-use the
bins, the uneven bins may not be what you want.
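
A small sketch of why re-used uneven bins can surprise: np.histogram accepts
an explicit array of edges, but raw counts are then no longer comparable
across bins of different width. The edges and data below are made up for
illustration.

```python
import numpy as np

# Hypothetical uneven edges, e.g. returned by an earlier "auto" call.
uneven_edges = np.array([0.0, 1.0, 4.0, 5.0])

# A second, evenly spread dataset binned with the re-used edges.
other = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

counts, _ = np.histogram(other, bins=uneven_edges)
density, _ = np.histogram(other, bins=uneven_edges, density=True)

print(counts)   # [1 3 1]  the wide middle bin looks like a peak
print(density)  # [0.2 0.2 0.2]  dividing by bin width shows it is flat
```

With density=True the counts are normalized by bin width, which is one way a
caller could compensate, but code that assumes equal-width bins would still
misread the raw counts.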

Furthermore, @eric-wieser is concerned that there may be a floating-point
devil in the details. He advocates using the Hypothesis property-based
testing package to increase our confidence that the current implementation
adequately handles corner cases.

I would like to do my part in improving the code base. I don't have strong
opinions but I have to admit that I would like to eventually make a PR that
resolves these bugs. This has been a PR half a year in the making after all.

Thoughts?

-areeves87