Re: [Numpy-discussion] Proposal - extend histograms api to allow uneven bins
Just a few thoughts re: the changes proposed in https://github.com/numpy/numpy/pull/14278

1. Though the PR is limited to the 'auto' option, the issue of potential memory problems for the automated binning methods is a more general one (e.g. #15332 <https://github.com/numpy/numpy/issues/15332>).

2. The main concern that jumps out to me is downstream users who are relying on the implicit assumption of regular binning. This is of course bad practice and makes even less sense when using one of the bin estimators, so I'm not sure how big of a concern it is. However, there is likely downstream user code that relies on the regular binning assumption, especially since, as far as I know, NumPy has never implemented binning techniques that return irregular bins.

3. The astropy project has at least one estimator that returns irregular bins <https://docs.astropy.org/en/stable/visualization/histogram.html#>. I checked for issues <https://github.com/astropy/astropy/issues?utf8=%E2%9C%93&q=is%3Aissue+histogram> related to irregular binning: though they have many of the same problems with the automatic bin estimators (i.e. memory problems for inputs with outliers), I didn't see anything specifically related to irregular binning.

I just wanted to add my two cents. The binning-data-with-outliers problem is very common in high-resolution spectroscopy, and I have seen practitioners rely on the assumption of regular binning (e.g. dividing the `range` by the number of bins) to specify bin centers, even though this is not the right way to do things.

Thanks for taking the time to write up your work!

On Mon, Feb 10, 2020 at 10:53 PM <numpy-discussion-request@python.org> wrote:
Message: 1
Date: Mon, 10 Feb 2020 18:07:40 -0800
From: Alexander Reeves <lxndr.rvs@gmail.com>
To: numpy-discussion@python.org
Subject: [Numpy-discussion] Proposal - extend histograms api to allow uneven bins
Message-ID: <CABeAeRyiVt4RJP2ew7= C4c4itJO3WZptdD6yrhWV6NgUoqT_mQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Greetings,
I have a PR that warrants discussion according to @seberg. See https://github.com/numpy/numpy/pull/14278.
It is an enhancement that fixes a bug. The original bug is that when using the 'fd' estimator on a dataset with a small inter-quartile range and large outliers, the current codebase requests more bins than memory allows. There are several related bug reports (see #11879, #10297, #8203).
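To make the failure mode concrete, here is a minimal sketch (not the code in the PR) that estimates how many bins the Freedman-Diaconis rule implies without allocating them. The helper name `fd_bin_count` and the sample data are made up for illustration; the width formula 2 * IQR / n^(1/3) is the standard statement of the rule.

```python
import numpy as np

def fd_bin_count(a):
    """Number of bins implied by the Freedman-Diaconis width (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    q75, q25 = np.percentile(a, [75, 25])
    iqr = q75 - q25
    # Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
    width = 2.0 * iqr / a.size ** (1.0 / 3.0)
    if width == 0:
        return 1  # degenerate case: no spread in the central half of the data
    return int(np.ceil((a.max() - a.min()) / width))

# Illustrative data: tightly clustered values plus a single huge outlier.
data = np.concatenate([np.linspace(0.0, 1e-3, 1000), [1e9]])
n_bins = fd_bin_count(data)
print(n_bins)  # far too many bins to allocate as an array
```

The outlier stretches the range while the IQR stays tiny, so the implied bin count explodes even though almost every bin would be empty.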
In terms of scope, I restricted my changes to conditions where np.histogram(bins='auto') ends up using the 'fd' estimator. The fix itself is an API enhancement: following a suggestion from @eric-wieser, empty histogram bins are merged. In practice this solves the problem of an excessive number of bins.
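As a rough illustration of the merging idea only (this is not the implementation in the PR, and `merge_empty_bins` is a name invented for this sketch), runs of consecutive empty bins in an already computed histogram can be collapsed into a single wide bin like so:

```python
import numpy as np

def merge_empty_bins(counts, edges):
    """Collapse each run of consecutive empty bins into one bin (illustrative sketch)."""
    counts = np.asarray(counts)
    edges = np.asarray(edges)
    empty = counts == 0
    # True at position i-1 when bin i-1 and bin i are both empty.
    both_empty = empty[:-1] & empty[1:]
    # Drop the interior edge shared by two adjacent empty bins.
    keep_edge = np.ones(edges.size, dtype=bool)
    keep_edge[1:-1] = ~both_empty
    # Keep only the first (zero) count of each run of empty bins.
    keep_count = np.ones(counts.size, dtype=bool)
    keep_count[1:] = ~both_empty
    return counts[keep_count], edges[keep_edge]

counts = np.array([3, 0, 0, 0, 2])
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
merged_counts, merged_edges = merge_empty_bins(counts, edges)
print(merged_counts)  # [3 0 2]
print(merged_edges)   # [0. 1. 4. 5.]
```

Note that this sketch starts from a histogram that already fits in memory; avoiding the huge intermediate array in the first place is the harder part and is not shown here.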
However @seberg is concerned that extending the API in this way may not be the way to go. For example, if you use "auto" once, and then re-use the bins, the uneven bins may not be what you want.
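A small sketch of what re-using uneven bins looks like in practice; the edge values and data below are invented for illustration. Deriving centers and widths from the returned edges, rather than from an assumed constant width, works for both regular and irregular bins:

```python
import numpy as np

# Hypothetical uneven edges, such as a bin-merging estimator might return.
edges = np.array([0.0, 1.0, 2.0, 10.0])

# Derive centers and widths from the edges instead of assuming equal spacing.
centers = 0.5 * (edges[:-1] + edges[1:])
widths = np.diff(edges)
is_regular = np.allclose(widths, widths[0])

# Re-use the same edges on a second dataset.
other = np.array([0.5, 1.5, 3.0, 9.0])
counts, _ = np.histogram(other, bins=edges)

# With uneven bins, raw counts are not comparable across bins;
# normalizing by each bin's width gives a density.
density = counts / (counts.sum() * widths)
print(centers, widths, is_regular, counts, density)
```

Code that checks `is_regular` (or simply always works from `edges`) keeps behaving correctly whether or not the estimator returned uniform bins.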
Furthermore, @eric-wieser is concerned that there may be a floating-point devil in the details. He advocates using the Hypothesis property-based testing package to increase our confidence that the current implementation adequately handles corner cases.
I would like to do my part in improving the code base. I don't have strong opinions but I have to admit that I would like to eventually make a PR that resolves these bugs. This has been a PR half a year in the making after all.
Thoughts?
-areeves87
participants (1)
-
Ross Barnowski