Just a few thoughts re: the changes proposed in
https://github.com/numpy/numpy/pull/14278
1. Though the PR is limited to `bins='auto'`, the potential for memory
problems with the automated binning methods is more general
(e.g. #15332 https://github.com/numpy/numpy/issues/15332).
2. The main concern that jumps out to me is downstream users who are
relying on the implicit assumption of regular binning. This is of course
bad practice and makes even less sense when using one of the bin
estimators, so I'm not sure how big of a concern it is. However, there is
likely downstream user code that relies on the regular binning assumption,
especially since, as far as I know, NumPy has never implemented binning
techniques that return irregular bins.
3. The astropy project has at least one estimator that returns irregular
bins https://docs.astropy.org/en/stable/visualization/histogram.html#. I
checked for issues
https://github.com/astropy/astropy/issues?utf8=%E2%9C%93&q=is%3Aissue+histogram
related to irregular binning: though they have many of the same problems
with the automatic bin estimators (i.e. memory problems for inputs with
outliers), I didn't see anything specifically related to irregular binning.
I just wanted to add my two cents. The binning-data-with-outliers problem
is very common in high-resolution spectroscopy, and I have seen
practitioners rely on the assumption of regular binning (e.g. divide the
`range` by the number of bins) to specify bin centers even though this is
not the right way to do things.
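To illustrate the bin-center point, here is a minimal sketch comparing centers derived from the range with centers derived from the returned edges. The data and the irregular edges are made up for illustration; they stand in for whatever an irregular-bin estimator might return.

```python
import numpy as np

# Hypothetical irregular edges and sample data, chosen only for illustration.
edges = np.array([0.0, 1.0, 2.0, 4.0, 10.0])
data = np.array([0.5, 1.5, 1.7, 3.0, 9.0])

counts, edges = np.histogram(data, bins=edges)

# Fragile: assumes every bin has the same width (range / number of bins).
n_bins = len(counts)
width = (edges[-1] - edges[0]) / n_bins
centers_assumed = edges[0] + width * (np.arange(n_bins) + 0.5)

# Robust: midpoints of the edges that np.histogram actually returned.
centers_from_edges = 0.5 * (edges[:-1] + edges[1:])
```

With regular bins the two results agree; with irregular bins only the edge-based midpoints line up with the counts.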
Thanks for taking the time to write up your work!
On Mon, Feb 10, 2020 at 10:53 PM
Today's Topics:
1. Proposal - extend histograms api to allow uneven bins (Alexander Reeves)
2. Re: NEP 38 - Universal SIMD intrinsics (Ralf Gommers)
3. Re: NEP 38 - Universal SIMD intrinsics (Matti Picus)
----------------------------------------------------------------------
Message: 1
Date: Mon, 10 Feb 2020 18:07:40 -0800
From: Alexander Reeves
To: numpy-discussion@python.org
Subject: [Numpy-discussion] Proposal - extend histograms api to allow uneven bins
Message-ID:
Content-Type: text/plain; charset="utf-8"
Greetings,
I have a PR that warrants discussion according to @seberg. See https://github.com/numpy/numpy/pull/14278.
It is an enhancement that fixes a bug. The original bug is that when using the fd estimator on a dataset with small inter-quartile range and large outliers, the current codebase produces more bins than memory allows. There are several related bug reports (see #11879, #10297, #8203).
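As a rough illustration of the failure mode (not NumPy's internal code), the sketch below applies the Freedman-Diaconis rule, bin width = 2 * IQR / n^(1/3), and counts how many equal-width bins it implies. The helper name `fd_bin_count` and the sample data are hypothetical; the histogram itself is never allocated.

```python
import numpy as np

def fd_bin_count(x):
    """Number of equal-width bins implied by the Freedman-Diaconis rule.

    Hypothetical helper for illustration; it only counts bins and never
    allocates them.
    """
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    width = 2.0 * (q75 - q25) / x.size ** (1.0 / 3.0)
    if width == 0:
        return 1  # degenerate case: all of the central data is identical
    return int(np.ceil((x.max() - x.min()) / width))

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 1e-3, size=1000)   # small inter-quartile range
with_outlier = np.append(tight, 1e9)       # same data plus one huge outlier

n_tight = fd_bin_count(tight)
n_outlier = fd_bin_count(with_outlier)
```

The single outlier barely changes the IQR, so the bin width stays tiny while the range grows by many orders of magnitude, and the implied bin count grows with it.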
In terms of scope, I restricted my changes to the conditions where np.histogram(bins='auto') falls back to the 'fd' estimator. The fix itself extends the API: following a suggestion from @eric-wieser, it merges empty histogram bins. In practice this solves the outsized bins issue.
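The idea of merging empty bins can be sketched as follows. This is an assumption about the general approach, not the code in the PR, and `merge_empty_bins` is a hypothetical helper: each run of consecutive empty bins is collapsed into a single empty bin by dropping the edges inside the run.

```python
import numpy as np

def merge_empty_bins(counts, edges):
    """Collapse each run of consecutive empty bins into one bin.

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts)
    edges = np.asarray(edges)
    empty = counts == 0
    # drop[i] is True when bins i and i+1 are both empty, so the interior
    # edge between them (edges[i + 1]) can be removed.
    drop = empty[:-1] & empty[1:]
    keep_edge = np.concatenate(([True], ~drop, [True]))
    # A bin survives unless it was merged into an empty bin before it.
    keep_bin = np.concatenate(([True], ~drop))
    return counts[keep_bin], edges[keep_edge]

counts = np.array([3, 0, 0, 0, 2])
edges = np.arange(6.0)  # regular edges 0, 1, ..., 5
new_counts, new_edges = merge_empty_bins(counts, edges)
```

The result has irregular edges, which is why code that later passes them back via `np.histogram(other_data, bins=new_edges)` gets uneven bins whether or not it expects them.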
However @seberg is concerned that extending the API in this way may not be the way to go. For example, if you use "auto" once, and then re-use the bins, the uneven bins may not be what you want.
Furthermore, @eric-wieser is concerned that there may be a floating-point devil in the details. He advocates using the Hypothesis property-based testing package to increase our confidence that the current implementation adequately handles corner cases.
I would like to do my part in improving the code base. I don't have strong opinions but I have to admit that I would like to eventually make a PR that resolves these bugs. This has been a PR half a year in the making after all.
Thoughts?
-areeves87