Re: [Numpy-discussion] Proposal - extend histograms api to allow uneven bins

11 Feb 2020

Just a few thoughts re: the changes proposed in
https://github.com/numpy/numpy/pull/14278

1. Though the PR is limited to the `bins='auto'` case, the issue of
potential memory problems for the automated binning methods is a more
general one (e.g. #15332, https://github.com/numpy/numpy/issues/15332).
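
To make the scale of the memory problem concrete, here is a minimal sketch
that computes the bin count implied by the Freedman-Diaconis rule (bin width
2 * IQR / n^(1/3)) without allocating any bins. The helper name
`fd_bin_count`, the sample data, and the single-bin fallback for a zero IQR
are my own assumptions for illustration; this is not NumPy's internal code.

```python
import math
import numpy as np

def fd_bin_count(data):
    """Bin count implied by the Freedman-Diaconis rule (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    q75, q25 = np.percentile(data, [75, 25])
    width = 2.0 * (q75 - q25) / data.size ** (1.0 / 3.0)
    if width == 0:
        # Assumption for this sketch only: fall back to a single bin.
        return 1
    return int(math.ceil((data.max() - data.min()) / width))

# Tightly clustered values plus one large outlier.
core = np.linspace(0.0, 1e-3, 1000)
data = np.append(core, 1e6)

print(fd_bin_count(core))  # a modest number of bins
print(fd_bin_count(data))  # billions of bins, far more than fits in memory
```

The outlier stretches the data range by nine orders of magnitude while the
IQR, and hence the bin width, barely changes.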

2. The main concern that jumps out to me is downstream users who are
relying on the implicit assumption of regular binning. This is of course
bad practice, and makes even less sense when using one of the bin
estimators, so I'm not sure how big a concern it is. Still, such code
likely exists, especially since, as far as I know, NumPy has never
implemented binning techniques that return irregular bins.
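
Downstream code that really does need regular bins could check for them
explicitly rather than assume them. A minimal sketch, where
`has_uniform_bins` and its tolerance are hypothetical names and values of my
own, not part of NumPy:

```python
import numpy as np

def has_uniform_bins(edges, rtol=1e-9):
    """Return True if all bin widths agree to within a relative tolerance."""
    widths = np.diff(np.asarray(edges, dtype=float))
    return bool(np.allclose(widths, widths[0], rtol=rtol, atol=0.0))

# Edges from an integer bin count are evenly spaced.
regular = np.histogram_bin_edges([0.0, 1.0, 2.0, 3.0], bins=4)
# Hand-written edges with a wide last bin.
irregular = np.array([0.0, 1.0, 2.0, 10.0])

print(has_uniform_bins(regular))
print(has_uniform_bins(irregular))
```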

3. The astropy project has at least one estimator that returns irregular
bins (https://docs.astropy.org/en/stable/visualization/histogram.html#). I
checked their issues
(https://github.com/astropy/astropy/issues?utf8=%E2%9C%93&q=is%3Aissue+histogram)
for anything related to irregular binning: though they have many of the
same problems with the automatic bin estimators (i.e. memory problems for
inputs with outliers), I didn't see anything specifically related to
irregular binning.

I just wanted to add my two cents. The binning-data-with-outliers problem
is very common in high-resolution spectroscopy, and I have seen
practitioners rely on the assumption of regular binning (e.g. divide the
`range` by the number of bins) to specify bin centers even though this is
not the right way to do things.
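
For reference, a minimal sketch of deriving bin centers and widths from the
edges that `np.histogram` returns, which works whether or not the bins are
regular. The sample data and edges are made up for illustration.

```python
import numpy as np

# Irregular edges: two narrow bins and one wide one.
edges = np.array([0.0, 1.0, 2.0, 10.0])
data = np.array([0.2, 0.7, 1.5, 3.0, 9.0])
counts, edges = np.histogram(data, bins=edges)

# Robust: midpoints and widths computed from the returned edges.
centers = 0.5 * (edges[:-1] + edges[1:])
widths = np.diff(edges)

# Fragile: assumes every bin has width (max - min) / n_bins.
n = len(counts)
naive_width = (edges[-1] - edges[0]) / n
naive_centers = edges[0] + naive_width * (np.arange(n) + 0.5)

print(centers)        # midpoints of the actual bins
print(naive_centers)  # only correct when the bins are regular
```

The two results agree for regular bins and diverge as soon as the bin widths
differ.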

Thanks for taking the time to write up your work!

On Mon, Feb 10, 2020 at 10:53 PM 
wrote:
...
Message: 1
Date: Mon, 10 Feb 2020 18:07:40 -0800
From: Alexander Reeves 
To: numpy-discussion@python.org
Subject: [Numpy-discussion] Proposal - extend histograms api to allow
        uneven bins
Greetings,

I have a PR that warrants discussion according to @seberg. See
https://github.com/numpy/numpy/pull/14278.

It is an enhancement that fixes a bug. The original bug is that when using
the fd estimator on a dataset with a small inter-quartile range and large
outliers, the current codebase produces more bins than memory allows. There
are several related bug reports (see #11879, #10297, #8203).

In terms of scope, I restricted my changes to conditions where
np.histogram(bins='auto') defaults to the 'fd' estimator. For the fix
itself, I enhanced the API: I used a suggestion from @eric-wieser to merge
empty histogram bins. In practice this solves the outsized bins issue.

However, @seberg is concerned that extending the API in this way may not be
the way to go. For example, if you use "auto" once, and then re-use the
bins, the uneven bins may not be what you want.

Furthermore, @eric-wieser is concerned that there may be a floating-point
devil in the details. He advocates using the hypothesis testing package to
increase our confidence that the current implementation adequately handles
corner cases.

I would like to do my part in improving the code base. I don't have strong
opinions, but I have to admit that I would like to eventually make a PR that
resolves these bugs. This has been a PR half a year in the making, after
all.

Thoughts?

-areeves87

Ross Barnowski