[Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

Mon Aug 30 13:00:18 EDT 2010

I tend to agree with Josef here,

To me, bincount and digitize are the low-level functions, and histogram
contains a bit more functionality since it's used so often and in many use
cases. My guess is that if we removed the normalization, it could annoy a
lot of people and would quickly appear on the desired feature list.
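
To illustrate the layering described above, here is a minimal sketch of
bin counts built directly from digitize and bincount. The sample data and
bin edges are made up for illustration, and the samples are assumed to lie
strictly inside the bin range (histogram handles the right-most edge and
out-of-range values itself; this sketch does not).

```python
import numpy as np

# Hypothetical sample data and non-uniform bin edges (illustration only).
samples = np.array([0.5, 1.5, 1.6, 2.5, 3.5, 3.6, 3.7, 3.8])
bin_edges = np.array([0.0, 1.0, 2.0, 4.0])

# digitize returns, for each sample, the 1-based index of the bin it falls in.
bin_index = np.digitize(samples, bin_edges) - 1

# bincount tallies how many samples landed in each bin.
low_level_counts = np.bincount(bin_index, minlength=len(bin_edges) - 1)

# histogram gives the same counts in one call.
hist_counts, _ = np.histogram(samples, bins=bin_edges)
```

For in-range data both routes give the same counts; histogram adds the
edge handling and the optional normalization on top.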

Just to put things in perspective, this was indeed a trivial bug that
required a one-line fix. It only affected use cases with non-uniform bin
widths and normed=True, a combination that is probably uncommon. I believe
it is a genuine bug, not just a confusing behavior, and that's why I
initially thought a warning was unnecessary.
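
As a sketch of what the correct normalization looks like with non-uniform
bins: each count is divided by the total number of samples times the width
of its own bin, so that the bar areas sum to one. The data and edges are
made up for illustration, and the comparison uses the density keyword
discussed in this thread, so it assumes a NumPy version that provides it.

```python
import numpy as np

# Hypothetical sample data; the last bin is twice as wide as the others.
data = np.array([0.5, 1.5, 1.6, 2.5, 3.5, 3.6, 3.7, 3.8])
edges = np.array([0.0, 1.0, 2.0, 4.0])

counts, _ = np.histogram(data, bins=edges)
widths = np.diff(edges)

# Density normalization: divide by (total count * width of each bin).
manual_density = counts / (counts.sum() * widths)

# The area under the histogram (sum of height * width) should be 1.
area = np.sum(manual_density * widths)

# Compare with the density computed by histogram itself.
numpy_density, _ = np.histogram(data, bins=edges, density=True)
```

Dividing by the total count alone gives heights that sum to one, but the
area under the bars equals one only when every bin has unit width.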

In any case, I'm not sure this is really a "while we're at it" situation,
that is, I think the switch from "normed" to "density" should be addressed
in another context. That would allow us to include the bug fix (with a
warning) in the upcoming 1.5 release.

David H.

On Mon, Aug 30, 2010 at 11:50 AM, <josef.pktd at gmail.com> wrote:

> On Mon, Aug 30, 2010 at 11:39 AM, Bruce Southey <bsouthey at gmail.com> wrote:
> > On 08/30/2010 09:19 AM, Benjamin Root wrote:
> >
> > On Mon, Aug 30, 2010 at 8:29 AM, David Huard <david.huard at gmail.com> wrote:
> >>
> >> Thanks for the feedback,
> >> As far as I understand it, the proposal is to keep histogram as it is
> >> for 1.5, then in 2.0, deprecate normed=True but keep the buggy
> >> behavior, while adding a density keyword that fixes the bug. In a later
> >> release, we could then get rid of normed. While the bug won't be
> >> present in histogramdd and histogram2d, the keyword change should be
> >> mirrored in those functions as well.
> >> I personally am not too keen on replacing the keyword normed with
> >> density. I feel we are trading clarity for a few new users against
> >> additional trouble for many existing users. We could mitigate this by
> >> first documenting the change in the docstring and living with both
> >> keywords for a few years before raising a DeprecationWarning.
> >> Since this has a direct impact on matplotlib's hist, I'd be keen to
> >> hear from the devs on this.
> >> David
> >
> > I am not a dev, but I would like to give a word of warning from
> > matplotlib.
> >
> > In matplotlib, the bar/hist family of functions grew organically as the
> > devs took on various requests to add keywords and such to modify the
> > style and behavior of those graphing functions.  It has now become an
> > unmaintainable mess, prompting discussions on how to rip it out and
> > replace it with a cleaner implementation.  While everyone agrees that
> > it needs to be done, none of us wants to break backwards compatibility.
> >
> > My personal feeling is that a function should do one thing, and do that
> > one thing well.  So, to me, that means that histogram() should return
> > an array of counts and the bins for those counts.  Anything more is
> > merely window dressing to me.  With this information, one can easily
> > compute a cumulative distribution function, and/or normalize the
> > result.  The idea is that if there is nothing special that needs to be
> > done within the histogram algorithm to accommodate these extra
> > features, then they belong outside the function.
> >
> > My 2 cents,
> > Ben Root
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> > +1 for Ben's approach.
> > This is very similar to my view regarding the contingency table class
> > proposed for scipy (http://projects.scipy.org/scipy/ticket/1258). We
> > need to provide the core functionality that other approaches such as
> > density estimation can use but not be limited to specific details.
>
> I think (a corrected) density histogram is core functionality for
> unequal bin lengths.
>
> The graph with raw count in the case of unequal bin sizes would be
> quite misleading when plotted and interpreted on the real line and not
> on discrete points (shaded areas instead of vertical lines). And as
> the origin of this thread showed, it's not trivial to figure out what
> the correct normalization is.
> So, I think, if we drop the density normalization, we just need a new
> function that does it.
>
> My 2c,
>
> Josef
>
>
> >
> > Bruce