[Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

Benjamin Root ben.root at ou.edu
Mon Aug 30 14:43:07 EDT 2010


On Mon, Aug 30, 2010 at 10:50 AM, <josef.pktd at gmail.com> wrote:

> On Mon, Aug 30, 2010 at 11:39 AM, Bruce Southey <bsouthey at gmail.com>
> wrote:
> > On 08/30/2010 09:19 AM, Benjamin Root wrote:
> >
> > On Mon, Aug 30, 2010 at 8:29 AM, David Huard <david.huard at gmail.com>
> wrote:
> >>
> >> Thanks for the feedback,
> >> As far as I understand it, the proposal is to keep histogram as it is
> >> for 1.5, then in 2.0, deprecate normed=True but keep the buggy
> >> behavior, while adding a density keyword that fixes the bug. In a
> >> later release, we could then get rid of normed. While the bug won't
> >> be present in histogramdd and histogram2d, the keyword change should
> >> be mirrored in those functions as well.
> >> I personally am not too keen on changing the keyword from normed to
> >> density. I feel we are trading clarity for a few new users against
> >> additional trouble for many existing users. We could mitigate this by
> >> first documenting the change in the docstring and living with both
> >> keywords for a few years before raising a DeprecationWarning.
> >> Since this has a direct impact on matplotlib's hist, I'd be keen to
> >> hear from the devs on this.
> >> David
> >
> > I am not a dev, but I would like to give a word of warning from
> > matplotlib.
> >
> > In matplotlib, the bar/hist family of functions grew organically as the
> > devs took on various requests to add keywords and such to modify the
> > style and behavior of those graphing functions.  It has now become an
> > unmaintainable mess, prompting discussions on how to rip it out and
> > replace it with a cleaner implementation.  While everyone agrees that
> > it needs to be done, none of us wants to break backwards compatibility.
> >
> > My personal feeling is that a function should do one thing, and do that
> > one thing well.  So, to me, that means that histogram() should return
> > an array of counts and the bins for those counts.  Anything more is
> > merely window dressing to me.  With this information, one can easily
> > compute a cumulative distribution function, and/or normalize the
> > result.  The idea is that if there is nothing special that needs to be
> > done within the histogram algorithm to accommodate these extra
> > features, then they belong outside the function.
> >
> > My 2 cents,
> > Ben Root
> >
> > +1 for Ben's approach.
> > This is very similar to my view regarding the contingency table class
> > proposed for scipy ( http://projects.scipy.org/scipy/ticket/1258). We
> > need to provide the core functionality that other approaches such as
> > density estimation can use, without being limited to specific details.
>
> I think (a corrected) density histogram is core functionality for
> unequal bin lengths.
>
> The graph with raw count in the case of unequal bin sizes would be
> quite misleading when plotted and interpreted on the real line and not
> on discrete points (shaded areas instead of vertical lines). And as
> the origin of this thread showed, it's not trivial to figure out what
> the correct normalization is.
> So, I think, if we drop the density normalization, we just need a new
> function that does it.
>
> My 2c,
>
> Josef
>
>
>
Why not a function that takes the output of a core histogram and produces a
correct density normalization?  Such a function would be useful elsewhere, I
imagine.
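
To make the idea concrete, here is a minimal sketch of what such a
function might look like. The name density_from_counts is hypothetical
(it is not an existing NumPy function), and the sketch assumes the usual
definition of a density histogram: each count is divided by the total
count times the width of its own bin, so that the bars integrate to 1
even when the bins are unequal.

```python
import numpy as np


def density_from_counts(counts, bin_edges):
    """Hypothetical helper: convert raw histogram counts to a density.

    Each count is divided by (total count * width of its bin), so the
    result integrates to 1 over the binned range, also for unequal bins.
    """
    counts = np.asarray(counts, dtype=float)
    widths = np.diff(np.asarray(bin_edges, dtype=float))
    total = counts.sum()
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    return counts / (total * widths)


# Made-up sample data and deliberately unequal bin widths (1, 1, 2, 4).
data = np.array([0.1, 0.4, 0.6, 0.9, 1.5, 2.5, 3.5, 7.0])
edges = np.array([0.0, 1.0, 2.0, 4.0, 8.0])

# The core histogram only returns counts and bin edges.
counts, edges = np.histogram(data, bins=edges)

# Normalization happens outside the histogram function.
density = density_from_counts(counts, edges)

# Area under the density histogram; 1 up to floating-point rounding.
area = np.sum(density * np.diff(edges))

# The empirical CDF at the right bin edges also needs only the counts.
cdf = np.cumsum(counts) / counts.sum()
```

The same counts and edges feed both the density and the cumulative
distribution, which is the point of keeping the core function minimal.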

Of course there are a lot of legacy issues to consider, but if we first
introduce such a function, with documentation in histogram() showing how to
produce a normalized density, we can then keep some of the bad code for now
for backwards compatibility, with notes saying that some of it will be
deprecated.  In particular, the docs should point out where the current code
fails to produce the correct results.

Ben Root