[Numpy-discussion] numpy histogram normed=True (bug / confusing behavior)

josef.pktd at gmail.com josef.pktd at gmail.com
Mon Aug 30 16:30:26 EDT 2010


On Mon, Aug 30, 2010 at 3:44 PM, David Huard <david.huard at gmail.com> wrote:
>
>
> On Mon, Aug 30, 2010 at 3:02 PM, <josef.pktd at gmail.com> wrote:
>>
>> On Mon, Aug 30, 2010 at 2:43 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> > On Mon, Aug 30, 2010 at 10:50 AM, <josef.pktd at gmail.com> wrote:
>> >>
>> >> On Mon, Aug 30, 2010 at 11:39 AM, Bruce Southey <bsouthey at gmail.com>
>> >> wrote:
>> >> > On 08/30/2010 09:19 AM, Benjamin Root wrote:
>> >> >
>> >> > On Mon, Aug 30, 2010 at 8:29 AM, David Huard <david.huard at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Thanks for the feedback,
>> >> >> As far as I understand it, the proposition is to keep histogram
>> >> >> as it is for 1.5, then in 2.0, deprecate normed=True but keep
>> >> >> the buggy behavior, while adding a density keyword that fixes
>> >> >> the bug. In a later release, we could then get rid of normed.
>> >> >> While the bug won't be present in histogramdd and histogram2d,
>> >> >> the keyword change should be mirrored in those functions as
>> >> >> well.
>> >> >> I personally am not too keen on replacing the keyword normed
>> >> >> with density. I feel we are trading clarity for a few new users
>> >> >> against additional trouble for many existing users. We could
>> >> >> mitigate this by first documenting the change in the docstring
>> >> >> and living with both keywords for a few years before raising a
>> >> >> DeprecationWarning.
>> >> >> Since this has a direct impact on matplotlib's hist, I'd be keen
>> >> >> to hear from the devs on this.
>> >> >> David
>> >> >
>> >> > I am not a dev, but I would like to give a word of warning from
>> >> > matplotlib.
>> >> >
>> >> > In matplotlib, the bar/hist family of functions grew organically
>> >> > as the devs took on various requests to add keywords and such to
>> >> > modify the style and behavior of those graphing functions.  It
>> >> > has now become an unmaintainable mess, prompting discussions on
>> >> > how to rip it out and replace it with a cleaner implementation.
>> >> > While everyone agrees that it needs to be done, none of us wants
>> >> > to break backwards compatibility.
>> >> >
>> >> > My personal feeling is that a function should do one thing, and
>> >> > do that one thing well.  So, to me, that means that histogram()
>> >> > should return an array of counts and the bins for those counts.
>> >> > Anything more is merely window dressing to me.  With this
>> >> > information, one can easily compute a cumulative distribution
>> >> > function, and/or normalize the result.  The idea is that if
>> >> > there is nothing special that needs to be done within the
>> >> > histogram algorithm to accommodate these extra features, then
>> >> > they belong outside the function.
>> >> >
>> >> > My 2 cents,
>> >> > Ben Root
>> >> >
>> >> > _______________________________________________
>> >> > NumPy-Discussion mailing list
>> >> > NumPy-Discussion at scipy.org
>> >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >> >
>> >> > +1 for Ben's approach.
>> >> > This is very similar to my view regarding the contingency table
>> >> > class proposed for scipy
>> >> > (http://projects.scipy.org/scipy/ticket/1258). We need to provide
>> >> > the core functionality that other approaches such as density
>> >> > estimation can use, but not be limited to specific details.
>> >>
>> >> I think (a corrected) density histogram is core functionality for
>> >> unequal bin lengths.
>> >>
>> >> The graph with raw count in the case of unequal bin sizes would be
>> >> quite misleading when plotted and interpreted on the real line and not
>> >> on discrete points (shaded areas instead of vertical lines). And as
>> >> the origin of this thread showed, it's not trivial to figure out what
>> >> the correct normalization is.
>> >> So, I think, if we drop the density normalization, we just need a new
>> >> function that does it.
>> >>
>> >> My 2c,
>> >>
>> >> Josef
>> >>
>> >>
>> >
>> > Why not a function that takes the output of a core histogram and
>> > produces a correct density normalization?  Such a function would be
>> > useful elsewhere, I imagine.
>> >
>> > Of course there are a lot of legacy issues to consider, but if we
>> > introduce such a function first, with documentation in histogram()
>> > showing how to produce a normalized density, we can then keep some of
>> > the bad code for now for backwards compatibility, with notes saying
>> > that some of the stuff will be deprecated.  We should especially
>> > point out in the docs where the current code fails to produce the
>> > correct results.
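
For concreteness, a minimal sketch of what such a post-processing
function might look like. The name density_from_counts and the sample
data are made up for illustration; this is not existing numpy API. The
idea is only that each count is divided by the total count times the
width of its own bin, so that unequal bins still integrate to one.

```python
import numpy as np

def density_from_counts(counts, edges):
    """Hypothetical helper: turn raw bin counts into a density.

    Each count is divided by (total count * width of its bin), so the
    area under the resulting step function is 1 even for unequal bins.
    """
    counts = np.asarray(counts, dtype=float)
    widths = np.diff(edges)
    return counts / (counts.sum() * widths)

# Made-up sample data and deliberately unequal bin edges.
data = np.array([0.1, 0.2, 0.3, 0.4, 1.5, 2.5, 3.5, 3.9])
edges = np.array([0.0, 0.5, 2.0, 4.0])

# Core histogram: raw counts and the edges that produced them.
counts, edges = np.histogram(data, bins=edges)

density = density_from_counts(counts, edges)
total_area = np.sum(density * np.diff(edges))

# The cumulative distribution at the right bin edges needs counts only.
cdf = np.cumsum(counts) / counts.sum()
```

Here total_area comes out as 1 up to rounding, which is the property
the normalization is supposed to guarantee.
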
>>
>> Bugfix or redesign?
>>
>> My feature request for (or target for forking) the histogram functions
>> is to get the temporary results out, or get additional results, for
>> example the bin-number or quantization for each observation, or some
>> other things that I don't remember right now.
>>
>> With histogram functions that only do histograms, we lose a lot of
>> calculations. This is, however, not really relevant for calculating
>> densities since the bin edges are returned.
>>
>
> Not sure I'm understanding what you mean by this, but if you look at the
> code, you'll see that histogram is basically a big wrapper around a
> one-liner: np.diff(np.searchsorted(np.sort(data), bins)). Most of the code
> is there to make this one-liner user-friendly, improve performance or handle
> weights.

Maybe it only applies to histogramdd. I tried to take it apart to see
how it works after a discussion on the numpy mailing list, "2d binning
and linear regression" on June 20th.

I haven't looked at 1D histogram in a while, and it's easy to get a
(slower) replacement for it. From a quick look at histogram with
weights I cannot figure out if it's possible to recover the bin
assignment of an observation as a byproduct.
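
As a workaround, the bin assignment can be computed in a separate pass
rather than as a byproduct. The sketch below uses np.digitize and
np.bincount on made-up data and weights, and assumes every observation
lies inside the half-open range [bins[0], bins[-1]); it is not how
histogram itself is implemented.

```python
import numpy as np

# Made-up observations and weights, all inside [bins[0], bins[-1]).
data = np.array([0.3, 1.1, 1.9, 0.8, 3.2, 2.2])
weights = np.array([1.0, 2.0, 0.5, 1.5, 3.0, 1.0])
bins = np.array([0.0, 1.0, 2.0, 4.0])

# digitize returns i with bins[i-1] <= x < bins[i]; subtracting 1
# gives a zero-based bin number for each observation.
bin_index = np.digitize(data, bins) - 1

# Summing the weights per bin number reproduces a weighted histogram.
weighted = np.bincount(bin_index, weights=weights, minlength=len(bins) - 1)

# Cross-check against the full function.
reference, _ = np.histogram(data, bins=bins, weights=weights)
```

With bin_index in hand, other per-bin statistics (means, variances)
can be computed from the same assignment without binning again.
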

Josef

> I just added a warning alerting concerned users (r8674), so this takes care
> of the bug fix and Nils' wish to avoid a silent change in behavior. These two
> changes could be included in 1.5 if Ralf feels this is worthwhile.
> Cheers,
> David H.
>
>>
>> Josef
>>
>>
>> >
>> > Ben Root


