[Numpy-discussion] Ticket #605 Incorrect behavior of numpy.histogram

David Huard david.huard at gmail.com
Tue Apr 8 09:12:06 EDT 2008


Hans,

Note that the current histogram is buggy, in the sense that it assumes that
all bins have the same width and computes db = bins[1]-bins[0] when
normalizing. With -inf as the first edge, db is infinite, which is why you
get zeros everywhere.
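
To make the normalization problem concrete, here is a minimal sketch that
mimics the behavior described above. It is a reconstruction inferred from the
outputs in Hans's session quoted below, not the actual numpy source, and
legacy_histogram is a hypothetical name used only for illustration.

```python
import numpy as np


def legacy_histogram(a, bins, normed=False):
    """Hypothetical reconstruction of the behavior discussed in this thread."""
    a = np.sort(np.asarray(a, dtype=float))
    bins = np.asarray(bins, dtype=float)
    # Each entry of `bins` is treated as a left edge; the last bin is
    # open-ended on the right and collects everything above it.
    idx = np.concatenate([a.searchsorted(bins), [a.size]])
    counts = np.diff(idx)
    if not normed:
        return counts, bins
    # The bug: every bin is assumed to be as wide as the first one.
    db = bins[1] - bins[0]
    return counts / (a.size * db), bins


# Reproduces the two normed outputs quoted in this thread.
print(legacy_histogram(np.arange(10), [2, 3, 4], normed=True)[0])
print(legacy_histogram(np.arange(10), [-np.inf, 2, 3, 4], normed=True)[0])
```

With [-inf, 2, 3, 4] the first width is infinite, so every count is divided by
infinity and the whole result collapses to zero.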

The current behavior has been heavily criticized and I think we should
change it. My proposal is to have for histogram the same behavior as for
histogramdd and histogram2d: bins are the bin edges, including the rightmost
edge, and values outside of the bins are not tallied. The problem with this
is that it breaks code, and I'm not sure it's such a good idea to do this in
a point release.
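
For readers on a recent NumPy, the edge-based semantics can be tried
directly, since later releases of numpy.histogram treat bins as edges. The
sketch below assumes such a release (one that has the density keyword); it
shows today's behavior and is not a patch for 1.0.5. Note that in current
NumPy the last bin also includes its right edge, a detail that goes beyond
what is proposed above.

```python
import numpy as np

data = np.arange(10)
edges = [2, 3, 4]  # two bins: [2, 3) and [3, 4]

# Values outside [2, 4] are not tallied.
counts, _ = np.histogram(data, bins=edges)

# Each count is divided by (total tallied * width of its own bin),
# so the result integrates to 1 over the binned range.
density, _ = np.histogram(data, bins=edges, density=True)

print(counts)   # [1 2]
print(density)  # approximately [0.333 0.667]
```

Only the values 2, 3 and 4 are counted here, which is the code-breaking
difference from the open-ended last bin of the current function.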

My short term proposal would be to fix the normalization bug and document
the current behavior of histogram for the 1.0.5 release. Once it's done, we
can modify histogram and maybe print a warning the first time it's used to
notify users of the change.

I'd like to hear the voice of experienced devs on this. This issue has been
raised a number of times since I started following this ML. It's not the first time
I've proposed patches, and I've already documented the weird behavior only
to see the comments disappear after a while. I hope this time some kind of
agreement will be reached.

Regards,

David




2008/4/8, Hans Meine <meine at informatik.uni-hamburg.de>:
>
> On Monday, 07 April 2008 14:34:08, Hans Meine wrote:
>
> > On Saturday, 05 April 2008 21:54:27, Anne Archibald wrote:
> > > There's also a fourth option - raise an exception if any points are
> > > outside the range.
> >
> > +1
> >
> > I think this should be the default.  Otherwise, I tend towards "exclude",
> > in order to have comparable bin sizes (when plotting, I always find peaks
> > at the ends annoying); this could also be called "clip" BTW.
> >
> > But really, an exception would follow the Zen: "In the face of ambiguity,
> > refuse the temptation to guess."  And with a kwarg: "Explicit is better
> > than implicit."
>
>
> When posting this, I did indeed not think this through fully; as David (and
> Tommy) pointed out, this API does not fit well with the existing `bins`
> option, especially when a sequence of bin bounds is given.  (I guess I was
> mostly thinking about the special case of discrete values and 1:1 bins, as
> typical for uint8 data.)
>
> Thus, I would like to withdraw my above opinion and instead state that I
> find the current API as clear as it gets.  If you want to exclude values,
> simply pass an additional right bound, and for including outliers,
> passing -inf as additional left bound seems to do the trick.  This could
> possibly be added to the documentation though.
>
> The only critical aspect I see is the `normed` arg.  As it is now, the
> rightmost bin always has infinite size, but it is not treated like that:
>
> In [1]: from numpy import *
>
> In [2]: histogram(arange(10), [2,3,4], normed = True)
> Out[2]: (array([ 0.1,  0.1,  0.6]), array([2, 3, 4]))
>
> Even worse, if you try to add an infinite bin to the left, this pulls all
> values to zero (technically, I understand that, but it looks really
> undesirable to me):
>
> In [3]: histogram(arange(10), [-inf, 2,3,4], normed = True)
> Out[3]: (array([ 0.,  0.,  0.,  0.]), array([-Inf,   2.,   3.,   4.]))
>
>
> --
> Ciao, /  /
>      /--/
>     /  / ANS