[Numpy-discussion] Histogram does not preserve subclasses of ndarray (e.g. masked arrays)

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Sep 2 18:31:14 EDT 2010


On Thu, Sep 2, 2010 at 3:50 PM, Joe Kington <jkington at wisc.edu> wrote:
> Hi all,
>
> I just wanted to check if this would be considered a bug.
>
> numpy.histogram does not appear to preserve subclasses of ndarrays (e.g.
> masked arrays).  This leads to considerable problems when working with
> masked arrays. (As per this Stack Overflow question)
>
> E.g.
>
> import numpy as np
> x = np.arange(100)
> x = np.ma.masked_where(x > 30, x)
>
> counts, bin_edges = np.histogram(x)
>
> yields:
> counts --> array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])
> bin_edges --> array([  0. ,   9.9,  19.8,  29.7,  39.6,  49.5,  59.4,
> 69.3,  79.2, 89.1,  99. ])
>
> I would have expected histogram to ignore the masked portion of the data.
> Is this a bug, or expected behavior?  I'll open a bug report, if it's not
> expected behavior...

If you want to ignore masked data, it's just one extra function call:

histogram(m_arr.compressed())

I don't think the fact that this makes an extra copy will be relevant,
because I guess full masked array handling inside histogram will be a
lot more expensive.
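
As a minimal sketch of the compressed() approach, reusing the array from
the quoted example (the name `valid` is only illustrative), the masked
entries are dropped before histogram ever sees the data:

```python
import numpy as np

# Same data as in the quoted example: values above 30 are masked.
x = np.arange(100)
m_arr = np.ma.masked_where(x > 30, x)

# compressed() returns a plain 1-D ndarray (a copy) holding only the
# unmasked values, here 0 through 30.
valid = m_arr.compressed()

counts, bin_edges = np.histogram(valid)

print(valid.size)     # number of unmasked observations
print(counts.sum())   # every unmasked observation lands in some bin
print(bin_edges[0], bin_edges[-1])
```

With this, the bin edges span only the unmasked range instead of 0 to 99.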

Using asanyarray would also let in matrices and other subtypes that
might not be handled correctly by the histogram calculations.
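
To make the asarray/asanyarray difference concrete, here is a small
sketch using the masked array from the quoted example. It only shows
what each conversion returns; it says nothing about what histogram
would then do with the result:

```python
import numpy as np

x = np.arange(100)
m_arr = np.ma.masked_where(x > 30, x)

# asarray converts to a base ndarray, so the mask is lost.
as_base = np.asarray(m_arr)

# asanyarray passes ndarray subclasses through, so the mask survives.
as_any = np.asanyarray(m_arr)

print(type(as_base).__name__)   # ndarray
print(type(as_any).__name__)    # MaskedArray

# The maximum differs because only the subclass honours the mask.
print(as_base.max(), as_any.max())   # 99 30
```

Any code downstream of asanyarray then has to cope with whatever the
subclass does in min, max, sort and so on, which is the concern here.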

For anything else besides dropping masked observations, it would be
necessary to figure out what the masked array definition of a
histogram is, as Bruce pointed out.

(Another interesting question would be whether histogram handles nans
correctly; what does searchsorted do with them ???)
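
Whatever histogram does with nans internally (not asserted here, and it
may differ between NumPy versions), one way to sidestep the question is
to filter them out first. A sketch with made-up data, where `data` and
`finite` are illustrative names:

```python
import numpy as np

# Hypothetical sample with two missing observations encoded as nan.
data = np.array([0.5, 1.5, np.nan, 2.5, np.nan, 3.5])

# Keep only finite values so neither nan nor inf reaches histogram.
finite = data[np.isfinite(data)]

# An explicit range keeps the bin edges independent of the data.
counts, bin_edges = np.histogram(finite, bins=4, range=(0.0, 4.0))

print(counts)   # one observation in each of the four unit-wide bins
```

This mirrors the compressed() suggestion for masked arrays: drop the
invalid observations explicitly, then histogram what is left.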

Josef

>
> This would appear to be easily fixed by using asanyarray rather than asarray
> within histogram.  E.g. this diff for numpy/lib/function_base.py
> Index: function_base.py
> ===================================================================
> --- function_base.py    (revision 8604)
> +++ function_base.py    (working copy)
> @@ -132,9 +132,9 @@
>
>      """
>
> -    a = asarray(a)
> +    a = asanyarray(a)
>      if weights is not None:
> -        weights = asarray(weights)
> +        weights = asanyarray(weights)
>          if np.any(weights.shape != a.shape):
>              raise ValueError(
>                      'weights should have the same shape as a.')
> @@ -156,7 +156,7 @@
>              mx += 0.5
>          bins = linspace(mn, mx, bins+1, endpoint=True)
>      else:
> -        bins = asarray(bins)
> +        bins = asanyarray(bins)
>          if (np.diff(bins) < 0).any():
>              raise AttributeError(
>                      'bins must increase monotonically.')
>
> Thanks!
> -Joe


