Okay, here's my attempt at an *uncontroversial* email!
Specifically, I think it'll be easier to talk about this NA stuff if
we can establish some common ground, and easier for people to follow
if the basic points of agreement are laid out in one place. So I'm
going to try and summarize just the things that we can agree about.
Note that right now I'm *only* talking about what kind of tools we
want to give the user -- i.e., what kind of problems we are trying to
solve. AFAICT we don't have as much consensus on implementation
matters, and anyway it's hard to make implementation decisions without
knowing what we're trying to accomplish.
1) I think we have consensus that there are (at least) two different
possible ways of thinking about this problem, with somewhat different
constituencies. Let's call these two concepts "MISSING data" and
"IGNORED data".
2) I also think we have at least a rough consensus on what these
concepts mean, and what their supporters want from them:
MISSING:
- Conceptually, MISSINGness acts like a property of a datum --
assigning MISSING to a location is like assigning any other value to
that location
- Ufuncs and other operations must propagate these values by default,
and there must be an option to cause them to be ignored
- Must be competitive with NaNs in terms of speed and memory usage (or
else people will just use NaNs)
- Compatibility with R is valuable
- To avoid user confusion, ideally it should *not* be possible to
'unmask' a missing value, since this is inconsistent with the "missing
value" metaphor (e.g., see Wes's comment about "leaky abstractions")
- Possible useful extension: having different classes of missing
values (similar to Stata)
- Target audience: data analysis with missing data, neuroimaging,
econometrics, former R users, ...
IGNORED:
- Conceptually, IGNOREDness acts like a property of the array --
toggling a location to be IGNORED is kind of vaguely similar to
changing an array's shape
- Ufuncs and other operations must ignore these values by default, and
there doesn't really need to be a way to propagate them, even as an
option (though it probably wouldn't hurt either)
- Some memory overhead is inevitable and acceptable
- Compatibility with R neither possible nor valuable
- Ability to toggle the IGNORED state of a location is critical, and
should be as convenient as possible
- Possible useful extension: having not just different types of
ignored values, but richer ways to combine them -- e.g., the example
of combining astronomical images with some kind of associated
per-pixel quality scores, where one might want the 'mask' to be not
just a boolean IGNORED/not-IGNORED flag, but an integer (perhaps a
multi-byte integer) or even a float, and to allow these 'masks' to be
combined in some more complex way than just logical_and.
- Target audience: anyone who's already doing this kind of thing by
hand using a second mask array + boolean indexing, former numpy.ma
users, matplotlib, ...
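To make the contrast between the two lists above concrete, here is a rough
sketch of the two behaviors using tools that exist today (numpy.ma standing
in for IGNORED, NaN standing in for MISSING; the proposed feature itself
might look quite different):

import numpy as np

# IGNORED-style: operations skip the value by default, and toggling the
# state is non-destructive -- the datum is still there after unmasking.
x = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(x.sum())      # 4.0 -- the masked 2.0 is ignored
x.mask[1] = False
print(x)            # [1.0 2.0 3.0] -- the original datum reappears

# MISSING-style: the value propagates by default, as NaN does now.
y = np.array([1.0, np.nan, 3.0])
print(y.sum())      # nan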
3) And perhaps we can all agree that the biggest *un*resolved question
is whether we want to:
- emphasize the similarities between these two use cases and build a
single interface that can handle both concepts, with some compromises
- or, treat these as two mostly-separate features that can each become
exactly what the respective constituency wants without compromise --
but with some potential redundancy and extra code.
Each approach has advantages and disadvantages.
Does that seem like a fair summary? Anything more we can add? Most
importantly, anything here that you disagree with? Did I summarize
your needs well? Do you have a use case that you feel doesn't fit
naturally into either category?
[Also, I thought this might make the start of a good wiki page for
people to reference during these discussions, but I don't seem to have
edit rights. If other people agree, maybe someone could put it up, or
give me access? My trac id is njs(a)pobox.com.]
I noticed this (the same expression, run on two different platforms):
In : np.int32(np.float32(2**31))
In : np.int32(np.float32(2**31))
I assume what is happening is that the casting is handed off to the C
library, and that the behavior of the C library differs on these
platforms? Should we expect or hope that this behavior would be the
same across platforms?
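For reference, a minimal sketch of what's happening (2**31 is exactly
representable as a float32, but it is one past the largest int32, and C
leaves out-of-range float-to-integer conversion undefined, so different
platforms/compilers can legitimately give different results):

import numpy as np
val = np.float32(2**31)
print(val)            # 2147483648.0 -- exactly representable in float32
print(np.int32(val))  # out of int32 range; the result is platform-dependent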
Thanks for any pointers,
I am pleased to announce the availability of the first release candidate of
SciPy 0.10.0. For this release over 100 tickets and pull requests have
been closed, and many new features have been added. Some of the highlights
are:
- support for Bento as a build system for scipy
- generalized and shift-invert eigenvalue problems in sparse.linalg
- addition of discrete-time linear systems in the signal module
Sources and binaries can be found at
http://sourceforge.net/projects/scipy/files/scipy/0.10.0rc1/, release notes
are copied below.
Please try this release and report problems on the mailing list.
Note: one problem with Python 2.5 (syntax) was discovered after tagging the
release; it's already fixed in the 0.10.x branch, so no need to report that
issue.
SciPy 0.10.0 Release Notes
.. note:: Scipy 0.10.0 is not released yet!
SciPy 0.10.0 is the culmination of 8 months of hard work. It contains
many new features, numerous bug-fixes, improved test coverage and
better documentation. There have been a limited number of deprecations
and backwards-incompatible changes in this release, which are documented
below. All users are encouraged to upgrade to this release, as there
are a large number of bug-fixes and optimizations. Moreover, our
development attention will now shift to bug-fix releases on the 0.10.x
branch, and on adding new features on the development master branch.
- Support for Bento as optional build system.
- Support for generalized eigenvalue problems, and all shift-invert modes
available in ARPACK.
This release requires Python 2.4-2.7 or 3.1 or newer, and NumPy 1.5 or greater.
Bento: new optional build system
Scipy can now be built with `Bento <http://cournape.github.com/Bento/>`_.
Bento has some nice features, like parallel builds and partial rebuilds, that
are not possible with the default build system (distutils). For usage
instructions see BENTO_BUILD.txt in the scipy top-level directory.
Currently Scipy has three build systems: distutils, numscons and Bento.
Numscons is deprecated and will likely be removed in the next release.
Generalized and shift-invert eigenvalue problems in ``scipy.sparse.linalg``
The sparse eigenvalue problem solver functions
``scipy.sparse.eigs/eigh`` now support generalized eigenvalue
problems, and all shift-invert modes available in ARPACK.
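For example, finding the generalized eigenvalues closest to a shift (a
minimal sketch, not part of the release notes)::

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 100
    A = sp.spdiags(np.arange(1, n + 1, dtype=float), 0, n, n).tocsc()
    M = sp.identity(n, format='csc')

    # Generalized problem A x = w M x, solved in shift-invert mode:
    # returns the eigenvalues closest to sigma.
    w, v = spla.eigsh(A, k=4, M=M, sigma=50.0)
    print(w)   # eigenvalues near 50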
Discrete-Time Linear Systems (``scipy.signal``)
Support for simulating discrete-time linear systems, including
``scipy.signal.dlsim``, ``scipy.signal.dimpulse``, and ``scipy.signal.dstep``,
has been added to SciPy. Conversion of linear systems from continuous-time to
discrete-time representations is also present via the
``scipy.signal.cont2discrete`` function.
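A small usage sketch (assuming the function names above)::

    from scipy import signal

    # Continuous-time transfer function 1/(s + 1), discretized with a
    # zero-order hold (the default) at dt = 0.1.
    sysd = signal.cont2discrete(([1.0], [1.0, 1.0]), dt=0.1)

    # Step response of the resulting discrete-time system.
    t, y = signal.dstep(sysd, n=50)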
Enhancements to ``scipy.signal``
A Lomb-Scargle periodogram can now be computed with the new function
``scipy.signal.lombscargle``.
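For instance (a rough sketch; note that ``lombscargle`` takes angular
frequencies)::

    import numpy as np
    from scipy.signal import lombscargle

    # Unevenly sampled sinusoid at 1.5 Hz.
    rng = np.random.RandomState(0)
    t = np.sort(10 * rng.rand(200))
    y = np.sin(2 * np.pi * 1.5 * t)

    freqs = np.linspace(0.1, 20.0, 500)   # angular frequencies
    pgram = lombscargle(t, y, freqs)      # peak expected near 2*pi*1.5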
The forward-backward filter function ``scipy.signal.filtfilt`` can now
filter the data along a given axis of an n-dimensional numpy array.
(Previously it only handled a 1-dimensional array.) Options have been
added to allow more control over how the data is extended before filtering.
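For example (a minimal sketch)::

    import numpy as np
    from scipy.signal import butter, filtfilt

    b, a = butter(3, 0.1)                  # 3rd-order low-pass
    data = np.random.randn(4, 1000)        # four signals, one per row

    # New in this release: filter along a chosen axis of an n-d array.
    smooth = filtfilt(b, a, data, axis=1)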
FIR filter design with ``scipy.signal.firwin2`` now has options to create
filters of type III (zero at zero and Nyquist frequencies) and IV (zero at
zero frequency).
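A sketch of a type IV design (assuming the ``antisymmetric`` keyword added
for this feature; the gain must be zero at zero frequency)::

    from scipy.signal import firwin2

    # 64 taps (even length) + antisymmetric taps -> a type IV filter.
    taps = firwin2(64, [0.0, 0.5, 1.0], [0.0, 1.0, 1.0], antisymmetric=True)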
Additional decomposition options (``scipy.linalg``)
A sort keyword has been added to the Schur decomposition routine
(``scipy.linalg.schur``) to allow the sorting of eigenvalues in
the resultant Schur form.
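For example (a minimal sketch)::

    import numpy as np
    from scipy.linalg import schur

    A = np.array([[0.5, 2.0],
                  [0.0, 3.0]])

    # 'iuc' sorts eigenvalues inside the unit circle to the top-left block;
    # sdim counts how many eigenvalues satisfy the condition (here 1).
    T, Z, sdim = schur(A, sort='iuc')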
Additional special matrices (``scipy.linalg``)
The functions ``hilbert`` and ``invhilbert`` were added to ``scipy.linalg``.
Enhancements to ``scipy.stats``
* The *one-sided form* of Fisher's exact test is now also implemented in
``stats.fisher_exact``.
* The function ``stats.chi2_contingency`` for computing the chi-square test of
independence of factors in a contingency table has been added, along with
the related utility functions ``stats.contingency.margins`` and
``stats.contingency.expected_freq``.
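For example (a minimal sketch)::

    import numpy as np
    from scipy import stats

    # 2x3 contingency table of observed frequencies.
    obs = np.array([[10, 20, 30],
                    [15, 15, 40]])
    chi2, p, dof, expected = stats.chi2_contingency(obs)
    print(chi2, p, dof)   # dof = (2-1)*(3-1) = 2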
Basic support for Harwell-Boeing file format for sparse matrices
Both read and write are supported through a simple function-based API, as well
as a more complete API to control number format. The functions may be found
in ``scipy.sparse.io``.
The following features are supported:
* Read and write sparse matrices in the CSC format
* Only real, symmetric, assembled matrices are supported (RUA format)
The maxentropy module is unmaintained, rarely used and has not been
functioning well for several releases. Therefore it has been deprecated for
this release and will be removed for scipy 0.11. Logistic regression in
scikits.learn is a good alternative for this functionality. The
``scipy.maxentropy.logsumexp`` function has been moved to ``scipy.misc``.
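For example, a quick check of the function at its new home (a minimal
sketch)::

    import numpy as np
    from scipy.misc import logsumexp   # moved here from scipy.maxentropy

    # log(sum(exp(a))) computed without intermediate overflow:
    a = np.log([1.0, 2.0, 3.0])
    print(logsumexp(a))   # log(6.0) ~= 1.7918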
There are similar BLAS wrappers in ``scipy.linalg`` and ``scipy.lib``. These
have now been consolidated as ``scipy.linalg.blas``, and ``scipy.lib.blas``
is deprecated.
Numscons build system
The numscons build system is being replaced by Bento, and will be removed in
one of the next scipy releases.
The deprecated name `invnorm` was removed from ``scipy.stats.distributions``;
this distribution is available as `invgauss`.
The following deprecated nonlinear solvers from ``scipy.optimize`` have been
removed:
- ``broyden_modified`` (bad performance)
- ``broyden1_modified`` (bad performance)
- ``broyden_generalized`` (equivalent to ``anderson``)
- ``anderson2`` (equivalent to ``anderson``)
- ``broyden3`` (obsoleted by new limited-memory broyden methods)
- ``vackar`` (renamed to ``diagbroyden``)
``scipy.constants`` has been updated with the CODATA 2010 constants.
``__all__`` lists have been added to all modules, which has cleaned up the
namespaces (particularly useful for interactive work).
An API section has been added to the documentation, giving recommended
guidelines and specifying which submodules are public and which aren't.
This release contains work by the following people (contributed at least
one patch to this release, names in alphabetical order):
* Jeff Armstrong +
* Matthew Brett
* Lars Buitinck +
* David Cournapeau
* FI$H 2000 +
* Michael McNeil Forbes +
* Matty G +
* Christoph Gohlke
* Ralf Gommers
* Yaroslav Halchenko
* Charles Harris
* Thouis (Ray) Jones +
* Chris Jordan-Squire +
* Robert Kern
* Chris Lasher +
* Wes McKinney +
* Travis Oliphant
* Fabian Pedregosa
* Josef Perktold
* Thomas Robitaille +
* Pim Schellart +
* Anthony Scopatz +
* Skipper Seabold +
* Fazlul Shahriar +
* David Simcha +
* Scott Sinclair +
* Andrey Smirnov +
* Collin RM Stocks +
* Martin Teichmann +
* Jake Vanderplas +
* Gaël Varoquaux +
* Pauli Virtanen
* Stefan van der Walt
* Warren Weckesser
* Mark Wiebe +
A total of 35 people contributed to this release.
People with a "+" by their names contributed a patch for the first time.
For np.gradient(), one can specify a sample distance for each axis to apply
to the gradient. But all this does is divide the gradient by the
sample distance. I could easily do that myself with the output from
gradient. Wouldn't it be more valuable to be able to specify the width of
the central difference (or is there another function that does that)?
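To illustrate (a minimal sketch):

import numpy as np

y = np.sin(np.linspace(0, 2 * np.pi, 100))
dx = 2 * np.pi / 99

# The sample-distance argument only rescales the unit-spaced result:
print(np.allclose(np.gradient(y, dx), np.gradient(y) / dx))   # True

# A wider central difference has to be done by hand, e.g. width 2*k:
k = 3
wide = (y[2 * k:] - y[:-2 * k]) / (2 * k * dx)   # interior points only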
On Fri, Nov 4, 2011 at 5:26 AM, Pierre GM <pgmdevlist(a)gmail.com> wrote:
> On Nov 03, 2011, at 23:07, Joe Kington wrote:
> > I'm not sure if this is exactly a bug, per se, but it's a very confusing
> > consequence of the current design of masked arrays…
> I would just add an "I think" between the "but" and "it's" before I could
> agree.
> > Consider the following example:
> > import numpy as np
> > x = np.ma.masked_all(10, dtype=np.float32)
> > print x
> > x[x > 0] = 5
> > print x
> > The exact results will vary depending on the contents of the empty memory
> > the array was initialized from.
> Not a surprise. But isn't it mentioned in the doc somewhere that using a
> masked array as index is a very bad idea ? And that you should always fill
> it before you use it as an array ? (Actually, using a MaskedArray as index
> used to raise an IndexError. But I thought it was a bit too harsh, so I
> dropped it).
Not that I can find in the docs (perhaps I just missed it?). At any rate,
it's not mentioned in the numpy.ma section on indexing.
The only mention of it is a comment in MaskedArray.__setitem__ where the
IndexError is commented out.
> ma.masked_all is an empty array with all its elements masked. Ie, you have
> an uninitialized ndarray as data, and a bool array of the same size, full
> of True. The operative word is here "uninitialized".
> > This wreaks havoc when filtering the contents of masked arrays (and
> > leads to hard-to-find bugs!). The mask of the array in question is altered
> > at random (or, rather, based on the masked values as well as the unmasked
> > ones).
> Once again, you're working on an *uninitialized* array. What you should
> really do is to initialize it first, e.g. by 0, or whatever would make
> sense in your field, and then work from that.
Sure, I shouldn't have used that as the example.
My point was that it's counter-intuitive that something like "x[x > 0] = 0"
alters the mask of x based on the values of _masked_ elements. How it's
initialized is irrelevant (though, of course, it wouldn't be semi-random if
it were initialized in another way).
> > I can see the reasoning behind the way it works. It makes sense that
> > "x > 0" returns a masked boolean array with potentially several elements
> > masked, as well as the unmasked elements greater than 0.
> Well, "x > 0" is also a masked array, with its mask full of True. Not very
> usable by itself, and especially *not* for indexing.
> > However, wouldn't it make more sense to have MaskedArray.__setitem__
> > only operate on the unmasked elements of the "indx" passed in (at least in
> > the case where the assigned "value" isn't a masked array)?
> Normally, that should be the case. But you're not working in "normal"
> conditions, here. A bit like trying to boil water on a stove with a plastic
> pot.
"x[x > threshold] = something" is a very common idiom for ndarrays.
I think most people would find it surprising that this operation doesn't
ignore the masked values.
I noticed this because one of my coworkers was complaining that a piece of
my code was "messing up" their masked arrays. I'd never tested it with
masked arrays, but it took me ages to find, just because I wasn't looking
in places where I was just using common idioms. In this particular case,
they'd initialized it with "masked_all", so it effectively altered the mask
of the array at random. Regardless of how it was initialized, though, it
is surprising that the mask of "x" is changed based on masked values.
I just think it would be useful for it to be documented.
Forgive me if this is already a well-known oddity of masked arrays. I hadn't
seen it before, though.
I'm not sure if this is exactly a bug, per se, but it's a very confusing
consequence of the current design of masked arrays...
Consider the following example:
import numpy as np
x = np.ma.masked_all(10, dtype=np.float32)
x[x > 0] = 5
The exact results will vary depending on the contents of the empty memory the
array was initialized from.
This wreaks havoc when filtering the contents of masked arrays (and leads
to hard-to-find bugs!). The mask of the array in question is altered at
random (or, rather, based on the masked values as well as the unmasked ones).
Of course, once you're aware of this, there are a number of workarounds
(namely, filling the array or explicitly operating on "x.data" instead of
"x").
I can see the reasoning behind the way it works. It makes sense that "x >
0" returns a masked boolean array with potentially several elements masked,
as well as the unmasked elements greater than 0.
However, wouldn't it make more sense to have MaskedArray.__setitem__ only
operate on the unmasked elements of the "indx" passed in (at least in the
case where the assigned "value" isn't a masked array)?
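To make the surprise and the usual workaround concrete, here is a
deterministic sketch (explicit data and mask instead of masked_all, so the
result doesn't depend on uninitialized memory):

import numpy as np

x = np.ma.masked_array([9.0, -1.0, 2.0], mask=[True, False, False])
# "x > 0" is itself a masked array, so the assignment below also consults
# the underlying .data of the masked slot (9.0 here) -- the masked element
# may be assigned to and unmasked, which is the surprising part.
x[x > 0] = 5
print(x)

# Workaround: fill the condition first so masked slots are excluded.
y = np.ma.masked_array([9.0, -1.0, 2.0], mask=[True, False, False])
y[(y > 0).filled(False)] = 5
print(y)   # [-- -1.0 5.0] -- the masked slot is untouched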
I am using MKL 10.1, Intel Cluster Toolkit 11/069, OS RHEL 5.2 x86_64,
Python 2.6; the processor is an Intel Xeon.
The numpy version is 1.6.0.
My numpy.test run is hanging at the point below:
Test whether equivalent subarray dtypes hash the same. ... ok
Test whether different subarray dtypes hash differently. ... ok
Test some data types that are equal ... ok
Test some more complicated cases that shouldn't be equal ... ok
Test some simple cases that shouldn't be equal ... ok
test_single_subarray (test_dtype.TestSubarray) ... ok
test_einsum_errors (test_einsum.TestEinSum) ... ok
test_einsum_sums_cfloat128 (test_einsum.TestEinSum) ...
any pointers for this?
I am getting the following error:
python -c 'import numpy;numpy.matrix([[1, 5, 10], [1.0, 3j, 4]],
MKL FATAL ERROR: Cannot load libmkl_lapack.so
I have installed numpy 1.6.0 with python 2.6.
I have the Intel Cluster Toolkit installed on my system (version 11/069 and
MKL 10.1). The machine has an Intel Xeon processor and runs RHEL 5.2 x86_64.
On behalf of Spyder's development team
(http://code.google.com/p/spyderlib/people/list), I'm pleased to
announce that Spyder v2.1 has been released and is available for
Windows XP/Vista/7, GNU/Linux and MacOS X:
Spyder is a free, open-source (MIT license) interactive development
environment for the Python language with advanced editing, interactive
testing, debugging and introspection features. Originally designed to
provide MATLAB-like features (integrated help, interactive console,
variable explorer with GUI-based editors for dictionaries, NumPy
arrays, ...), it is strongly oriented towards scientific computing and
software development.
Thanks to the `spyderlib` library, Spyder also provides powerful
ready-to-use widgets: embedded Python console (example:
http://packages.python.org/guiqwt/_images/sift3.png), NumPy array
editor (example: http://packages.python.org/guiqwt/_images/sift2.png),
dictionary editor, source code editor, etc.
Description of key features with tasty screenshots can be found at:
This release represents a year of development since v2.0 and
introduces major enhancements and new features:
* Large performance and stability improvements
* PySide support (PyQt is no longer exclusively required)
* New profiler plugin (thanks to Santiago Jaramillo, a new contributor)
* Experimental support for IPython v0.11+
* And many other changes: http://code.google.com/p/spyderlib/wiki/ChangeLog
On Windows platforms, Spyder is also available as a stand-alone
executable (don't forget to disable UAC on Vista/7). This all-in-one
portable version is still experimental (for example, it does not embed
sphinx -- meaning no rich text mode for the object inspector) but it
should provide a working version of Spyder for Windows platforms
without having to install anything else (except Python 2.x itself, of
course).
Don't forget to follow Spyder updates/news:
* on the project website: http://code.google.com/p/spyderlib/
* and on our official blog: http://spyder-ide.blogspot.com/
Last, but not least, we welcome any contribution that helps make
Spyder an efficient scientific development/computing environment. Join
us to help create your favourite environment!
On 2011-11-03 04:22, numpy-discussion-request(a)scipy.org wrote:
> Message: 1
> Date: Wed, 2 Nov 2011 22:20:15 -0500
> From: Benjamin Root<ben.root(a)ou.edu>
> Subject: Re: [Numpy-discussion] in the NA discussion, what can we
> agree on?
> To: Discussion of Numerical Python<numpy-discussion(a)scipy.org>
> Content-Type: text/plain; charset="iso-8859-1"
> On Wednesday, November 2, 2011, Nathaniel Smith<njs(a)pobox.com> wrote:
>> Hi Benjamin,
>> On Wed, Nov 2, 2011 at 5:25 PM, Benjamin Root<ben.root(a)ou.edu> wrote:
>>> I want to pare this down even more. I think the above lists make too
>>> many unneeded extrapolations.
>> Okay. I found your formatting a little confusing, so I want to make
>> sure I understood the changes you're suggesting:
>> For the description of what MISSING means, you removed the lines:
>> - Compatibility with R is valuable
>> - To avoid user confusion, ideally it should *not* be possible to
>> 'unmask' a missing value, since this is inconsistent with the "missing
>> value" metaphor (e.g., see Wes's comment about "leaky abstractions")
>> And you added the line:
>> + Assigning MISSING is destructive
>> And for the description of what IGNORED means, you removed the lines:
>> - Some memory overhead is inevitable and acceptable
>> - Compatibility with R neither possible nor valuable
>> - Ability to toggle the IGNORED state of a location is critical, and
>> should be as convenient as possible
>> And you added the lines:
>> + IGNORE is non-destructive
>> + Must be competitive with np.ma for speed and memory (or else users
>> would just use np.ma)
>> Is that right?
>> Assuming it is, my thoughts are:
>> By R compatibility, I specifically had in mind in-memory
>> compatibility. rpy2 provides a more-or-less seamless within-process
>> interface between R and Python (and specifically lets you get numpy
>> views on arrays returned by R functions), so if we can make this work
>> for R arrays containing NA too then that'd be handy. (The rpy2 author
>> requested this in the last discussion here:
>> When it comes to disk formats, then this doesn't matter so much, since
>> IO routines have to translate between different representations all
>> the time anyway.
> Interesting, but I still have to wonder if that should be on the wishlist
> for MISSING. I guess it would depend on whether people would be
> fully converting from R or gradually transitioning from it? That is
> something that I can't answer.
I probably do not have all possible use-cases but what I'd think of as
the most common is: use R stuff just straight out of R from Python. Say
that you are doing your work in Python and read about some statistical
method for which an implementation in R exists (but not in
Python/numpy). You can just pass your numpy arrays or vectors to the
relevant R function(s) and retrieve the results in a form directly
usable by numpy (without having the data copied around). Should
performance become an issue, and that method be of crucial importance,
you will probably want to reimplement it (in C or Cython, for example).
Otherwise you could pick R's phenomenal toolbox without much effort and
keep those calls to R as part of your code.
In my experience, the latter would be the most frequent.
Get some compatibility for the NA "magic" values, and that possible
coupling between R and numpy becomes even better, by preventing one side
or the other from mistaking them for non-NA values.
>> I take the replacement of my line about MISSING disallowing unmasking
>> and your line about MISSING assignment being destructive as basically
>> expressing the same idea. Is that fair, or did you mean something
>> different?
> I am someone who wants to get to the absolute core of ideas. Also, this
> expression cleanly delineates the differences as binary.
> By expressing it this way, we also shy away from implementation details.
> For example, unmasking can be programmatically prevented for MISSING, while
> it could be implemented by other indirect means for IGNORE. Not that those
> are the preferred ways, only that the phrasing is more flexible and less
> tied to implementation details.
>> Finally, do you think that people who want IGNORED support care about
>> having a convenient API for masking/unmasking values? You removed that
>> line, but I don't know if that was because you disagreed with it, or
>> were just trying to simplify.
> See previous.
>>> Then, as a third-party module developer, I can tell you that having
>>> separate and independent ways to detect "MISSING"/"IGNORED" would likely
>>> make development more difficult, and it would greatly benefit from a
>>> common (or easily combinable) method of identification.
>> Right, sorry... I didn't forget, and that's part of what I was
>> thinking when I described the second approach as keeping them as
>> *mostly*-separate interfaces... but I should have made it more
>> explicit! Anyway, yes:
>> 4) There is consensus that whatever approach is taken, there should be
>> a quick and convenient way to identify values that are MISSING,
>> IGNORED, or both. (E.g., functions is_MISSING, is_IGNORED,
>> is_MISSING_or_IGNORED, or some equivalent.)
> Ben Root
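[Purely hypothetical sketch of what the identification functions in point 4
could look like -- none of these names or attributes exist in numpy; they
only illustrate the kind of API being asked for:]

import numpy as np

def is_missing(arr):
    # assumption: a MISSING-aware array exposes a boolean 'missing_mask'
    return getattr(arr, 'missing_mask', np.zeros(np.shape(arr), bool))

def is_ignored(arr):
    # assumption: an IGNORED-aware array exposes a boolean 'ignore_mask'
    return getattr(arr, 'ignore_mask', np.zeros(np.shape(arr), bool))

def is_missing_or_ignored(arr):
    return is_missing(arr) | is_ignored(arr)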