A disconnected numarray rant
Hi,

I'm taking a 1 month break from computers (i.e. I will be completely off-line), and I have to catch a train in an hour; but I've recently bitten the bullet and made a matrix class I've been using for some time work with numarray. I've written down a number of things that occurred to me while I was doing it, including some things which I think are bugs in numarray, so I thought at least posting the bugs would be a useful service; the rest is very raw and essentially unedited cut-and-paste of these notes -- sorry about that and I hope it doesn't contain anything particularly offensive.

P.S. just dumped the code for the matrix class (nummat) at http://www.dcs.ex.ac.uk/~aschmolc/Stuff/

'as

The following are my notes:

Things that fairly clearly seem to be bugs:

- numarray.Int32 etc. can't be pickled
- ``a = array(1+0j); a.imag = a.real * 10`` => IndexError
- array(0, type=Float64) + 1e3000 => `inf` with right error modes but array(0, type=Float32) + 1e3000 => `OverflowError`
- numarray.array(10)/numarray.array(0) => 0
- numarray.array(10000000000000L) => array(1316134912)
- numarray.where(0,1,0) => array([0])
- l = [1,2,3]; numarray.put(l,numarray.array([1,2,0]),[0,0,0]); l => [1, 2, 3]
  a = array([1,2,3]); numarray.put(a,numarray.array([1,2,0]),[0,0,0]); a => array([0, 0, 0])
- repr(numarray.array([],typecode='i')) (etc. etc.) => "numarray.array([])"
- getattr(array([1,2,3]), '_aligned') => SystemError
- obscure: numarray.where(0, matrix(568, convert_scalars=True),2) => ValueError (tries __len__ which fails, as len(array(568)) also fails)

Numeric incompatibilities (that are either undocumented or bug-like)

- numarray.array('a', typecode='O') => TypeError (object arrays)
- for extra fun try: numarray.array(1, type=numarray.Object) => RuntimeError, something entirely different
- nonzero is completely incompatible
- shape(None) etc. no longer works (IMHO a bug)
- cross_correlate & average missing
- left_shift et al missing
- numarray.sqrt(a,a) is None (*not* the result, as it used to be)
- num.put(a, [0,1,2,3], [10,20]) style behavior seems unavailable (without numarray.numeric)
  put(array([[ 0., 1., 2.], [ 3., 4., 5.]]), [1, 4], [10,40]) fails
- boolean testing (not even bool(array(0)) works; I'm not sure this is good)
- Generally different handling of rank0-arrays; e.g. ``type(num.array(1.0) + 0) is float``; one potentially very nasty gotcha is inplace operations (e.g. a**=2), which have totally different semantics for python scalars and rank0 arrays and which, unlike AttributeErrors on ``a.shape``, can lead to nasty bugs in corner cases (e.g. when a reduction just infrequently yields scalar ``a``) -- I think this should be mentioned in a gotchas section (another possible entry would be the need to use .copy() to **save** memory on slicing and 1xN, Nx1 matrices versus vectors (people are not used to thinking properly about rank from mathematical training or matlab exposure)).
- asarray downcasts arrays (e.g.: asarray(array([1.,2.,3.]),'i'))
- numarray.ones(-5) => MemoryError (ValueError would be nicer)
- numarray.ones(2.0), numarray.ones([2]) fail (cf. numarray.range(2.0))

    b=num.array([[1,2,3,4],[5,6,7,8]]*2)
    assert eq(num.diagonal(b), [1,6,3,8])
    assert eq(num.diagonal(b, -1), [5,2,7])
    c = num.array([b,b])
    assert eq(num.diagonal(c,1), [[2,7,4], [2,7,4]])

- no a.toscalar() !!!
- matrixmultiply in the docs
- what's the point of swapaxes (i.e. why not have a generalized in-place transpose?)
- what's the point of innerproduct?
- indexing by a list is different from indexing by tuple (I haven't had time to look closely at the docs whether that's intentional)
- doesn't know about Numeric's bizarre '\x0b' typecode
- numarray.sqrt.reduce([]) raises (sensibly) TypeError, not ValueError
- len(array(1)) or array(1)[0] won't work anymore (understandable, but should be documented)
- (should maximum, minimum reduce to -inf and inf?)
- <built-in method reduce of _BinaryUFunc object at 0x82dfc9c> is not a very helpful repr; should be possible to get to the ufunc itself
- as in Numeric numarray.maximum.reduce(numarray.array([0,-0.])) => -0.0
- __array__ protocol no longer supported (how can a non-derived class convert itself efficiently to an array?)

Documentation Gotchas

- p. 34 IMO row vector is used incorrectly; row and column vectors are really matrices (i.e. have rank 2) so ``array([[1,2,3]])`` would be a row vector
- No proper explanation of differences between Numeric and numarray, or numarray.numeric module differences to proper (e.g. argmin)
- No migration and best-practice advice (e.g. there should be a standard way for packages which work with both numarray and numeric as backends to let the user choose his preference; how about setting an environment var NumPy or something?)

Waffle
------

- there *really* ought to be an array equality function (with optional tolerance); it's quite difficult to get right for a normal user (nans; zero-size arrays etc.) and it's often required, especially for testing
- rank preserving reduction as an option would be nice -- e.g. to subtract out or divide by the reduced portion (which currently won't e.g. work for columns without adding a unit-dimension by hand).

Design

The (AFAICS) benefit-free but downside-rich introduction of `type`
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

Is there any reason that Typecode objects that compare as desired to the relevant strings ("i", "d") wouldn't have done? Now there is an explosion and confusion of interfaces -- some numpy code will now only accept type(code)s as "typecode" keyword parameter (even in numarray! see numarray.mlab!) and other stuff

Never mind that type already is a highly overused word in the python world.

The big method bloat.
'''''''''''''''''''''

As it says in the Numeric manual introductions there were "good reasons" for "very few array methods" -- now there are **56** public methods and 8 public attributes (public == not starting with '_'); of those 56 methods about 11 are accessors and of the rest about half are redundant or worse (i.e. they either also exist as numarray functions (argmin, argmax, diagonal, ...) or they really ought to be functions (mean, stddev) or they are quite confusing (``a.min``, ``a.max`` which behave quite differently from ``a.argmin`` and ``a.argmax``, never mind ``numarray.minimum``) or simply utterly pointless (``a.nelements`` == ``a.size``)).

- argmin, argmax : what's wrong with numarray.argmin, numarray.argmax??? Why do argmin/argmax and max/min have completely different interfaces??? If there really is a need for these (there isn't) if anything a.min and a.max should be called a.flatmin, a.flatmax
- diagonal, mean, nelements, nonzero, ...
- perversely the **only** function that I can think of that could have sensibly become a method hasn't: ``put`` (it used to work only on arrays under Numeric and not without reason, so making it a method would have been sensible; numarray.put of course also "works" on non-arrays, it just doesn't do anything with them)

Test Code
'''''''''

numtest.py doesn't inspire full confidence (it's about 1000 lines of actual code but it doesn't seem that clearly structured and AFAICT contains no single loop (and that despite the diversity of shapes, types etc. that exist in numarray -- why not try something slightly more systematic?)).
Hi Alexander,

Thanks for taking the time to provide us with feedback. I've responded to many of your points below.

[and in the interest of keeping the text bloat down, I've interjected my own comments in brackets--Perry]

On Tue, 2004-10-12 at 05:37, Alexander Schmolck wrote:
Hi,
I'm taking a 1 month break from computers (i.e. I will be completely off-line), and I have to catch a train in an hour; but I've recently bitten the bullet and made a matrix class I've been using for some time work with numarray; I've written down a number of things that occurred to me while I was doing it, including some things which I think are bugs in numarray, so I thought at least posting the bugs would be a useful service; the rest is very raw and essentially unedited cut-and-paste of these notes -- sorry about that and I hope it doesn't contain anything particularly offensive.
P.S. just dumped the code for the matrix class (nummat) at http://www.dcs.ex.ac.uk/~aschmolc/Stuff/
'as
The following are my notes:
Things that fairly clearly seem to be bugs:
- numarray.Int32 etc. can't be pickled
Known limitation, but OK. Arrays can be pickled, as can Numeric typecodes so I'm not sure how critical this omission is.
- ``a = array(1+0j); a.imag = a.real * 10`` => IndexError
- array(0, type=Float64) + 1e3000 => `inf` with right error modes but array(0, type=Float32) + 1e3000 => `OverflowError`
- numarray.array(10)/numarray.array(0) => 0
- numarray.array(10000000000000L) => array(1316134912)
- numarray.where(0,1,0) => array([0])
There seems to be an infinity of rank-0 issues and so little justification for having them that at one point we considered ripping them out altogether. Noted, but low priority. [Amen. If I had known the problems that rank-0 arrays would cause I think I would have excluded them. I'm not sure I see the need for them now that coercion rules have changed and there are helper functions to change scalars into rank-1 len-1 arrays, which serve almost all other purposes. I'm interested in seeing what real purpose they serve now (I understand the backward compatibility issue, but backward compatibility is not the be all and end all for numarray; more on that later)]
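The value in the `numarray.array(10000000000000L)` report is consistent with the Python long being silently truncated to its low 32 bits. A minimal sketch in plain Python (no numarray needed; `wrap_to_int32` is a hypothetical helper written for this note, not a numarray function) reproduces the number:

```python
def wrap_to_int32(value):
    """Keep only the low 32 bits of an integer and read them as a signed int32.

    Hypothetical helper for illustration; this is not part of numarray.
    """
    low = value & 0xFFFFFFFF          # discard everything above bit 31
    if low >= 2**31:                  # top bit set means a negative int32
        return low - 2**32
    return low

# The value from the bug report, written without the Python 2 'L' suffix.
truncated = wrap_to_int32(10000000000000)
print(truncated)
```

If that is indeed what happens, the fix would presumably be a range check that raises OverflowError instead of wrapping.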
- l = [1,2,3]; numarray.put(l,numarray.array([1,2,0]),[0,0,0]); l => [1, 2, 3]
Should raise a TypeError I guess.
a = array([1,2,3]); numarray.put(a,numarray.array([1,2,0]),[0,0,0]); a => array([0, 0, 0])
I don't see what's wrong here.
- repr(numarray.array([],typecode='i')) (etc. etc.) => "numarray.array([])"
Zero length arrays are rather like rank-0 arrays: low priority. Agreed... this is a small wart.
- getattr(array([1,2,3]), '_aligned') => SystemError
Interesting. I've been thinking about ripping out the _align and _contiguous self-test hacks for a long time. You've made up my mind.
- obscure: numarray.where(0, matrix(568, convert_scalars=True),2) => ValueError (tries __len__ which fails, as len(array(568)) also fails)
I think this may boil down to "no where() for object arrays". numarray.where() can't handle object arrays and there is no numarray.objects.where(). Not implemented yet.
Numeric incompatibilities (that are either undocumented or bug-like)
The best Numeric compatibility in numarray comes from: import numarray.numeric as Numeric It's still not perfect, but it is more compatible than ordinary numarray.
- numarray.array('a', typecode='O') => TypeError (object arrays)
- for extra fun try: numarray.array(1, type=numarray.Object) => RuntimeError, something entirely different
Object arrays in numarray do not have the synergy they have in Numeric. In particular, numarray.array() can't create them, only numarray.objects.array(). [At the time we added object arrays, we noticed that they were not safe in Numeric; that is, Numeric was not properly handling reference counts of objects in arrays for at least some operations and it was possible to segfault object arrays. This may have changed since then; we haven't had a chance to check the current status. But the point is that handling object arrays safely is a lot more than just loading them with object pointers. Any function that can set values in arrays needs to handle their refcounts, and that isn't all that trivial. We took a short cut of using a Python implementation for object arrays that doesn't have all the old functionality, but also didn't have the problems that they did at the time.]
- nonzero is completely incompatible
numarray.numeric covers this. numarray's nonzero() is more powerful, capable of handling multidimensional arrays, so it returns a tuple of values rather than a single value. It's unfortunate that we chose to use the name nonzero() for the "new" function; it has the right interface and the wrong name. Keep in mind though, our compatibility goals have grown immensely since we started.
- shape(None) etc. no longer works (IMHO a bug)
This may be related to the object array synergy. I think numarray.asarray() is the problem here, since it doesn't know how to create object arrays.
- cross_correlate & average missing
I think cross_correlate is in numarray.convolve.correlate. It was a conscious choice not to put it in core numarray. Average has never been implemented and should be, especially since it has different semantics than the mean() method.
- left_shift et al missing
These were renamed lshift and rshift. Note that << works fine. Synonyms should probably be added.
- numarray.sqrt(a,a) is None (*not* the result, as it used to be)
What do you want here? What we have now is, IMO, correct. [Amen. This was intentionally changed from Numeric.]
- num.put(a, [0,1,2,3], [10,20]) style behavior seems unavailable (without numarray.numeric)
I wasn't exactly sure what the expected behavior was for this, but guessed it was some kind of repeat. If that's what the behavior was, Perry and I don't really like it. Besides, numarray.numeric.put *is* Numeric.put, modulo numarray underpinnings.
put(array([[ 0., 1., 2.], [ 3., 4., 5.]]), [1, 4], [10,40]) fails
numarray.put() does have different semantics for multi-dimensional destinations... you need multi-dimensional indexes (i.e. a tuple of index arrays). Again, there's now numarray.numeric.put().
- boolean testing (not even bool(array(0)) works; I'm not sure this is good)
[I am. This was a clear and explicit decision to not replicate Numeric behavior. I'm convinced that it is the right decision. There is just too much confusion about what the truth value of an array should be. Helper functions should be used to make it unambiguous.]
- Generally different handling of rank0-arrays; e.g. ``type(num.array(1.0) + 0) is float``; one potentially very nasty gotcha are inplace operations (e.g. a**=2) which have totally different semantics for python scalars and rank0 arrays, which, unlike Attribute errors on ``a.shape``, can lead to nasty bugs in corner cases (e.g. when a reduction just infrequently yields scalar ``a``) -- I think this should be mentioned in a gotchas section
We have areduce() for this case, which always returns an array.
(another possible entry would be the need to use .copy() to **save** memory on slicing and 1xN, Nx1 matrices versus vectors (people are not used to thinking properly about rank from mathematical training or matlab exposure)).
[You will need to elaborate about what you mean here. E.g., as to the first: I'm guessing you mean when a slice is taken and then the original array is deleted. But it isn't clear.]
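The in-place gotcha can be illustrated without numarray. In the sketch below, `RankZero` is a hypothetical stand-in for a mutable rank-0 array (it is not numarray code); the only point is that `a **= 2` rebinds a local name when `a` is a Python float, but mutates the caller's object when `a` defines `__ipow__`.

```python
class RankZero:
    """Hypothetical stand-in for a mutable rank-0 array (not numarray code)."""

    def __init__(self, value):
        self.value = value

    def __ipow__(self, exponent):
        # In-place power: modify this object and return it, as arrays do.
        self.value **= exponent
        return self


def square_in_place(a):
    a **= 2        # float: rebinds the local name; RankZero: mutates the argument
    return a


x = 3.0
square_in_place(x)     # the caller's float is untouched

r = RankZero(3.0)
square_in_place(r)     # the caller's object has been changed

print(x, r.value)
```

A function that usually receives an array but occasionally receives a scalar from a reduction would therefore behave differently in the two cases without raising any error.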
- asarray downcasts arrays (e.g.: asarray(array([1.,2.,3.]),'i'))
True enough. Is there some reason why the method should silently succeed (I know we wanted that) and the function should not?
- numarray.ones(-5) => MemoryError (ValueError would be nicer)
Easy to change.
- numarray.ones(2.0),
This fails, and that's fine by me. The idea of floating point shapes seems bogus.
numarray.ones([2])
AFAIK, this works, and should work.
fail (cf. numarray.range(2.0))
IMHO, arange() is a special case and not really equivalent to numarray.ones().
    b=num.array([[1,2,3,4],[5,6,7,8]]*2)
    assert eq(num.diagonal(b), [1,6,3,8])
    assert eq(num.diagonal(b, -1), [5,2,7])
    c = num.array([b,b])
    assert eq(num.diagonal(c,1), [[2,7,4], [2,7,4]])

- no a.toscalar() !!!
a.toscalar() is written a[()] in numarray. [This is one method that shouldn't be there IMO. What would people expect it to do for arrays with len>1 ?]
- matrixmultiply in the docs
OK.
- what's the point of swapaxes (i.e. why not have a generalized in-place transpose?)
It's a very common function in implementation of numarray/Numeric. [In many cases it is far easier to use than a generalized transpose (which does exist, but requires all axes to be explicitly given)]
- what's the point of innerproduct?
Compatibility. [For a while the flavor is: "dammit, why aren't you compatible?" Now it's: "dammit, why are you compatible?"]
- indexing by a list is different from indexing by tuple (I haven't had time to look closely at the docs whether that's intentional)
It's intentional. Indexing by a list is "array" indexing. Indexing by a tuple is not. Thus, a 3D array by [1,2,3] is pulling out 2D blocks, while (1,2,3) is pulling out a single scalar. [In particular, tuples have a special meaning for indexing; this distinction is unavoidable since it is a Python language issue.]
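The "Python language issue" can be seen in plain Python, independent of any array package. In this sketch, `ShowKey` is a hypothetical helper class that simply returns whatever key Python hands to `__getitem__`:

```python
class ShowKey:
    """Hypothetical helper: returns the key object Python passes to __getitem__."""

    def __getitem__(self, key):
        return key


probe = ShowKey()

comma_key = probe[1, 2, 3]      # Python packs the three indices into one tuple
tuple_key = probe[(1, 2, 3)]    # an explicit tuple arrives as the same key
list_key = probe[[1, 2, 3]]     # a list arrives as a single list object

print(type(comma_key), type(tuple_key), type(list_key))
```

Since `a[1, 2, 3]` and `a[(1, 2, 3)]` reach `__getitem__` as the same tuple, an array type cannot give the explicit tuple a different meaning, whereas a list is free to mean "array indexing".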
- doesn't know about Numeric's bizarre '\x0b' typecode
Me either. Should we add this? [Not unless there is a good reason. What's it for? Why are you using it (particularly since you called it bizarre)?]
- numarray.sqrt.reduce([]) raises (sensibly) TypeError, not ValueError
Got lucky I guess.
- len(array(1)) or array(1)[0] won't work anymore (understandable, but should be documented)
OK.
- (should maximum, minimum reduce to -inf and inf?)
Don't they?
- <built-in method reduce of _BinaryUFunc object at 0x82dfc9c> is not a very helpful repr; should be possible to get to the ufunc itself
Doesn't this comment fly in the face of Python itself? [I imagine it is possible, but why? repr(dir) doesn't give you a usable function creator, nor does it work in Numeric.]
- as in Numeric numarray.maximum.reduce(numarray.array([0,-0.])) => -0.0
Talk about fine points... noted. I think the problem is that 0.0 == -0.0, so there's no way for the reduction to get it right without adding special code to look for this case, and that isn't gonna happen without a strong case being made. [Again, a very good case needs to be made for handling this. I doubt that it is important to many, and as Todd mentions, not easy to handle.]
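The difficulty with signed zeros shows up in plain Python floats as well. This is a small sketch, not numarray code; `is_negative_zero` is a hypothetical helper that uses the sign bit, since comparison alone cannot tell the two zeros apart:

```python
import math


def is_negative_zero(x):
    """Hypothetical helper: True only for -0.0, detected via the sign bit."""
    return x == 0.0 and math.copysign(1.0, x) < 0


zeros_compare_equal = (0.0 == -0.0)

# A comparison-based maximum keeps whichever zero it saw first,
# because neither zero is greater than the other.
first = max(0.0, -0.0)
second = max(-0.0, 0.0)

print(zeros_compare_equal, is_negative_zero(first), is_negative_zero(second))
```

A reduction that always prefers +0.0 would need an explicit sign-bit check like the one above on every tie, which is the "special code" referred to in the reply.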
- __array__ protocol no longer supported (how can a non-derived class convert itself efficiently to an array?)
Maybe an old-timer can explain how this worked for Numeric. I think this is only partially implemented in numarray and that maybe we need to add a check for an __array__() method to numarray.array().
Documentation Gotchas
- p. 34 IMO row vector is used incorrectly; row and column vectors are really matrices (i.e. have rank 2) so ``array([[1,2,3]])`` would be a row vector
Sounds reasonable.
- No proper explanation of differences between Numeric and numarray, or numarray.numeric module differences to proper (e.g. argmin)
If there is, I don't know where it is. Noted, but I'm not really an encyclopedia of these facts myself.
- No migration and best-practice advice (e.g. there should be a standard way for packages which work with both numarray and numeric as backends to let the user choose his preference; how about setting an environment var NumPy or something?)
We're just working this out ourselves. [Let me elaborate more. We haven't really had much experience yet porting tons of Numeric code (MA is about the only example). We are working on scipy now so I expect that in a few months we will know much better what the most important porting issues are. At the moment, this is better documented by others.]
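One way the environment-variable suggestion could look is sketched below. Everything here is an assumption made for illustration: the variable name `NUMERIC_BACKEND`, the function `choose_backend`, and the default are hypothetical, and neither numarray nor Numeric defines such a convention.

```python
import os

# Backend module names a package might be willing to work with.
SUPPORTED_BACKENDS = ("numarray", "Numeric")


def choose_backend(environ=None, default="numarray"):
    """Return the backend name requested via the (hypothetical)
    NUMERIC_BACKEND environment variable, or the default if unset."""
    if environ is None:
        environ = os.environ
    name = environ.get("NUMERIC_BACKEND", default)
    if name not in SUPPORTED_BACKENDS:
        raise ValueError("unsupported array backend: %r" % (name,))
    return name


# A package would then import the chosen module by name, for example
# with importlib.import_module(choose_backend()).
print(choose_backend({"NUMERIC_BACKEND": "Numeric"}))
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.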
Waffle [meaning?]
------
- there *really* ought to be an array equality function (with optional tolerance); it's quite difficult to get right for a normal user (nans; zero-size arrays etc.) and it's often required, especially for testing
You're right. Want to submit one? [Make sure it isn't dependent on the underlying C compiler's libraries for testing floating point special values!]
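To make the corner cases concrete, here is a rough pure-Python sketch of such a comparison for flat sequences of floats. It is not a proposed numarray API: `sequences_close` and its parameters are hypothetical names, and a real version would also have to deal with shapes and types. NaN is detected with `x != x`, which relies only on IEEE comparison semantics.

```python
import math


def _is_nan(x):
    # NaN is the only float value that is not equal to itself.
    return x != x


def sequences_close(a, b, rel_tol=1e-9, abs_tol=0.0, nan_equal=True):
    """Hypothetical sketch: compare two flat sequences of floats with tolerance.

    Zero-length sequences compare equal to each other; NaNs in matching
    positions compare equal only when nan_equal is true.
    """
    a, b = list(a), list(b)
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if _is_nan(x) or _is_nan(y):
            if not (nan_equal and _is_nan(x) and _is_nan(y)):
                return False
            continue
        if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


nan = float("nan")
print(sequences_close([], []), sequences_close([1.0, nan], [1.0, nan]))
```

Whether NaNs in matching positions should count as equal is a design choice; for test suites it usually is what one wants, hence the default in this sketch.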
- rank preserving reduction as an option would be nice -- e.g. to subtract out or divide by the reduced portion (which currently won't e.g. work for columns without adding a unit-dimension by hand).
Sounds like an interesting idea, but also method bloat.
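What a rank preserving reduction buys can be sketched with nested lists standing in for 2-D arrays. The names `sum_rows`, `keep_rank` and `divide_rows` are hypothetical and exist only for this illustration; they do not correspond to any numarray function or keyword.

```python
def sum_rows(matrix, keep_rank=False):
    """Sum each row of a list-of-lists matrix.

    With keep_rank=True the result stays 2-D (N x 1) instead of
    collapsing to a flat list of N totals.
    """
    totals = [sum(row) for row in matrix]
    if keep_rank:
        return [[t] for t in totals]
    return totals


def divide_rows(matrix, column):
    """Divide every row of matrix by the matching entry of an N x 1 column."""
    return [[x / col[0] for x in row] for row, col in zip(matrix, column)]


m = [[1.0, 3.0],
     [2.0, 2.0]]

# Because the reduced result keeps its unit dimension, it lines up with
# the rows of m and no reshaping by hand is needed.
normalized = divide_rows(m, sum_rows(m, keep_rank=True))
print(normalized)
```

An option on the existing reduction, rather than a new method, would address the method bloat concern raised in the reply.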
Design
The (AFAICS) benefit-free but downside-rich introduction of `type`
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Is there any reason that Typecode objects that compare as desired to the relevant strings ("i", "d") wouldn't have done? Now there is an explosion and confusion of interfaces -- some numpy code will now only accept type(code)s as "typecode" keyword parameter (even in numarray! see numarray.mlab!) and other stuff
Never mind that type already is a highly overused word in the python world.
Personally, I like type because it's succinct and we have type objects, not single character codes. More importantly, Perry likes type, and the bottom line is that it's his shot to call and he's called it. [We wrestled with this a while. Given that the representation of the type had changed from a character code, typecode is clearly misleading and inappropriate. It is there only for backward compatibility; for new code to be used under numarray only, people shouldn't use it. Type certainly seemed by far the most descriptive and accurate term. It does have the drawback of overloading the type function. Other considerations were things like atype, but type is what we went with.]
The big method bloat.
'''''''''''''''''''''
As it says in the Numeric manual introductions there were "good reasons" for
I actually don't buy the reasons myself. Some methods are natural, convenient, and good so I need to hear more voices arguing this point before I'll budge. Clearly there is *some* bloat, but identifying what to ax is more difficult. I suppose we could do a vote to clean this up.
"very few array methods" -- now there are **56** public methods and 8 public attributes (public == not starting with '_'); of those 56 methods about 11 are accessors and of the rest about half are redundant or worse (i.e. they either also exist as numarray functions (argmin, argmax, diagonal, ...) or
Which of the public attributes do you have a problem with? Which accessors?
they really ought to be functions (mean, stddev) or they are quite confusing
The need for these is common so I thought it would be good to add them. Functions could be added as well.
(``a.min``, ``a.max``
These require tricks to get right so we added them. The doc-strings explain what they do.
which behave quite differently from ``a.argmin`` and ``a.argmax``,
Good point. These are inconsistent with min and max, which were added independently at a later date. I'm thinking we should deprecate the argmin and argmax methods, which I added hoping to do polymorphism for strings and records and if I recall correctly never did anyway. IMHO, min(), max(), mean(), and stddev() are simple, useful, and should remain.
never mind ``numarray.minimum``) or
min != minimum, and because it is a little tricky to get right, we codified it as a method.
simply utterly pointless (``a.nelements`` == ``a.size``)).
I added nelements() because I needed it and didn't know about a.size()... simple as that. a.size() came later for compatibility only. [I'll argue that nelements is far clearer in meaning. What does size mean? Total bytes? Total number of elements? Sorry, I disagree on this one.]
If there really is a need for these (there isn't) if anything a.min and a.max should be called a.flatmin, a.flatmax
flatmin is certainly clear, but the min/max docstrings also explain it with no fuzz.
- diagonal, mean, nelements, nonzero, ...
nonzero(), and diagonal() I could care less about so they can probably be deprecated and removed. I like mean().
- perversely the **only** function that I can think of that could have sensibly become a method hasn't: ``put`` (it used to work only on arrays under Numeric and not without reason, so making it a method would have been sensible; numarray.put of course also "works" on non-arrays, it just doesn't do anything with them)
Well, we need the numarray.put() function for compatibility, and there's already a more succinct syntax for put(), which is array based indexing so I don't see any point in adding a put() method.
Test Code
'''''''''
numtest.py doesn't inspire full confidence (it's about 1000 lines of actual code but it doesn't seem that clearly structured and AFAICT contains no single loop (and that despite the diversity of shapes, types etc. that exist in numarray -- why not try something slightly more systematic?))
Testing could certainly be better. unittest might work better for this kind of thing than doctest. I agree that we should test for a wider variety of shapes, types, sizes, and behaviors but it takes time and effort to do it so it hasn't been done yet. There's little doubt we'd find bugs and the system would be better for it. [On the other hand, is it the most important thing to do next? Any volunteers to improve the test suite? It may not be the most complete and systematic one out there, but it's at least as good as the one for Numeric ;-)]

There's a lot of input here. We'll see what we can do. Thanks again.

Regards,
Todd

[A few more editorial comments. When we started numarray, compatibility was not high on the list of priorities, so the initial implementation didn't focus on it. A number of the problems you point out reflect that origin. While it is more important now, it isn't the only guide. We seek compatibility when there is no strong reason to be incompatible. But there are a number of issues where we definitely wanted different behavior (if it were to be completely compatible, we wouldn't have bothered in the first place; we needed some changes).

Given the odd corners you've run into, it makes me curious to see the code that generated this; particularly with regard to rank-0 arrays. If I get a chance I'll take a look at the link you provided. I wonder if it is typical of what other users will encounter or not. I guess our experience in porting scipy will give us a better indication.

To summarize what we see as work that should be done to address the points made:

rank-0 issues:
1) a.imag doesn't work
2) array(0, type=Float64) + 1e3000 => `inf` with right error modes but array(0, type=Float32) + 1e3000 => `OverflowError`
3) numarray.array(10)/numarray.array(0) => 0
4) numarray.array(10000000000000L) => array(1316134912)
5) numarray.where(0,1,0) => array([0])
6) documentation of behavior (how to turn into scalar, that len and [0] indexing doesn't work, etc.)

Others
1) puts into lists should raise TypeError
   l = [1,2,3]; numarray.put(l,numarray.array([1,2,0]),[0,0,0]); l => [1, 2, 3]
2) repr for zero length arrays needs to show type and other info.
3) rip out _align and _contiguous self-test hacks
4) improved object array handling (e.g., where and the like)
5) average function
6) change MemoryError to ValueError for ones(-5)
7) document matrixmultiply
8) support for __array__ protocol?
9) Documentation fix for p34 row vector usage.
10) Numeric to numarray conversion guide
11) Better tests

Most of these are not likely to get immediate attention as our focus now is on integrating scipy. To the extent they make it easier to do, their priority may be raised. There are a lot of "should"s but we have limited resources just like anyone else; we can't do it all at once.]
participants (2)
- Alexander Schmolck
- Todd Miller