[Matrix-SIG] Numeric Nits

Perry Greenfield perry@stsci.edu
Tue, 22 Jun 1999 16:27:59 -0400 (EDT)


> From hinsen@cnrs-orleans.fr Tue Jun 22 06:43 EDT 1999
> Date: Tue, 22 Jun 1999 12:41:33 +0200
> To: perry@stsci.edu
> Subject: Re: [Matrix-SIG] Numeric Nits
> 

> 
> Did you try the Python Imaging Library (PIL)? It contains specialized
> array-like objects for dealing with images. Ideally NumPy and PIL
> would use a common low-level array object, but we don't live in a
> perfect world!
> 
We have looked at it but Numeric comes far closer to having the features
needed for astronomical image processing than does PIL (though it may
prove very useful for displaying such images). We need the full range
of mathematical capabilities of Numeric.

> > The suggested solution (using rank-0 Numeric arrays instead of simple
> > Python constants) often does not work because any Python operation on
> > those arrays turns them back into (promoted) Python scalars.  E.g., after
> >     a = array(1.0, Float32)
> >     b = -a
> > b is a standard Python 64-bit scalar rather than a Float32.
> 
> That sounds like a problem that can be solved easily. Instead of using
> rank-0 arrays, one could use special "Float32 Python scalars", which
> would of course have to be implemented, presumably in NumPy, but that
> is not much work. You would still have to create such objects with
> somewhat clumsy syntax (e.g. Float32(3.5)), because extension modules
> can't introduce new syntax rules, but there would be no more problems
> with constant upcasting once you have made sure that all data is
> in some float32 type.
>
Yes, it is a possibility, but we believe it is not as good a solution
reasons given a little later.
 
> Another solution to the upcasting problem could be the introduction of
> a variant of Float32 arrays that are higher up in the upcasting
> hierarchy than Float64. Then mixed expressions would automatically end
> up in Float32. Whoever uses this type would of course have to watch
> out for precision loss, but it seems to me that in your kind of
> application this is not a problem.
>
This is close to one idea we had tossed around of creating a variant 
of numeric arrays that would not be upcast. Still...
 
> > very maintainable either).  We wonder whether anyone has seriously used
> > Numeric for 16 bit ints or 32 bit floats in cases (like ours) where the
> > memory penalty of converting these to 32 bit ints and doubles is not
> > acceptable.
> 
> Probably not!
>
Then your comment below regarding incompatibility is not likely
applicable! :-)
 
> > The nicest solution from the numerical point of view would be for
> > Python to support these other types.  We do not expect that to happen
> > (nor should it happen).
> 
> Why not? As long as you don't expect to get special syntax for
> literals of the new type, this looks like a relatively simple
> solution (see above).
> 
> > But the problem could be ameliorated by changing Numeric's default
> > behavior in casting constants.  If combining an array and a python
> > scalar of similar types (some form of int with some form of int, or
> > float with double or int), then the default behavior should be to cast the
> > python type to the array type, even if it means down casting.  
> 
> But this would be incompatible with both the current NumPy behaviour
> (possibly breaking existing code) and with standard Python behaviour,
> causing headaches to new users. I think that the principle "never lose
> precision without clearly saying that you want it" is a good one.
>
If you agree that very few people are using Float32 and Int16, then it
is likely that little code will be broken. But even if there were
subtantial amounts of code that would be broken, there is a larger issue
to be addressed. We believe that a large segment of the scientific
community expects to be able to operate with arrays of these types;
these expectations are met with most other array-based languages available
to them. We are arguing that if Python/Numeric is to be seen as a viable
alternative to things like IDL and Matlab, this capability has to be
available in a relatively painless way. Every alternative discussed so
far has some pros and cons. The following summarizes the proposed solutions.

1) Current stituation: consistent with python upcasting behavior, but
use of Float32 and Int16 so painful as to be virtually unusable.
This locks out all users that desire to use these types.

2) Add special scalar types to NumPy. This is better than option 1) yet
still leads to rather ugly code that will put off many scientific users.
(e.g., b = Float32(2.)*(a - Float32(1.)) instead of  b = 2*(a-1) )

3) Add new array types that don't upcast with scalars. Better than 1)
and 2) in our opinion but now we are complicating the array types available
(and if new types are introduced to handle indexing vs mask variants, it
becomes even worse). We are not in a position to be sure but this appears
to probably require more work to implement than changing the existing
default behavior for upcasting. But it certainly is worth considering.

4) Change default scalar casting behavior. Main drawback is that it
introduces casting behavior different than people (Python or otherwise)
are used to. Well, at one level anyway. Consider a new user of Python
and NumPy. They type:

	a = b + 1.
	
Where b is a single precision array. They are used to expecting that 1.
is single precision and that the result ought to be also.  We argue that
the naive user will more likely be surprised by getting a double result
than a single precision result because of this despite the fact that  they
expect the usual upcasting rules. The fundamental problem is that because
Python doesn't support single precision floats there is going to be
unexpected behavior no matter what. We believe that treating arrays 
differently in this respect is likely what most users want.

Those that have no use for single precison arrays really should be little
affected. They are unlikely to be using them now. All the default array
creators give you the default Python type. People who don't use the lower
precision types don't need to worry much about accidently casting down.

After all, there are other aspects of Numeric arrays which are not compatible
with Python either. For example, the semantics of a[0:5] with regard to
copying are completely different. Treating arrays as the heavyweight object
that dictates casting behavior strikes us as the compromise that is
most reasonable.

We believe that changing the default behavior will likely break little code
(but if that isn't true, whomever depends on it now, please let us know).
But even if it did, we think it is worth it to make Python/Numeric acceptable
to a much larger community. This is a comparitively good time to make
such a change.

*********************
 
> Working with reference counts is almost by definition something tricky.
> I won't comment on this idea (which looks interesting, but too
> difficult to verify quickly), but why don't you try it out in plain
> Python, by creating a Python wrapper around array objects which
> implements the optimizations you propose? (You can get the reference
> count via the function sys.getrefcount().)
> 
Hmmm. That is something that we a trying right now.

> [about "put" function]
> 
> > The absence of an agreed upon general solution or someone to do the
> > work should not prevent the adoption and distribution of
> > less-than-general solutions (such as array_set, that comes with Gist)
> > with the standard distribution.  Surely a less-than-general solution is
> > preferable to none.
> 
> Except that it would create a future obligation to maintain
> compatibility. Perhaps NumPy should contain a special module with
> "quick and dirty" code which might be replaced by incompatible
> improvements in the future.
>
True, but we can use different names for the prototype function and
a final one. It's still no excuse for not having any.
 
> > We were surprised that some of the most heavily used functions in
> > Numeric are implemented in Python rather than C, despite their apparent
> > simplicity.  At least some of these really need to be implemented in C
> > to make them decently efficient: arrayrange, nonzero, compress, clip,
> > where.  The current Python versions are especially costly in memory
> 
> Perhaps no one else ever had efficiency problems with them; I
> certainly didn't!
> 
We're sure they are good enough for many uses, but with large arrays it
does make a big difference.

> > We also see a need for a function that rebins an array, e.g., zooming
> > the array by pixel replication or dezooming it by summing blocks of
> > pixels (not simply by slicing to select every N-th pixel.)  Both of
> 
> The first can be done by repeat() (at least along one axis). The second
> problem may be what the mysterious reduceat() operation does, although
> nobody seems to understand it well enough to decide!
> 
> But liks in any case of presumed missing functionality, I am sure that
> a code contribution would be happily accepted by the maintainers...
> 
We are willing to contribute. The purpose of our list was to elicit
comments about whether someone had already solved these problems or
had better suggestions for solutions. But if necessary, we will be willing
to do some or all of this work.

> 
> > There are several Python functions for computing histograms from
> > Numeric arrays, but there still seems to be a need for a simple,
> > general histogram function written in C for efficiency.  This means not
> 
> I have never found a need for a C function. Please try the histogram
> class in version 2 of my ScientificPython package (available at
> ftp://dirac.cnrs-orleans.fr/pub/); it does not sort the array
> and doesn't create huge intermediates. It has performed well for
> all my applications.
> -- 
Actually, this is a good illustration of how performance is context
dependent. What is acceptable in one circumstance may not be suitable
for large datasets. We tried the above routine on a large dataset and
found that it took 87 seconds to compute a 1000 bin histogram on
a 100,000 element double array with random values on a lowly Sparc4.
By comparison, IDL (which does use underlying C code to do the histogram)
took only 0.14 seconds. This is approaching three orders of magnitude
in speed. Since we will be dealing with arrays far larger than this,
it matters a lot to us.