[Numpy-discussion] PEP 209: Multi-dimensional Arrays

Paul Barrett Barrett at stsci.edu
Wed Feb 14 12:09:45 EST 2001


Rob W. W. Hooft writes:
 > Some random PEP talk.
 > 
 > >>>>> "PB" == Paul Barrett <Barrett at stsci.edu> writes:
 >     
 >  PB> Its relation to the other types is defined when the C-extension
 >  PB> module for that type is imported.  The corresponding Python code
 >  PB> is:
 >     
 >      >> Int32.astype[Real64] = Real64
 > 
 > I understand this is to be done by the Int32 C extension module. 
 > But how does it know about Real64?

This approach assumes that there are a basic set of predefined types.
In the above example, the Real64 type is one of them.  But let's
consider creating a completely new type, say Real128.  This type knows
its relation to the other previously defined types, namely Real32,
Real64, etc., but they do not know their relationship to it.  That's
still OK, because the Real128 type is imbued with this required
information and is willing to share it with the other types.
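To make this concrete, here is a rough Python sketch (my own shorthand,
not the PEP's exact API) of how a newly imported Real128 module might
fill in the .astype tables of itself and its predecessors at import
time:

    class NumericType:
        def __init__(self, name, size):
            self.name = name        # e.g. "Int32", "Real64"
            self.size = size        # element size in bytes
            self.astype = {}        # coercion table: other type -> result

    # Previously defined types, each created by its own extension module.
    Int32  = NumericType("Int32", 4)
    Real64 = NumericType("Real64", 8)

    # The new type being added.
    Real128 = NumericType("Real128", 16)

    # Real128 knows its relation to its predecessors...
    Real128.astype[Int32]  = Real128
    Real128.astype[Real64] = Real128

    # ...and it shares that information with them at import time, so
    # mixed expressions work regardless of which operand comes first.
    Int32.astype[Real128]  = Real128
    Real64.astype[Real128] = Real128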

By way of bootstrapping, only one predefined type need be known, say,
Int32.  The operations associated with this type can only be Int32
operations, because this is the only type it knows about.  Yet, we can
add another type, say Real64, which has not only Real64 operations
but also mixed Int32/Real64 operations, since it knows about Int32.
The Real64 type provides the necessary information to relate the
Int32 and Real64 types.  Let's now add a third type, then a fourth,
etc., each knowing about its predecessor types but not its successors.

This approach is identical to the way core Python adds new classes or
C-extension types, so this is nothing new.  The current types do not
know about the new type, but the new type knows about them.  As long
as one type knows the relationship between the two that is sufficient
for the scheme to work.
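The analogy with core Python may be easier to see with a small,
purely illustrative example: a new class can interoperate with the
built-in int without int knowing anything about it, simply by
defining the mixed operations itself.

    class Rational:
        def __init__(self, num, den):
            self.num, self.den = num, den

        def __add__(self, other):           # Rational + int
            if isinstance(other, int):
                return Rational(self.num + other * self.den, self.den)
            return NotImplemented

        __radd__ = __add__                  # int + Rational also works

    r = Rational(1, 3)
    print((r + 2).num, (2 + r).num)         # 7 7, i.e. 7/3 both ways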

 >  PB> Attributes:
 >  PB> .name:                  e.g. "Int32", "Float64", etc.
 >  PB> .typecode:              e.g. 'i', 'f', etc.
 >  PB> (for backward compatibility)
 > 
 > .typecode() is a method now.

Yes, I propose that it become a settable attribute.

 >  PB> .size (in bytes):       e.g. 4, 8, etc.
 > 
 > "element size?"

Yes.

 >  >> add.register('add', (Int32, Int32, Int32), cfunc-add)
 > 
 > Typo: cfunc-add is an expression, not an identifier.

No, it is a Python object that encompasses and describes a C function
that adds two Int32 arrays and returns an Int32 array.  It is
essentially a Python wrapper around a UFunc implemented in C.  It has
been suggested that you should also be able to register Python
expressions using the same interface.
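As a hypothetical sketch of what the Python side of such a registry
could look like (cfunc_add here is just a placeholder for the wrapped
C function, and none of these names are final):

    class UFunc:
        def __init__(self, name):
            self.name = name
            self._table = {}                # signature -> implementation

        def register(self, name, signature, func):
            # signature is a tuple of types: (input1, input2, output)
            self._table[signature] = func

        def lookup(self, signature):
            return self._table[signature]

    def cfunc_add(a, b):                    # stand-in for the C UFunc wrapper
        return a + b

    add = UFunc('add')
    add.register('add', ('Int32', 'Int32', 'Int32'), cfunc_add)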

 > An implementation of a (Int32, Float32, Float32) add is possible and
 > desirable as mentioned earlier in the document. Which C module is
 > going to declare such a combination?
 > 
 >  PB> asstring():             create string from array
 > 
 > Not "tostring" like now?

This is proposed to be a little more consistent with core Python,
which uses 'from-' and 'as-' prefixes.  But I don't have strong
opinions either way.

 >  PB> 4.  ArrayView
 > 
 >  PB> This class is similar to the Array class except that the reshape
 >  PB> and flat methods will raise exceptions, since non-contiguous
 >  PB> arrays cannot be reshaped or flattened using just pointer and
 >  PB> step-size information.
 > 
 > This was completely unclear to me until here. I must say I find this a
 > strange way of handling things. I haven't looked into implementation
 > details, but wouldn't it feel more natural if an Array would just be
 > the "data", and an ArrayView would contain the dimensions and
 > strides. Completely separated. One would always need a pair, but more
 > than one ArrayView could use the same Array.

In my definition, an Array that has no knowledge of its shape and type
is not an Array; it's a data or character buffer.  An Array, in my
definition, is a data buffer with information on how that buffer is to
be mapped, i.e. its shape, type, etc.  An ArrayView is an Array that
shares its data buffer with another Array, but may contain a different
mapping of that buffer, i.e. its shape and type may differ.

If this is what you mean, then the answer is "Yes".  This is how we
intend to implement Arrays and ArrayViews.
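Roughly, and with names that are mine rather than the PEP's, the
relationship looks like this:

    class Array:
        def __init__(self, buffer, shape, typecode):
            self.buffer = buffer            # the raw data
            self.shape = shape              # how the buffer is mapped
            self.typecode = typecode

    class ArrayView(Array):
        def __init__(self, base, shape, typecode, offset=0, strides=None):
            # Share the base Array's buffer; only the mapping differs.
            Array.__init__(self, base.buffer, shape, typecode)
            self.offset = offset
            self.strides = strides          # may describe non-contiguous data

    data = bytearray(16)                            # one shared buffer
    a = Array(data, shape=(4,), typecode='i')       # mapped as 4 Int32s
    v = ArrayView(a, shape=(2,), typecode='i', strides=(8,))
    assert v.buffer is a.buffer                     # same data, new mapping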

 >  PB> a.  _ufunc:
 > 
 >  PB> 1.  Does slicing syntax default to copy or view behavior?
 > 
 > Numeric 1 uses slicing for view, and a method for copy. "Feeling"
 > compatible with core python would require copy on rhs, and view on lhs
 > of an assignment. Is that distinction possible?
 > 
 > If copy is the default for slicing, how would one make a view?

B = A.V[:10] or B = A.view[:10] are some possibilities.  B is then an
instance of the ArrayView class.
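A rough sketch of how the A.view spelling could work (the helper
class here is purely illustrative):

    class _ViewIndexer:
        """Helper so that A.view[i:j] returns a view instead of a copy."""
        def __init__(self, array):
            self._array = array
        def __getitem__(self, index):
            return ArrayView(self._array, index)    # shares A's data

    class Array:
        def __init__(self, data):
            self.data = data
            self.view = _ViewIndexer(self)          # A.view[...] -> view
        def __getitem__(self, index):
            return Array(self.data[index])          # A[...] -> copy

    class ArrayView:
        def __init__(self, base, index):
            self.base = base                        # no data copied
            self.index = index

    A = Array(list(range(100)))
    B = A.view[:10]             # an ArrayView sharing A's data
    C = A[:10]                  # an independent copy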

 >  PB> 2.  Does item syntax default to copy or view behavior?
 > 
 > view.
 > 
 >  PB> Yet, c[i] can be considered just a shorthand for c[i,:] which
 >  PB> would imply copy behavior assuming slicing syntax returns a copy.
 > 
 > If you reason that way, then c is just a shorthand for c[...] too.

Yes, that is correct, but that is not how Python currently behaves.
The motivation for these questions is consistency with core Python
behavior.  The current Numeric does not follow this pattern, for
reasons of performance.  If we assume performance is NOT an issue
(i.e. we can get similar performance by using various tricks), then
what behavior is more intuitive for the average or novice user?
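For reference, core Python sequences copy on slicing, which is the
behavior these questions are trying to stay consistent with:

    lst = [1, 2, 3, 4]
    sub = lst[1:3]              # a new list, not a view
    sub[0] = 99
    print(lst)                  # [1, 2, 3, 4] -- the original is unchanged

    # Current Numeric, by contrast, returns a view here, so the same
    # assignment through the slice would modify the original array.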

 >  PB> 3.  How is scalar coercion implemented?
 > 
 >  PB> Python has fewer numeric types than Numeric which can cause
 >  PB> coercion problems.  For example when multiplying a Python scalar
 >  PB> of type float and a Numeric array of type float, the Numeric array
 >  PB> is converted to a double, since the Python float type is actually
 >  PB> a double.  This is often not the desired behavior, since the
 >  PB> Numeric array will be doubled in size which is likely to be
 >  PB> annoying, particularly for very large arrays.
 > 
 > Sure. That is handled reasonably well by the current Numeric 1.
 > 
 > To extend this, I'd like to comment that I have never really understood
 > the philosophy of taking the largest type for coercion in all languages.
 > Being a scientist, I have learned that when you multiply a very accurate
 > number with a very approximate number, your result is going to be very
 > approximate, not very accurate! It would thus be more logical to have
 > Float32*Float64 return a Float32!

If numeric precision were all that mattered, then you would be correct.
But numeric range is also important.  I would hate to take the chance
of overflowing the above multiplication because I stored the result as
a Float32 instead of a Float64, even though the Float64 is overkill
in terms of precision.  FORTRAN has made an attempt to address this
issue in FORTRAN 9X by allowing the user to indicate the range and
precision of the calculation.
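The range problem can be seen with nothing but core Python and the
struct module: the product below is well within Float32 precision
requirements but outside Float32 range (about 3.4e38), so storing the
result in 32 bits would overflow even though 64 bits is overkill in
precision.

    import struct

    x = 1.0e25                  # very accurate value
    y = 3.0e20                  # very approximate value
    z = x * y                   # 3.0e45: fine as a Float64 (Python float)

    struct.pack('d', z)         # packs into 64 bits without complaint
    try:
        struct.pack('f', z)     # 32 bits cannot represent 3.0e45
    except OverflowError:
        print("overflows Float32")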

 >  PB> In a future version of Python, the behavior of integer division
 >  PB> will change.  The operands will be converted to floats, so the
 >  PB> result will be a float.  If we implement the proposed scalar
 >  PB> coercion rules where arrays have precedence over Python scalars,
 >  PB> then dividing an array by an integer will return an integer array
 >  PB> and will not be consistent with a future version of Python which
 >  PB> would return an array of type double.  Scientific programmers are
 >  PB> familiar with the distinction between integer and float-point
 >  PB> division, so should Numeric 2 continue with this behavior?
 > 
 > Numeric 2 should be as compatible as reasonably possible with core python.
 > But my question is: how would we do integer division of arrays? A ufunc
 > for which no operator shortcut exists?

I don't understand either question.

We have devised a scheme where there are two sets of coercion rules.
One for coercion between array types, and one for array and Python
scalar types.  This latter set of rules can either have higher
precedence for array types or Python scalar types.  We favor array
types having precedence.

A more complex set of coercion rules is also possible, if you prefer.
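Sketched in Python, with purely illustrative names, the two rule sets
might look like this:

    ARRAY_COERCION = {                      # rule set 1: array <op> array
        ('Int32',   'Float64'): 'Float64',
        ('Float32', 'Float64'): 'Float64',
    }

    def result_type(array_type, other_type, other_is_python_scalar=False):
        if other_is_python_scalar:
            # Rule set 2: the array type takes precedence, so e.g. a
            # Float32 array times a Python float stays Float32.
            return array_type
        key = (array_type, other_type)
        if key in ARRAY_COERCION:
            return ARRAY_COERCION[key]
        return ARRAY_COERCION.get((other_type, array_type), array_type)

    print(result_type('Float32', 'Float64'))            # Float64
    print(result_type('Float32', 'float', True))        # Float32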

 >  PB> 7.  How are numerical errors handled (IEEE floating-point errors in
 >  PB> particular)?
 > 
 > I am developing my code on Linux and IRIX. I have seen that where
 > Numeric code on Linux runs fine, the same code on IRIX may "core dump"
 > on a FPE (e.g. arctan2(0,0)). That difference should be avoided.
 > 
 >  PB> a.  Print a message of the most severe error, leaving it to
 >  PB> the user to locate the errors.
 > 
 > What is the most severe error?

Well, divide by zero and overflow come to mind.  Underflows are often
considered less severe.  Yet this is up to you to decide.
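Option (a) could be implemented with a simple severity ordering; the
ordering below is only one possible choice, for illustration:

    SEVERITY = ['divide-by-zero', 'overflow', 'invalid', 'underflow',
                'inexact']

    def report_most_severe(flags):
        """Print only the most severe IEEE exception that was raised."""
        for err in SEVERITY:
            if err in flags:
                print("Warning: %s encountered in array operation" % err)
                return

    report_most_severe({'underflow', 'overflow'})   # reports the overflow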

 >  PB> c.  Minimall UFunc class:
 > 
 > Typo: Minimal?

Got it!


Thanks for your comments.

I-obviously-have-a-lot-of-explaining-to-do-ly yours,
Paul

-- 
Dr. Paul Barrett       Space Telescope Science Institute
Phone: 410-338-4475    ESS/Science Software Group
FAX:   410-338-4767    Baltimore, MD 21218



