[PYTHON MATRIX-SIG] Handling null data points

Andrew P. Mullhaupt amullhau@ix.netcom.com
Wed, 08 Jan 1997 23:18:08 -0500


Duncan Child wrote:
> 
> Thanks for the suggestions. I am not sure that my previous post made it
> clear that I was talking about null data values in the Numerical Extension
> Python array. I have to use the Numeric arrays because I have so much
> data to work with. Still, if there is a more general solution that can also
> be applied outside the Numeric Extension that would be even better.
>

There is. You want to use NaN values with all the floating types. The
only really annoying problem is that no such value is conveniently
available for integer types.

It is enormously convenient to have a special value "NA" for all numeric
types. The semantics are pretty obvious (you can base them on the IEEE
rules for NaN).
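
As a rough illustration of the IEEE NaN rules that an "NA" could borrow,
here is a minimal sketch in plain Python floats. The name NA and the sample
data are assumptions made up for this example, not part of Numeric or of
any existing proposal.

```python
import math

NA = float("nan")  # hypothetical name for the missing-value marker

readings = [1.5, NA, 3.0]

# Arithmetic propagates NaN: a result that depends on a missing
# input is itself missing.
doubled = [2.0 * x for x in readings]

# NaN compares unequal to everything, including itself, so the test
# for "missing" has to be math.isnan rather than ==.
missing = [math.isnan(x) for x in doubled]

# A summary has to decide explicitly whether to skip missing entries.
valid = [x for x in readings if not math.isnan(x)]
mean_of_valid = sum(valid) / len(valid)
```

The point of the sketch is that no extra bookkeeping travels with the data:
the marker survives ordinary arithmetic on its own.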

A good example of what happens with "NA" is the S language. The only
really weird problem is that for integers they use a value which is not
quite 'MAXINT'.

A good example of what happens if "NA" is not provided is given by the
tons of APL code where the normal approach is a validity mask (i.e. a
boolean array, parallel to an APL array, whose logical values indicate
whether the corresponding value in the APL array is 'valid').
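
The validity-mask approach can be sketched in Python with two parallel
lists. The function names masked_add and masked_mean are hypothetical,
chosen only for this illustration; real APL code would look quite
different.

```python
values = [4.0, 0.0, 7.5, 0.0]       # 0.0 in a masked slot is only a placeholder
valid = [True, False, True, False]  # parallel mask: True means "value is usable"

def masked_add(a, a_ok, b, b_ok):
    """Element-wise sum; a result is valid only if both inputs are valid."""
    total = [x + y for x, y in zip(a, b)]
    ok = [p and q for p, q in zip(a_ok, b_ok)]
    return total, ok

def masked_mean(a, a_ok):
    """Mean of the valid entries, or None if there are none."""
    kept = [x for x, ok in zip(a, a_ok) if ok]
    return sum(kept) / len(kept) if kept else None
```

Note that every operation has to carry and combine the mask by hand, and
the mask costs one extra element of storage per data element.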

Both approaches are useful, but the "NA" approach leads to better
performance and much better memory usage, since no parallel mask has to
be stored or consulted. Given the complete acceptance of IEEE arithmetic
on all useful platforms (no, a Cray is no longer a useful platform),
there is no real obstacle to implementing the "NA" approach.

There is a _ton_ of experience in this direction with both approaches
(almost 35 years of APL and almost 30 years of S).
 
 
> NaN sounds interesting - at the moment I just use 1e20. This works fine
> for me as I only have to handle vectors of floats but it would be nice
> to have a solution that would be applicable to other data types.

Actually, the extensive discussion leading to the authoritative IEEE
arithmetic standard makes many things clear, such as why using a
'sufficiently large (small)' value is a bad idea. Consider the particular
case of a large but otherwise normal value: what happens if the data is
transformed, say by taking logarithms, and then passed to another routine
which needs to know whether that data is valid?
log(1e20) is not 'crazy' enough when you need to play 'spot the looney'.
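
To make the logarithm problem concrete, here is a small sketch comparing
the 1e20 sentinel with NaN. The variable names and the sample values are
assumptions for illustration only.

```python
import math

SENTINEL = 1e20      # the "sufficiently large" marker from the quoted post
NA = float("nan")    # hypothetical name for a NaN-based marker

with_sentinel = [10.0, SENTINEL, 1000.0]
with_nan = [10.0, NA, 1000.0]

logged_sentinel = [math.log(x) for x in with_sentinel]
logged_nan = [math.log(x) for x in with_nan]

# After the transform the sentinel has become roughly 46.05, an entirely
# ordinary-looking number, so a downstream test against 1e20 finds nothing.
still_flagged_sentinel = [x == SENTINEL for x in logged_sentinel]

# The NaN entry is still NaN, so the downstream routine can still find it.
still_flagged_nan = [math.isnan(x) for x in logged_nan]
```

The routine receiving the logged data has no way to tell that the middle
value of the sentinel version was never a real measurement.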

The only really difficult aspect of implementing "NA" is what to use for
integer or other types. Here, the hardware was not designed with a
convenient hook for a "NA" value, so lots of arguments are likely to
result. 
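
One possible shape for an integer "NA", sketched under loud assumptions:
the reserved value below (the most negative 32-bit integer) and the helper
names na_add and na_sum are invented for this example and are not what S,
Numeric, or any particular library is claimed to do.

```python
INT_NA = -2**31  # hypothetical: one 32-bit value reserved to mean "missing"

def na_add(a, b):
    """Add two integers, propagating the reserved NA value."""
    if a == INT_NA or b == INT_NA:
        return INT_NA
    return a + b

def na_sum(xs):
    """Sum a sequence; the result is NA if any element is NA."""
    total = 0
    for x in xs:
        total = na_add(total, x)
    return total
```

Unlike the floating-point case, the propagation here is done in software on
every operation, and one otherwise legitimate integer is given up. With
fixed-width hardware integers there is the further risk that overflow lands
on the reserved value by accident. Those costs are where the arguments are
likely to start.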

In my opinion, an "NA" value is so useful that one will eventually be
provided by somebody in some form, so it is probably important for
people who might have to deal with the consequences of that to think
about the larger issues involved. The only thing worse than having _no_
consistent approach to "NA" is having _more than one_.

Later,
Andrew Mullhaupt

=================
MATRIX-SIG  - SIG on Matrix Math for Python

send messages to: matrix-sig@python.org
administrivia to: matrix-sig-request@python.org
=================