[Matrix-SIG] Interactive Data Analysis

Travis E. Oliphant Oliphant.Travis@mayo.edu
Fri, 15 Jan 1999 18:47:03 -0600


Thanks to Joe Harrington for his desire and vision to get this project
off the ground.  I'm writing to voice my opinion and make a couple of
comments.

I read with quite a bit of interest the paper Joe put on his web page
and the post to this list.  The paper he references is a good discussion
of some important points and I would recommend reading it.  I think the
current state of numerical analysis in Python is already close to many
of the important points in that paper.  Overall, what is most lacking is
the packaging of all (or a selection of) the great wrappers out there
into a single, available, and consistent (and tested) set, and
improvements to the documentation (including a help system that actually
gets used).

While I agree with the sentiment on supporting Windows (I think Joe
meant Windows, not just PC's, right :-) ) and Macs, this is exactly
where the problem lies right now, isn't it?  The single most needed part
of the system that isn't just a document-and-package step away is a good,
free, cross-platform plotting library.   However, there are already some
highly usable plotting packages available for X windows systems.  I find
quite usable both GIST (how is the port to Windows coming?) and DISLIN
(free for Linux and FreeBSD).  Since UNIX/X, or just X, is available for
all of this hardware, a part of me (the anti-social, revenge-seeking
part, admittedly) wants to say "Come back and do numerical analysis with
a real and open OS.  You've already got all kinds of look-pretty,
draw-me-a-graphic, don't-tell-me-how-it-works software for Windows
anyway."  Maybe we can get something together with just an eye towards
the PC/Mac world and even then not be too concerned with making
everything work there.  I'm a little concerned that the quality of our
offering will suffer from an attempt to outfit the Windows-centric world.
O.K. that's enough rant about that.  I'm not trying to offend people.  I
just think that people interested in a free non-proprietary numerical
analysis package should take the time to put it on an operating system
that doesn't also tie them down....

Now, back to the real world.  Perhaps the best solution is to develop
some kind of standard plotting API that allows construction of 2-D and
3-D plots, and then wrap up your favorite plotting library behind an
implementation of the API.  That way people could use (mostly) the same
commands and syntax but have possibly different output libraries.  This
API would need to be relatively complete, so not every plotting library
would do.

Actually, this leads to another idea about plotting that I've been
discussing with a co-worker lately, and that is separating the concepts
of "interactive plotting" and "publication plotting."  In other words,
with a plotting API it would be feasible to use different packages to
interact with the data and to print the data, with the commands being
very much the same.
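To make the idea concrete, here is a minimal sketch of what such a
backend-neutral API might look like.  Everything here is hypothetical
(none of these class or function names come from an existing package);
it only illustrates the "same commands, different output library" idea.

```python
# Hypothetical sketch of a plotting API with swappable backends.

class PlotBackend:
    """Interface that each wrapped plotting library would implement."""
    def plot(self, x, y):
        raise NotImplementedError
    def show(self):
        raise NotImplementedError

class InteractiveBackend(PlotBackend):
    """Would wrap an on-screen library (e.g. GIST or DISLIN)."""
    def plot(self, x, y):
        self.data = (x, y)
    def show(self):
        return "screen: %d points" % len(self.data[0])

class PublicationBackend(PlotBackend):
    """Would emit a figure file (e.g. PostScript) with the same calls."""
    def plot(self, x, y):
        self.data = (x, y)
    def show(self):
        return "figure file: %d points" % len(self.data[0])

def lineplot(backend, x, y):
    # User-level command: identical regardless of the output library.
    backend.plot(x, y)
    return backend.show()
```

The user would switch from interacting with the data to producing the
final figure by changing one object, not the commands.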

The second important issue is the interactive environment.  Since I
added a PYTHONSTARTUP file to my setup, I've been quite happy with the
interactive environment I have.  When I start up, I have all kinds of
functions and variables defined, so it is very easy to "interact" with
the data.  (For example, I don't call LinearAlgebra.generalized_inverse
every time I want to compute a pseudoinverse; I use pinv().)
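For anyone who hasn't tried it, a PYTHONSTARTUP file is just a Python
script that the interpreter runs before the first prompt.  A minimal
sketch (I'm using today's numpy spelling of the pseudoinverse routine;
substitute your own favorite shortcuts):

```python
# Sketch of a PYTHONSTARTUP file: point the PYTHONSTARTUP environment
# variable at this script and every interactive session starts with
# these names already defined.
import numpy as np                # the array package
from numpy.linalg import pinv     # short alias for the pseudoinverse
                                  # (generalized_inverse in old Numeric)

def disp(a):
    """Quick look at an array: shape plus the values."""
    print(np.shape(a), a)
```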

One question is whether we should write a new interpreter loop to
include some nifty interactive editing features.  My personal favorite
is using the up arrow to bring back previous lines that look like what
has been typed so far (just a "very-handy" keystroke variation on the
functionality of GNU readline).
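As it happens, plain GNU readline can already be coaxed into exactly
this behavior; a sketch that could go in the PYTHONSTARTUP file (the
binding syntax is readline's own, and \e[A / \e[B are the usual
arrow-key escape sequences):

```python
# Bind the arrow keys to readline's history-search commands, which
# recall only those previous lines that begin with the text typed so far.
import readline

readline.parse_and_bind(r'"\e[A": history-search-backward')  # up arrow
readline.parse_and_bind(r'"\e[B": history-search-forward')   # down arrow
```

The same two lines, minus the Python syntax, also work in an ~/.inputrc
for every readline-using program.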

At the very least we need to include somebody's fantastic NumPy
PYTHONSTARTUP file in the distribution.

My final comments have to do with the issue of wrapping libraries.  I
agree that we should minimize our efforts in writing new code. I also
agree that our efforts in organizing this code and any routines we add
to it should not be too Python-centric, so that they can be used in
other interactive environments.  It seems to me, however, that what we
have left to do at this point is pretty much all Python-centric, i.e.
it's the wrappings, packaging, and documentation that need to be done.

I guess the real question is: Can the wrappings be done in a
non-python-centric way?  Is SWIG really up to the task?  I'm no SWIG
expert, and I think it is a wonderful tool, but right now I don't think
it can fully "do the job" in a non-Numerical-Python-oriented way.  What
we need is an implementation/extension of SWIG (or maybe, hopefully,
just a good set of examples of how to use SWIG effectively) that is more
"array oriented" or "loop oriented."  I'll share my experience at
wrapping up the FFTW module and the cephes library to explain what I
mean.  The problem is in handling arbitrary Numerical Python Arrays as
inputs and outputs to the wrapped C code. 

The FFTW library wrapping was easy in SWIG.  I added a
Numerical-Python-centric typemap to an otherwise cross-language SWIG
input file that was basically a copy of the header file from FFTW.
After running SWIG, I
had a great little interface to the FFTW calls that could take as inputs
(complex) Numerical python arrays, and return the output as a NumPy
array.  Code that takes as arguments pointers to data blocks of memory
can usually be handled in this way with SWIG.  (We definitely need to
put out some good documentation on how to do this though...)

On the other hand, wrapping the cephes library in SWIG did not appear
possible (again, I don't know everything about SWIG) for what I wanted
to be able to do.  What was needed was something like the ufuncs that
operate so well on arbitrarily sized arrays.  This did not seem possible
to do in SWIG.  So, all of my wrapping was pretty much Python-centric.
In the process I found myself wanting a way to take a routine in C that
has N inputs and M outputs and write a little interface description text
and say "wrap" and produce a wrapper that would allow the procedure to
be called from python with N input numpy arrays of arbitrary (but
consistent) dimensions and return M output numpy arrays. Is this
cephes-specific, or would the idea extend to other wrappings?  Maybe
there is a way to add to the SWIG-like tools we have to make them more
aware of how libraries get used in an interactive data analysis
environment?  Maybe that's too hard and not worth the time?
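The loop-oriented part of such a tool can at least be sketched in pure
Python (loop_wrap is a made-up name, and I'm using today's numpy for the
array handling; a real wrapper generator would emit C that performs the
same loop without the interpreter overhead):

```python
# Sketch: turn a scalar routine with N inputs and M outputs into a
# function taking N arrays of consistent shape and returning M arrays.
import numpy as np

def loop_wrap(scalar_func, n_out):
    def wrapped(*arrays):
        # Broadcast the N inputs to one common shape.
        ins = np.broadcast_arrays(*[np.asarray(a, dtype=float)
                                    for a in arrays])
        shape = ins[0].shape
        outs = [np.empty(shape) for _ in range(n_out)]
        flat_ins = [a.ravel() for a in ins]
        flat_outs = [o.ravel() for o in outs]  # views into outs
        # The loop a generated C wrapper would perform natively.
        for i in range(flat_ins[0].size):
            result = scalar_func(*[a[i] for a in flat_ins])
            if n_out == 1:
                result = (result,)
            for o, r in zip(flat_outs, result):
                o[i] = r
        return outs[0] if n_out == 1 else tuple(outs)
    return wrapped
```

(Today's numpy.vectorize does essentially this for Python callables;
the missing piece is generating the equivalent C from a short interface
description.)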

But again, for now it seems that most of our work will be pretty much
on the Python side.  I guess if we do have to write up some important
numerical routine, we should do it in C.

It's been a long post, but this is an important area for me.  I say
let's get out there and post suggestions for important libraries to
include in the distribution, resolve the plotting issues, divvy up and
get going on the documentation, and smooth out the interactive
environment edges.  In the process maybe we can come up with a really
snazzy way to do the wrappings that will let us use the most drop-in
code in a way that is still consistent with an "array-oriented"
environment.  I think we are very close
to something that we can then tell all our colleagues about and have
them jump aboard.

For the record the libraries I see as vitally important are:

ODE solver
Optimization routines
Zero finding routines
Numerical integration

other useful additions

general n-d convolution
extensive linear filter routines
filter design routines
image processing routines
 (we need to get together with the PIL here.  I think the PIL should be
as NumPy-centered as possible.  I don't think it is wise to have a bunch
of image processing routines that only work on PIL image objects.  After
all, an image is just a special way of interpreting a matrix (or set of
matrices).  We do not want to duplicate work and implement image filters
for NumPy arrays and PIL objects.  I mainly use PIL to get image formats
into and out of NumPy arrays at the moment.)
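That workflow -- PIL purely for file formats, the array package for the
actual processing -- looks roughly like this (using today's PIL/numpy
spellings; Image.new stands in here for opening a real image file):

```python
# Use PIL only as an I/O layer: an image is just a matrix once it is
# inside the array package.  Image.new stands in for Image.open here.
from PIL import Image
import numpy as np

img = Image.new("L", (4, 4), 128)     # 4x4 greyscale "file"
arr = np.asarray(img, dtype=float)    # now it is an ordinary matrix
arr = arr / 255.0                     # ...so any array routine applies
out = Image.fromarray((arr * 255.0).astype(np.uint8))  # back to PIL
```

Any filtering, convolution, or other processing happens on arr with the
same routines we would write for NumPy arrays in general.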



Travis