[Matrix-SIG] Re: a full-blown interactive data analysis environment

Joe Harrington jh@oobleck.tn.cornell.edu
Tue, 9 Feb 1999 20:34:09 -0500

> Sorry, but I still don't see what you are planning to do...
> My impression is that generic low-level libraries already exist.
> How exactly does your idea differ from combining LAPACK, FFTPACK, etc.
> into one tar file and write wrappers and documentation for all of it?

That is precisely what I want to do.  The shocking thing is that
nobody has done this yet, at least not in a usable way.  Doing it well
is not trivial.

The difference from what has already been done is to do it coherently,
package it well, and provide a place for users and developers to get
it, contribute to it, discuss it, etc., following the model of other
successful free software efforts.  So far, scientific computing has
not taken part in the explosion of capability (and drop in price) that
other kinds of computing have seen as a result of the open-source software
development model.  The rank and file scientists in most disciplines
use commercial software or hire programmers to write in C.  There are
several decent low-level libraries (and many indecent ones), but there
is no integration.

This project will play the same role as Red Hat or Debian do for
Linux.  Linux didn't take off until there were coherent distributions
that were easy to install and easy for hackers to add to.  Before
then, you had to go out and find and compile everything in /usr/bin.
Few did.  Now anyone can go grab a CD and install Linux in 15 minutes.
When you're done, there's a coherent system whose components interact
well.  Now, even if you went through the huge effort of collecting
lots of numerical stuff, getting it to compile on your platform, and
wrapping it, you'd have huge gaps where you couldn't find a package,
the packages wouldn't interact cleanly, and you'd be faced with a
million pages of really awful docs, with no examples of how to use the
stuff together.  You could spend several years full-time getting to
this point, and still not be able to work as easily as you can in a
primitive environment like IDL or even IRAF.

The way I see changing the situation is to decide first what we want
(rather than listing the packages that are available), then go out
and look for all the free implementations of each component,
weigh each, and select packages based on how well they fit into
the overall scheme.  This would include their ease of use, quality of
implementation, level of support, documentation, use of data types
that most closely match our standard ones, etc.  We'd then pick one to use and
write what we needed to get it to interface well.  That might be a
SWIG wrapper or a piece of C/C++ code that translated the data to our
interchange formats.  We'd write tutorial docs that described using
the routines together, with examples.  Where needed, we'd write docs
that described packages whose documentation was deficient.  Finally,
we'd package the whole thing up so that it's easy for a novice to
install it.
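
As a rough illustration of the "translate the data to our interchange
formats" step, here is a minimal Python sketch.  Everything in it is
hypothetical: the Interchange class, from_rows, and
lowlevel_column_sums are made-up names standing in for a shared array
format and a wrapped library routine; none of them come from an
existing package.

```python
from array import array


class Interchange:
    """Hypothetical shared format: flat row-major doubles plus a shape."""

    def __init__(self, data, shape):
        self.data = array("d", data)
        self.shape = tuple(shape)


def from_rows(rows):
    """Flatten a list of equal-length rows into the interchange format."""
    ncols = len(rows[0])
    if any(len(r) != ncols for r in rows):
        raise ValueError("rows must all have the same length")
    flat = [x for r in rows for x in r]
    return Interchange(flat, (len(rows), ncols))


def lowlevel_column_sums(flat, nrows, ncols):
    """Stand-in for a wrapped library routine that expects flat data."""
    return [sum(flat[r * ncols + c] for r in range(nrows))
            for c in range(ncols)]


def column_sums(rows):
    """User-facing wrapper: translate the data, then call the routine."""
    ic = from_rows(rows)
    return lowlevel_column_sums(ic.data, *ic.shape)
```

The point is only that the user calls column_sums on ordinary data and
never sees the layout the underlying routine wants; a real wrapper
might be generated by SWIG instead.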

Of course, for novices to install and begin using it, the main package
has to include at least one interactive environment.  I chose Python
because it's the clear winner among the few languages that handle
arrays.  Such a package would make Python usable for science by anyone
who can use IDL, AIPS, IRAF, etc., without spending months struggling
against the lack of good numerical routines and graphical display.
The assembled package would be free and it would quickly become better
than its competition.  People in various disciplines (including some
on this list, I'm sure) would soon contribute stuff to support work in
specific fields.

I hope this better explains what I'm getting at.  If you have more
questions and haven't done so already, please take the time to read
our article and visit the Interactive Data Analysis Environments web
site.



Joe Harrington
326 Space Sciences Building
Cornell University
Ithaca, NY 14853-6801
(607) 255-5913 office
(607) 255-9002 fax
jh@alum.mit.edu (permanent)