[Matrix-SIG] a full-blown interactive data analysis environment

Joe Harrington jh@oobleck.tn.cornell.edu
Mon, 18 Jan 1999 01:00:22 -0500


In late November/early December I posted some messages about building
a complete analysis environment around Python.  I got 15 responses
offering help on the project, and several emails in the meantime
wondering when things would get under way.  Sorry to those I've kept
waiting!  I posted a pointer to a paper Paul Barrett and I wrote
regarding the direction such a project should take.  There hasn't been
as much discussion of it as I had hoped, so I'll summarize it here, hope
interested people will read it (it's not long), and close with a few
administrative-type questions.  The URL is
http://oobleck.tn.cornell.edu/jh/ast/papers/idae96.ps.gz .

Data analysis systems have grown nearly as quickly as the volume of
data has.  Fifteen years ago, we were mostly using home-brew compiled
software.  A few institutes had relatively primitive interpreter
interfaces around which a lot of code was written to do specific
tasks.  There wasn't yet a Perl or Python to standardize around; each
group wrote an interpreter from scratch, usually knowing little or
nothing about how *people* interact with computers.  Like the
computers of the time, they felt great to use then but you wouldn't
touch them now.  Just a few years later, it seemed these environments
were stuck.  Communities made huge investments in these languages in
the form of application code that they depended upon for their bread
and butter.  People had spent years writing code for their projects
and they weren't about to do it again.  In many cases the original
authors were gone and those left behind didn't have the programming
skills or the knowledge of the equipment, experiment, method, or
whatever to do the job.  Improving the languages was impossible
without breaking application code, and often the resources weren't
there because those with funding felt that "things worked" and that
anyone who worked for them would have to "just deal".  More than a few
of those '80s-vintage interpreters are still in active use today, being
carried along by gigs of really good application code and being cursed
by their users each time they fire the old dogs up.

Python is not the answer to this problem, nor is any current user
interface.  The problem is that interface technology (be it a
compiler, an interpreted/autocompiled language, WYSIAYG, or whatever)
changes too rapidly, and in just 5 years we will look at the current
Python and be glad we're not stuck with that "old dog".  Python may
grow along with the state of the art and may still be a good answer in
the future, or it may not.  If it doesn't, we'll want something else.
For that matter, many will not want Python now.

In contrast, our application code is stable.  There are usually not a
great many "best" ways to solve a problem, and one or a few are
generally favored.  Once someone has written least squares, why write
it again?  It's only necessary to do so if a new algorithm or a new
language comes along.  New algorithms will always require new code.
However, in the past 35 years there have been only three "favored
languages", those being Fortran, C, and C++.  Though there are
hundreds of other languages, many technically superior to these three,
these three have taken all the prizes in terms of universality.  Write
in one and your code will run everywhere, forever, and roughly as fast
as any portable code can run.  Once you can compile it for a given
machine, anyone can link to it and few need care what it's written in.
There is much application code, of course, that is not written in
these languages.  It is precisely that code that got us stuck with
now-obscure environments in the first place.

The solution, then, is to make a strictly-enforced division between
application code and user interface, and resist the temptation to do
anything that would make changing user interfaces difficult.  This
allows smooth migration to new interfaces without abandoning the
investment in application coding.  The tools to do this are very
familiar to this group: SWIG, dynamic linking, and the standard
compilers.  When new UIs come along, people will only need to learn
the conventions of that UI, not how to use the new versions of the
application code, because the application code will develop on its own
(much slower) schedule.  In 10 years, when Zappitall becomes the new
hot language of choice, all we'll need to do is extend SWIG to make
Zappitall wrappers in addition to the languages it currently supports,
rebuild the system, and poof, instead of seeing 500 announcements
saying "I wrapped my favorite library for Zappitall," we'll see one
announcement saying, "all the numerical and graphical packages we've
ever wrapped can now be used directly from Zappitall".
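
To make this concrete, here is a minimal sketch of what such a glue
layer might look like with SWIG.  The module, header, and routine names
are made up for illustration; the point is only that the interface file
is written once, against the compiled code, and can be retargeted when
the favored user interface changes.

  /* specfun.i -- hypothetical SWIG interface to one routine from an
     existing compiled special-function library. */
  %module specfun
  %{
  #include "specfun.h"   /* header for the existing C application code */
  %}

  /* Written once in C; reused from whatever interpreter is in favor. */
  extern double gammainc(double a, double x);

Running "swig -python specfun.i" would generate the Python glue;
compile that into a shared library, dynamically link it, and "import
specfun" works at the prompt.  When the next favored interpreter
arrives, only the target flag changes (say, "swig -tcl specfun.i"); the
application code and the interface file stay put.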

The picture I'm drawing is nothing terribly new:

Application code is in standard, compiled, highly optimizable
languages like Fortran, C, and C++.  Initially we might use some
existing Python numerical code, but we'll always select Fortran, C,
and C++ code if it's available, and we'll expend some effort to
have a compiled-code option for each major application component.

A glue layer describes the application code to the user interface,
preferably in a simple and general way that will be easily adapted to
future user interfaces.

The user interface(s) has all the latest features everyone wants, and
when this is no longer true, newer interfaces can be dropped in.  It
would even be desirable to have several interfaces to begin with.
This would ensure that there weren't any hidden dependencies on the
UI, and would also appeal to a wider audience since people have their
own tastes.  Often people want their interface language to resemble
their favorite coding language, so they can rapidly prototype a piece
of code they will eventually write in a compiled language.
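
As a rough sketch of that kind of throwaway prototyping, assuming the
hypothetical wrapped specfun module from above (none of these names are
commitments), a session in Python might look like:

  # Quick prototype at the interpreter prompt; the numerical work
  # stays in the compiled library behind the wrapper.
  import specfun

  xs = []
  ys = []
  for i in range(50):
      x = 0.1 * i
      xs.append(x)
      ys.append(specfun.gammainc(2.0, x))   # call into the C code

  # Hand xs and ys to whatever plotting package the current UI favors.
  print(ys[:5])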

That describes the technical end of it, but there's a lot more to it,
especially in the department of user appeal.  This effort will be
large, and it will only be worth it if it gains wide appeal and is in
use for a long time.  People change software packages if they find
something that is much better technically, is easier to use, or is
cheaper.  It is important that we always keep in mind the question:
"What will make the most people happy?" and that we remember we're
talking about the population at large, not just people who think like
we do.  This is a business mindset: make the customer happy and more
will come to you.  Thus, we'll need to target our packaging from the
beginning to the widest range of users.  This impacts our supported
platforms, the documentation, the release packaging, and the way the
package is constructed in general.

I feel strongly that it should run on as many platforms as possible.
That means PC and Mac, not just Unix.  This makes it accessible to
vastly more people, particularly those without the boon of government
or corporate funding, like high school science and math classes,
amateur scientists, hobbyists, etc.  I recently visited a class that
has some real data acquisition capability.  When they measure g in the
lab by dropping a ball, they get 9.8 m/s^2, not 9.6 or 9.57!!  They
have an ultrasonic transducer to measure the motion, and the data it
takes goes right into a software package that plots it out.  Hundreds
of points per second!  They paid a lot (for a school) to get this
setup, and the software was a good fraction of the price.  There's a
big "market" of users there if we can produce something as easy to use
that is free.  Doing so would be a true contribution to society, and
it would take much less effort on the application end than making
something that would satisfy a professional scientist.  This is just
one of many possibilities once we move beyond Unix.  (For those who
care, I'm a die-hard Unix user myself.  This laptop I'm writing on
doesn't even have a Windows or DOS partition, and I don't know how to
compile a program on a Mac or PC.  Yet I think it's critical that we
support them.)

Documentation is a similar item.  Packages like IDL win big points
with simple, worked examples and pamphlet-sized "getting started"
manuals that in a page or two show a user a plot of her own data.  The
main reference manuals do not showcase the programmers' coding prowess
or attitudes, but rather usually show one simple example and one
complex version of the same example.  They use active and direct
language and there are few if any grammar errors.  In addition to
teaching quickly, such manuals give confidence, somehow, in the people
behind the software, as well as in the software itself.

Release packaging will necessarily be in multiple forms.  We'll have
to support the Unix least-common-denominator, tar, as well as
vendor-specific formats from Red Hat, Debian, Sun, and so on, and of
course formats that work on Macs and PCs.  Automating the creation of
these will be important.  I suspect it isn't hard to do.

General construction is where we will have our hardest discussions.
The most commonly heard complaint about software is how hard it is to
learn to use it.  So, simple designs with powerful features will win
over complicated designs that depend on the user understanding it all
before she can do anything.  Certain things have come to work well.
Most of us are happiest when we see command history based on GNU
Readline, and we get ticked if there isn't command history at all.
This is true even if there's a better package out there.  Remember
Beta video tapes?  Beta was better than VHS, but if you were looking to
rent movies, VHS was the way to go.  Many decisions will have to be
based on what's popular and works, rather than on what's pushing the
technical cutting edge.  This doesn't mean we can't support many
options, but my feeling is that the defaults should be things that are
familiar to as many people as possible.  For example, GNU info is more
powerful than HTML, but people know and love their web browsers and
the browsers are quite adequate for the job of viewing documentation.
So while we can make info docs too, HTML docs should be the showcase
item.

Finally, "recycling".  This is hardly going to be a controversial
item, but it bears mentioning.  We should spend much more time
borrowing than writing.  Wrapping will be our friend.  It's what makes
the difference between a huge commercial effort, where every line is
written in-house so that it can be copyrighted and licensed, and a
15-person effort run in our spare time.  We should always hunt for
free or freeable code before we volunteer to write something new.
Then we spend our precious coding resources producing things that
don't already exist.  That's what free software is all about.

The only really daunting thing about the project to me is the
likelihood that I will not be able to devote even as much time to it
as many of those who volunteered to help.  Ideally, one or more funded data
projects would take an interest and contribute some people's time.  I
will propose that we focus initially on a very small version 1.  The
main emphases should be catching the docs and distribution
infrastructure up to the code, getting a development site set up, and
so on.  Once we have a small thing of high quality "out there", it
will generate interest and further contributions.

Ok, administrative stuff.  I've sketched out a bit of a vision, but
it's far from a plan.  I'd like to hear thoughts from the list to
refine things, point out pitfalls, and so on.  This should be a fairly
abstract discussion, and not long.  Please don't talk about naming or
even very much about specific packages that version 1 should or
shouldn't have; please focus on the approach.  Then we should decide
whether matrix-sig is the right place for the development discussions,
and move if that's warranted.  We'll take some votes on basic design
issues and how to run ourselves, and then we'll divvy up work and be
off.  There will be many times during the project when we will want
input from a larger group than just those actively working on the
project.  And of course we'll need testers and proofreaders from time
to time.  In those cases we'll post to this list.

There are some embarrassingly grand statements above.  I don't mean to
sound like I want to save the world, yet after a decade and a half of
being frustrated with data analysis, the answers seem so simple yet so
hard to find in the real world.  If we just select the best
application packages that are already in use, wrap them (initially)
for Python, document them in a clear fashion that connects with a
variety of thinking styles, and assemble easy-to-install
distributions, we will have laid the foundation of a system that may
well survive us.  That's quite a grand statement in the computer
world!

--jh--