[Numpy-discussion] the direction and pace of development

Thu Jan 22 00:05:01 EST 2004

Good thing Duke is beating Maryland as I read, otherwise, mail like this 
can make you grumpy. :-)

Joe Harrington wrote:

>This is a necessarily long post about the path to an open-source
>replacement for IDL and Matlab.  While I have tried to be fair to
>those who have contributed much more than I have, I have also tried to
>be direct about what I see as some fairly fundamental problems in the
>way we're going about this.  I've given it some section titles so you
>can navigate, but I hope that you will read the whole thing before
>posting a reply.  I fear that this will offend some people, but please
>know that I value all your efforts, and offense is not my intent.
>
>  
>
>THE PAST VS. NOW
>
>While there is significant and dedicated effort going into
>numeric/numarray/scipy, it's becoming clear that we are not
>progressing quickly toward a replacement for IDL and Matlab.  I have
>great respect for all those contributing to the code base, but I think
>the present discussion indicates some deep problems.  If we don't
>identify those problems (easy) and solve them (harder, but not
>impossible), we will continue not to have the solution so many people
>want.  To be convinced that we are doing something wrong at a
>fundamental level, consider that Python was the clear choice for a
>replacement in 1996, when Paul Barrett and I ran a BoF at ADASS VI on
>interactive data analysis environments.  That was over 7 years ago.
>
>  
>
The effort has fallen short of the mark you set.  I also wish the 
community was more efficient at pursuing this goal.  There are 
fundamental issues.  (1) The effort required is large.  (2) Free time is 
in short supply.  (3) Financial support is difficult to come by for 
library development.  Other potential problems would be a lack of 
interest and a lack of competence.  I do not think many of us suffer 
from the first.  As for competence, the development team beyond the 
walls of Enthought self-selects in open source projects, so we're stuck 
with what we've got.  I know most of the people and happen to think they 
are a talented bunch, so I'll consider us no worse than the average 
group of PhDs (some consider that a pretty low bar ...).  I believe the 
tasks that go undone (multi-platform support, bi-yearly releases, 
documentation, etc.) are due more to (2) and (3) above than to some 
other deep (or shallow) issue.

I guess another possibility is organization.  This can be improved 
upon.  Thanks to the gracious help of Caltech (CACR) and NCBR, the 
community has gathered at a low-cost SciPy workshop at Caltech the last 
couple of years.  I believe this is a positive step.  Adding this to the 
newsgroups and mailing lists provides us with a solid framework within 
which to operate.

I still have confidence that we will reach the IDL/Matlab replacement 
point.  We don't have the resources that those products have behind 
them.  We do have a superior language, but without a lot of sweat and 
hours of toiling at grunt work, we don't stand a chance.  As for 
Enthought's efforts, our success in building applications (scientific 
and otherwise) has diverted our developers (myself included) away from 
SciPy as the primary focus.  We do continue to develop it and provide 
significant (for us) financial support to maintain it.  I am lucky 
enough to work with a fine set of software engineers, and I am itching 
for us to get more time devoted to SciPy.  I do believe that we will 
get the opportunity in the future -- it is just a matter of time.  Call 
me an optimist.

>replace IDL or Matlab", the answer was clearly "stable interfaces to
>basic numerics and plotting; then we can build it from there following
>the open-source model".  Work on both these problems was already well
>underway then.  Now, both the numerical and plotting development
>efforts have branched.  There is still no stable base upon which to
>build.  There aren't even packages for popular OSs that people can
>install and play with.  The problem is not that we don't know how to
>do numerics or graphics; if anything, we know these things too well.
>In 1996, if anyone had told us that in 2004 there would be no
>ready-to-go replacement system because of a factor of 4 in small array
>creation overhead (on computers that ran 100x as fast as those then
>available) or the lack of interactive editing of plots at video
>speeds, the response would not have been pretty.  How would you have
>felt?
>
>THE PROBLEM
>
>We are not following the open-source development model.  Rather, we
>pay lip service to it.  Open source's development mantra is "release
>early, release often".  This means release to the public, for use, a
>package that has core capability and reasonably-defined interfaces.
>Release it in a way that as many people as possible will get it,
>install it, use it for real work, and contribute to it.  Make the main
>focus of the core development team the evaluation and inclusion of
>contributions from others.  Develop a common vision for the program,
>and use that vision to make decisions and keep efforts focused.
>Include contributing developers in decision making, but do make
>decisions and move on from them.
>
>Instead, there are no packages for general distribution.  The basic
>interfaces are unstable, and not even being publicly debated to decide
>among them (save for the past 3 days).  The core developers seem to
>spend most of their time developing, mostly out of view of the
>potential user base.  I am asked probably twice a week by different
>fellow astronomers when an open-source replacement for IDL will be
>available.  They are mostly unaware that this effort even exists.
>However, this indicates that there are at least hundreds of potential
>contributors of application code in astronomy alone, as I don't nearly
>know everyone.  The current efforts look rather more like the GNU
>project than Linux.  I'm sorry if that hurts, but it is true.
>
>  
>
Speaking from the standpoint of SciPy, all I can say is we've tried to 
do what you outline here.  The effort of releasing the huge load of 
Fortran/C/C++/Python code across multiple platforms is difficult and 
takes many hours.  I would venture that 90% of the effort on SciPy is 
with the build system.  This means that the exact part of the process 
that you are discussing is the majority of the effort.  We keep a 
version for Windows up to date because that is what our current clients 
use.  In all the other categories, we do the best we can and ask others 
to fill the gaps.  It is also worth saying that SciPy works quite well 
for most purposes once built -- we and others use it daily on commercial 
projects.

>I know that Perry's group at STScI and the fine folks at Enthought
>will say they have to work on what they are being paid to work on.
>Both groups should consider the long term cost, in dollars, of
>spending those development dollars 100% on coding, rather than 50% on
>coding and 50% on outreach and intake.  Linus himself has written only
>a small fraction of the Linux kernel, and almost none of the
>applications, yet in much less than 7 years Linux became a viable
>operating system, something much bigger than what we are attempting
>here.  He couldn't have done that himself, for any amount of money.
>We all know this.
>  
>
Elaborate on the outreach idea for me.  At Enthought, we (spend money 
to) provide funding to core developers outside of our company (Travis 
and Pearu); we (spend money to) give talks at many conferences a year; 
we (spend a little money to) co-sponsor a 70-person workshop on 
scientific computing every year; we have an open mailing list; we 
release most of the general software that we write; and in the past I 
practically begged people to take CVS write access when they provided a 
patch to SciPy.  We even spent a lot of time early on trying to set up 
the scipy.org site as a collaborative Zope-based environment -- an 
effort that was largely a failure.  Still, we have a functioning, 
largely static site, the mailing list, and CVS.  As far as tools go, 
that should be sufficient.

It is impossible to argue with the results though.  Linus pulled off the 
OS model, and Enthought and the SciPy community, thus far, have been less 
successful.  If there are suggestions beyond "spend more *time* 
answering email," I am all ears.  Time is the most precious commodity of 
all these days. 

Also, SciPy has only been around for 3+ years, so I guess we still have 
some rope left.  I continue to believe it'll happen -- this seems like 
the perfect project for open source contributions.

>THE PATH
>
>Here is what I suggest:
>
>1. We should identify the remaining open interface questions.  Not,
>   "why is numeric faster than numarray", but "what should the syntax
>   of creating an array be, and of doing different basic operations".
>   If numeric and numarray are in agreement on these issues, then we
>   can move on, and debate performance and features later.
>  
>
?? I don't get this one.  This interface (at least for numarray) is 
largely decided.  We have argued the points, and Perry et al. at STScI 
made the decisions.  I didn't like some of them, and I'm sure everyone 
else had at least one thing they wished was changed, but that is the way 
this open stuff works. 

It is not the interface but the implementation that started this furor.  
Travis O.'s suggestion was to backport (much of) the numarray interface 
to the Numeric code base so that those stuck supporting large code bases 
(like SciPy) and needing fast small arrays could benefit from the 
interface enhancements.  One or two of them had backward compatibility 
issues with Numeric, so he asked how it should be handled.  Unless some 
magic porting fairy shows up, SciPy will be a Numeric-only tool for the 
next year or so.  This means that users of SciPy either have to forgo 
some of these features or backport them.

On speed:  <excerpt from private mail to Perry>
Numeric is already too slow -- in a recent project we've had to recode 
in C a number of routines that I don't think we should have had to.  
For us, the goal is not to approach Numeric's speed but to significantly 
beat it for all array sizes.  That has to be a possibility for any 
replacement.  Otherwise, our needs (with the exception of a few 
features) are already better met by Numeric.  I have some worries about 
all of the endianness and memory-mapped support that are built into 
Numarray imposing too much overhead for speed-ups on small arrays to be 
possible (this echoes Travis O's thoughts -- we will happily be proven 
wrong).  None of our current work needs these features, and paying a 
price for them is hard to do with an alternative already there.  It is 
fairly easy to improve Numeric's performance on mathematical operations 
by just changing the way the ufunc operations are coded.  With some 
reasonably simple changes, Numeric should be comparable (or at least 
closer) to Numarray speed for large arrays.  Numeric also has a large 
number of other optimizations that can be made (memory is zeroed twice 
in zeros(), asarray was recently improved significantly for the typical 
case, etc.).  Making these changes would help our selling of Python 
and, since we have at least a year's worth of applications that will be 
on the SciPy/Numeric platform, it will also help the quality of these 
applications.

Oh yeah, I have also been surprised at how much of our code uses 
alltrue(), take(), isnan(), etc.  The speed of these array manipulation 
methods is really important for us.
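
A minimal sketch of how the per-call overhead on small arrays might be 
measured.  The names seconds_per_call and make_zeros are hypothetical, 
and the standard-library array module is only a stand-in for Numeric or 
numarray; substitute whichever constructor or routine (zeros(), take(), 
alltrue(), ...) you actually want to time.

```python
import timeit
from array import array


def seconds_per_call(make, n, repeats=5, number=2000):
    """Return the best observed time, in seconds, for one call to make(n)."""
    timer = timeit.Timer(lambda: make(n))
    # Take the minimum over several runs to reduce noise from other processes.
    return min(timer.repeat(repeat=repeats, number=number)) / number


def make_zeros(n):
    # Stand-in for a zeros() constructor: an array of n zero-valued doubles.
    return array("d", [0.0]) * n


if __name__ == "__main__":
    for n in (1, 10, 100, 1000, 10000):
        t = seconds_per_call(make_zeros, n)
        print(f"n={n:>6}  {t * 1e9:10.1f} ns/call  {t / n * 1e9:8.2f} ns/element")
```

If the ns/call figure barely changes between n=1 and n=100, the fixed 
per-call overhead dominates at those sizes, which is the small-array 
regime discussed above.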

>2. We should identify what we need out of the core plotting
>   capability.  Again, not "chaco vs. pyxis", but the list of
>   requirements (as an astronomer, I very much like Perry's list).
>  
>
Yep, we obviously missed on this one.  Chaco (and the related libraries) 
is extremely advanced in some areas but lags in ease-of-use.  It is 
primarily written by a talented and experienced computer scientist (Dave 
Morrill) who likely does not have the perspective of an astronomer.  It 
is clear that areas of the library need to be re-examined, simplified, 
and improved.  Unfortunately, there is not time for us to do that right 
now, and the internals have proven too complex for others to contribute 
to in a meaningful way.  I do not know when this will be addressed.  The 
sad thing here is that STScI won't be using it.  That pains me to no 
end, and Perry and I have tried to figure out some way to make it work 
for them.  But, it sounds like, at least in the short term, there will 
be two new additions to the plotting stable.  We will work hard though 
to make the future Chaco solve STScI's problems (and everyone else's) 
better than it currently does. 

By the way, there is a lot of Chaco bashing going on.  It is worth 
saying that we use Chaco every day in commercial applications that 
require complex graphics and heavy interactivity with great success.  
But, we also have mixed teams of scientists and computer scientists 
along with the "U Manual" (If I have a question, I ask you -- being 
Dave) to answer any questions.  I continue to believe Chaco's 
Traits-based approach is the only one currently out there that has the 
chance of improving on Matlab and other plotting packages available.  
And, while SciPy is moving slowly, Chaco is moving at a frantic 
development pace and gets new capabilities daily (which is part of the 
complaints about it).  I feel certain in saying that it has more 
resources tied to its development than the other plotting options out 
there -- it is just 
currently being exercised in GUI environments instead of as a day-to-day 
plotting tool.  My advice is dig in, learn traits, and learn Chaco. 

>3. We should collect or implement a very minimal version of the
>   featureset, and document it well enough that others like us can do
>   simple but real tasks to try it out, without reading source code.
>   That documentation should include lists of things that still need
>   to be done.
>  
>
>4. We should release a stand-alone version of the whole thing in the
>   formats most likely to be installed by users on the four most
>   popular OSs: Linux, Windows, Mac, and Solaris.  For Linux, this
>   means .rpm and .deb files for Fedora Core 1 and Debian 3.0r2.
>   Tarballs and CVS checkouts are right out.  We have seen that nobody
>   in the real world installs them.  To be most portable and robust,
>   it would make sense to include the Python interpreter, named such
>   that it does not stomp on versions of Python in the released
>   operating systems.  Static linking likewise solves a host of
>   problems and greatly reduces the number of package variants we will
>   have to maintain.
>
>5. We should advertize and advocate the result at conferences and
>   elsewhere, being sure to label it what it is: a first-cut effort
>   designed to do a few things well and serve as a platform for
>   building on.  We should also solicit and encourage people either to
>   work on the included TODO lists or to contribute applications.  One
>   item on the TODO list should be code converters from IDL and Matlab
>   to Python, and compatibility libraries.
>
>6. We should then all continue to participate in the discussions and
>   development efforts that appeal to us.  We should keep in mind that
>   evaluating and incorporating code that comes in is in the long run
>   much more efficient than writing the universe ourselves.
>
>7. We should cut and package new releases frequently, at least once
>   every six months.  It is better to delay a wanted feature by one
>   release than to hold a release for a wanted feature.  The mountain
>   is climbed in small steps.
>
>The open source model is successful because it follows closely
>something that has worked for a long time: the scientific method, with
>its community contributions, peer review, open discussion, and
>progress mainly in small steps.  Once basic capability is out there,
>we can twiddle with how to improve things behind the scenes.
>
>  
>
Everything here is great -- it is the implementation part that is hard.  
I am all for it happening though.

>IS SCIPY THE WAY?
>
>The recipe above sounds a lot like SciPy.  SciPy began as a way to
>integrate the necessary add-ons to numeric for real work.  It was
>supposed to test, document, and distribute everything together.  I am
>aware that there are people who use it, but the numbers are small and
>they seem to be tightly connected to Enthought for support and
>application development.  
>
Not so.   The user base is not huge, but I would conservatively venture 
to say it is in the hundreds to thousands.  We are a company of 12 
without a single support contract for SciPy.

>Enthought's focus seems to be on servicing
>its paying customers rather than on moving SciPy development along,
>  
>
Continuing to move SciPy along at the pace we initially were would have 
ended Enthought -- something had to change.  It is surprising how 
important paying customers are to a company.

>and I fear they are building an installed customer base on interfaces
>that were not intended to be stable.
>  
>
Not sure what you mean here, but I'm all for stable interfaces.  
Huge portions of SciPy's interface haven't changed, and I doubt they 
will change.  I do indeed feel, though, that SciPy is still a 0.2 
release level, so some of the interfaces can change.  It would be 
irresponsible to say otherwise.  This is not "intentionally unstable" 
though...

>So, I will raise the question: is SciPy the way?  Rather than forking
>the plotting and numerical efforts from what SciPy is doing, should we
>not be creating a new effort to do what SciPy has so far not
>delivered?  These are not rhetorical or leading questions.  I don't
>know enough about the motivations, intentions, 
>
Man this sounds like an interview (or interaction) question.  Well, 
we're a company, so we do wish to make money -- otherwise, we'll have to 
do something else.  We also care deeply about science and are 
passionate about scientific computing.  Let's see, what else.  We have 
made most of the things we do open source because we do believe in it in 
principle and as a good development philosophy.  And, even though we all 
wish SciPy was moving faster, SciPy wouldn't be anywhere close to where 
it is without Travis Oliphant and Pearu Peterson -- neither of whom 
would have worked on it had it not been openly available.  That alone 
validates the decision to make it open.

I'm not sure what we have done to make someone question our "motivations 
and intentions" (sounds like a date interrogation), but it is hard to 
think of malicious ones when you are making the fruits of your labors 
and dollars freely available.

>and resources of the
>  
>
Well, we have 12 people, and Pearu and Travis O work with us quite a bit 
also.  The developers here are very good (if I do say so myself), but 
unfortunately primarily working on other projects at the moment.  
Besides scientists/computer scientists, we have a technical writer and a 
human-computer-interface specialist on staff.

>folks at Enthought (and elsewhere) to know the answer.  I do think
>that such a fork will occur unless SciPy's approach changes
>substantially.  
>
Enthought has more commitments than we used to.  SciPy remains important 
and core to what we do, it just has to share time with other things.  
Luckily Pearu and Travis have kept their ears to the ground to help out 
people on the mailing lists as well as working on the codebase.

I'm not sure what our approach has been that would force a fork... It 
isn't like someone has come and asked to be release manager, offered to 
keep the web pages up to date, provided peer review of code, etc., and 
we have turned them away.  Almost from the beginning most effort has 
been provided by a small team (fairly standard for OS stuff).  We have 
repeatedly pointed out areas we need help at the conference and in mail 
-- code reviews, build help, release help, etc.  In fact, I double dare 
ya to ask to manage the next release or the documentation effort.  
okay... triple dare ya.

Some people (like Konrad, I believe) have philosophical differences with 
how SciPy is packaged and believe it should be 12 smaller packages 
instead of one large one.  This has its own set of problems obviously, 
but forking based on this kind of principle would make at least a 
modicum of sense. 

Forking because you don't like the pace of the project makes zero 
sense.  Pitch in and solve the problem.  The social barriers are very 
small.  The code barriers (build, etc.) are what need to be solved.

>The way to decide is for us all to discuss the
>question openly on these lists, and for those willing to participate
>and contribute effort to declare so openly.  I think all that is
>needed, either to help SciPy or replace it, is some leadership in the
>direction outlined above.  I would be interested in hearing, perhaps
>from the folks at Enthought, alternative points of view.  Why are
>there no packages for popular OSs for SciPy 0.2?  
>
Please build them, ask for web credentials, and upload them.  Then 
answer the questions people have about them on the mailing list.  It is 
as simple as that.  There is no magic here -- just work.

>Why are releases so
>infrequent?  
>
Ditto.

>If the folks running the show at scipy.org disagree with
>many others on these lists, then perhaps those others would like to
>roll their own.  Or, perhaps stable/testing/unstable releases of the
>whole package are in order.
>
>HOW TO CONTRIBUTE?
>
>Judging by the number of PhDs in sigs, there are a lot of researchers
>on this list.  I'm one, and I know that our time for doing core
>development or providing the aforementioned leadership is very
>limited, if not zero.
>
Surprisingly, commercial developers have about the same amount of free time.

>  Later we will be in a much better position to
>contribute application software.  However, there is a way we can
>contribute to the core effort even if we are not paid, and that is to
>put budget items in grant and project proposals to support the work of
>others.  
>
For the academics, supporting a *dedicated* student to maintain SciPy 
would be a much more cost-effective use of your dollars.  Unfortunately, 
it is hard to get a PhD for supporting SciPy...

<begin shameless plugs that somehow seem appropriate here>
For companies, national laboratories, etc., supporting development on 
SciPy (or numarray) directly is a great idea.  Projects that we work on 
in other areas also indirectly support SciPy, Chaco, etc. so get us 
involved with the development efforts at your company/lab.

Other options?  Government (NASA, Military, NIH, etc) and national lab 
people can get SciPy/numarray/Python related SBIR 
(http://www.acq.osd.mil/sadbu/sbir/) topics that would impact their 
research/development put on the solicitation list this summer.  Email me 
if you have any questions on this.  ASCI people can propose PathForward 
projects.   There are probably numerous other ways to do this.  We will 
have a GSA schedule soon, so government contracting will also work.
</end shameless plug>

>subcontractors at places like Enthought or STScI.  A handful of
>contributors would be all we'd need to support someone to produce OS
>packages and tutorial documentation (the stuff core developers find
>boring) for two releases a year.
>  
>
Joe, as you say, things haven't gone as fast as any of us would wish, 
but it hasn't been for lack of trying.  Many of us have put zillions of 
hours into this.  The results are actually quite stable tools.  Many 
people use Numeric/Numarray/SciPy in daily work without problems.  But, 
like Linux in the early years, they still require "geeks" willing to do 
some amount of meddling to use them.  Huge resources (developer and 
financial) have been pumped into Linux to get it to the point it's at 
today.  Anything we can do to increase the participation in building 
tools and financially supporting those who do build tools, I am all 
for...  I'd love to see releases on 10 platforms and full documentation 
for the libraries as well as the next person.

Whew, and Duke managed to hang on and win. 

my .01 worth,
eric

>--jh--
>
>