the direction and pace of development
This is a necessarily long post about the path to an open-source replacement for IDL and Matlab. While I have tried to be fair to those who have contributed much more than I have, I have also tried to be direct about what I see as some fairly fundamental problems in the way we're going about this. I've given it some section titles so you can navigate, but I hope that you will read the whole thing before posting a reply. I fear that this will offend some people, but please know that I value all your efforts, and offense is not my intent.

THE PAST VS. NOW

While there is significant and dedicated effort going into numeric/numarray/scipy, it's becoming clear that we are not progressing quickly toward a replacement for IDL and Matlab. I have great respect for all those contributing to the code base, but I think the present discussion indicates some deep problems. If we don't identify those problems (easy) and solve them (harder, but not impossible), we will continue not to have the solution so many people want. To be convinced that we are doing something wrong at a fundamental level, consider that Python was the clear choice for a replacement in 1996, when Paul Barrett and I ran a BoF at ADASS VI on interactive data analysis environments. That was over 7 years ago. When people asked at that conference, "what does Python need to replace IDL or Matlab", the answer was clearly "stable interfaces to basic numerics and plotting; then we can build it from there following the open-source model". Work on both these problems was already well underway then.

Now, both the numerical and plotting development efforts have branched. There is still no stable base upon which to build. There aren't even packages for popular OSs that people can install and play with. The problem is not that we don't know how to do numerics or graphics; if anything, we know these things too well.
In 1996, if anyone had told us that in 2004 there would be no ready-to-go replacement system because of a factor of 4 in small array creation overhead (on computers that ran 100x as fast as those then available) or the lack of interactive editing of plots at video speeds, the response would not have been pretty. How would you have felt?

THE PROBLEM

We are not following the open-source development model. Rather, we pay lip service to it. Open source's development mantra is "release early, release often". This means release to the public, for use, a package that has core capability and reasonably-defined interfaces. Release it in a way that as many people as possible will get it, install it, use it for real work, and contribute to it. Make the main focus of the core development team the evaluation and inclusion of contributions from others. Develop a common vision for the program, and use that vision to make decisions and keep efforts focused. Include contributing developers in decision making, but do make decisions and move on from them.

Instead, there are no packages for general distribution. The basic interfaces are unstable, and not even being publicly debated to decide among them (save for the past 3 days). The core developers seem to spend most of their time developing, mostly out of view of the potential user base. I am asked probably twice a week by different fellow astronomers when an open-source replacement for IDL will be available. They are mostly unaware that this effort even exists. However, this indicates that there are at least hundreds of potential contributors of application code in astronomy alone, as I don't nearly know everyone. The current efforts look rather more like the GNU project than Linux. I'm sorry if that hurts, but it is true.

I know that Perry's group at STScI and the fine folks at Enthought will say they have to work on what they are being paid to work on.
Both groups should consider the long term cost, in dollars, of spending those development dollars 100% on coding, rather than 50% on coding and 50% on outreach and intake. Linus himself has written only a small fraction of the Linux kernel, and almost none of the applications, yet in much less than 7 years Linux became a viable operating system, something much bigger than what we are attempting here. He couldn't have done that himself, for any amount of money. We all know this.

THE PATH

Here is what I suggest:

1. We should identify the remaining open interface questions. Not, "why is numeric faster than numarray", but "what should the syntax of creating an array be, and of doing different basic operations". If numeric and numarray are in agreement on these issues, then we can move on, and debate performance and features later.

2. We should identify what we need out of the core plotting capability. Again, not "chaco vs. pyxis", but the list of requirements (as an astronomer, I very much like Perry's list).

3. We should collect or implement a very minimal version of the featureset, and document it well enough that others like us can do simple but real tasks to try it out, without reading source code. That documentation should include lists of things that still need to be done.

4. We should release a stand-alone version of the whole thing in the formats most likely to be installed by users on the four most popular OSs: Linux, Windows, Mac, and Solaris. For Linux, this means .rpm and .deb files for Fedora Core 1 and Debian 3.0r2. Tarballs and CVS checkouts are right out. We have seen that nobody in the real world installs them. To be most portable and robust, it would make sense to include the Python interpreter, named such that it does not stomp on versions of Python in the released operating systems. Static linking likewise solves a host of problems and greatly reduces the number of package variants we will have to maintain.

5. We should advertise and advocate the result at conferences and elsewhere, being sure to label it what it is: a first-cut effort designed to do a few things well and serve as a platform for building on. We should also solicit and encourage people either to work on the included TODO lists or to contribute applications. One item on the TODO list should be code converters from IDL and Matlab to Python, and compatibility libraries.

6. We should then all continue to participate in the discussions and development efforts that appeal to us. We should keep in mind that evaluating and incorporating code that comes in is in the long run much more efficient than writing the universe ourselves.

7. We should cut and package new releases frequently, at least once every six months. It is better to delay a wanted feature by one release than to hold a release for a wanted feature. The mountain is climbed in small steps.

The open source model is successful because it follows closely something that has worked for a long time: the scientific method, with its community contributions, peer review, open discussion, and progress mainly in small steps. Once basic capability is out there, we can twiddle with how to improve things behind the scenes.

IS SCIPY THE WAY?

The recipe above sounds a lot like SciPy. SciPy began as a way to integrate the necessary add-ons to numeric for real work. It was supposed to test, document, and distribute everything together. I am aware that there are people who use it, but the numbers are small and they seem to be tightly connected to Enthought for support and application development. Enthought's focus seems to be on servicing its paying customers rather than on moving SciPy development along, and I fear they are building an installed customer base on interfaces that were not intended to be stable. So, I will raise the question: is SciPy the way?
Rather than forking the plotting and numerical efforts from what SciPy is doing, should we not be creating a new effort to do what SciPy has so far not delivered? These are not rhetorical or leading questions. I don't know enough about the motivations, intentions, and resources of the folks at Enthought (and elsewhere) to know the answer. I do think that such a fork will occur unless SciPy's approach changes substantially. The way to decide is for us all to discuss the question openly on these lists, and for those willing to participate and contribute effort to declare so openly. I think all that is needed, either to help SciPy or replace it, is some leadership in the direction outlined above. I would be interested in hearing, perhaps from the folks at Enthought, alternative points of view. Why are there no packages for popular OSs for SciPy 0.2? Why are releases so infrequent? If the folks running the show at scipy.org disagree with many others on these lists, then perhaps those others would like to roll their own. Or, perhaps stable/testing/unstable releases of the whole package are in order.

HOW TO CONTRIBUTE?

Judging by the number of PhDs in sigs, there are a lot of researchers on this list. I'm one, and I know that our time for doing core development or providing the aforementioned leadership is very limited, if not zero. Later we will be in a much better position to contribute application software. However, there is a way we can contribute to the core effort even if we are not paid, and that is to put budget items in grant and project proposals to support the work of others. Those others could be either our own employees or subcontractors at places like Enthought or STScI. A handful of contributors would be all we'd need to support someone to produce OS packages and tutorial documentation (the stuff core developers find boring) for two releases a year.

--jh--
On 21.01.2004, at 19:44, Joe Harrington wrote:
> This is a necessarily long post about the path to an open-source replacement for IDL and Matlab. While I have tried to be fair to
You raise many good points here. Some comments:
> those who have contributed much more than I have, I have also tried to be direct about what I see as some fairly fundamental problems in the way we're going about this. I've given it some section titles so you
I'd say the fundamental problem is that "we" don't exist as a coherent group. There are a few developer groups (e.g. at STScI and Enthought) who write code primarily for their own needs and then make it available. The rest of us are what one could call "power users": very interested in the code, knowledgeable about its use, but not contributing to its development other than through testing and feedback.
> THE PROBLEM
> We are not following the open-source development model. Rather, we
True. But is it perhaps because that model is not so well adapted to our situation? If you look at Linux (the open-source reference), it started out very differently. It was a fun project, done by hobby programmers who shared an idea of fun (kernel hacking). Linux was not goal-oriented in the beginning. No deadlines, no usability criteria, but lots of technical challenges.

Our situation is very different. We are scientists and engineers who want code to get our projects done. We have clear goals and very limited means, plus we are mostly someone's employees and thus not free to do as we would like. On the other hand, our project doesn't provide the challenges that attract the kind of people who made Linux big. You don't get into the news by working on NumPy, you don't work against Microsoft, etc. Computational science and engineering just isn't the same as kernel hacking.

I develop two scientific Python libraries myself, more specialized and thus with a smaller market share, but the situation is otherwise similar. And I work much like the Numarray people do: I write the code that I need, and I invest minimal effort in distribution and marketing. To get the same code developed in the Linux fashion, there would have to be many more developers. But they just don't exist. I know of three people worldwide whose competence in both Python/C and in the application domain is good enough that they could work on the code base. This is not enough to build a networked development community. The potential NumPy community is certainly much bigger, but I am not sure it is big enough. Working on NumPy/Numarray requires a combination of not-so-frequent competences, plus availability. I am not saying it can't be done, but it sure isn't obvious that it can be.
> Release it in a way that as many people as possible will get it, install it, use it for real work, and contribute to it. Make the main focus of the core development team the evaluation and inclusion of contributions from others. Develop a common vision for the program,
This requires yet different competences, and thus different people. It takes people who are good at reading others' code and communicating with them about it. Some people are good programmers, some are good scientists, some are good communicators. How many are all of that - *and* available?
> I know that Perry's group at STScI and the fine folks at Enthought will say they have to work on what they are being paid to work on. Both groups should consider the long term cost, in dollars, of spending those development dollars 100% on coding, rather than 50% on coding and 50% on outreach and intake. Linus himself has written only
You are probably right. But does your employer think long-term? Mine doesn't.
> applications, yet in much less than 7 years Linux became a viable operating system, something much bigger than what we are attempting
Exactly. We could be too small to follow the Linux way.
> 1. We should identify the remaining open interface questions. Not, "why is numeric faster than numarray", but "what should the syntax of creating an array be, and of doing different basic operations".
Yes, a very good point. Focus on the goal, not on the legacy code. However, a technical detail that should not be forgotten here: NumPy and Numarray have a C API as well, which is critical for many add-ons and applications. A C API is more closely tied to the implementation than a Python API. It might thus be difficult to settle on an API and then work on efficient implementations.
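To make the interface question concrete: the snippet below uses the modern numpy package (which postdates this thread) purely as a runnable stand-in for the array-creation and basic-operation syntax that Numeric and numarray largely shared at the Python level.

```python
import numpy as np

# Array creation: the "syntax of creating an array" under discussion.
a = np.array([1, 2, 3])            # from a Python sequence
b = np.zeros((2, 3))               # from a shape, zero-filled
c = np.arange(6).reshape(2, 3)     # from a range, then reshaped

# Basic operations: elementwise arithmetic with broadcasting.
d = b + c * 2
```

If the two packages agree on this layer, performance and features can be debated separately, as point 1 suggests.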
> 2. We should identify what we need out of the core plotting capability. Again, not "chaco vs. pyxis", but the list of requirements (as an astronomer, I very much like Perry's list).
100% agreement. For plotting, defining the interface should be easier (no C stuff). Konrad.
I would like to thank the contributors to the discussion, as I think one of the problems we have had lately is that people haven't been talking much. Partly because we have some fundamental differences of opinion caused by different goals, and partly because we are all busy working on a variety of other pressing projects.

The impression has been that Numarray will replace Numeric. I agree with Perry that this has always been less of a consensus and more of a hope. I am more than happy for Numarray to replace Numeric as long as it doesn't mean all my code slows down. I would say the threshold is that my code can't slow down by more than 10%. If there is a code-base out there (Numeric) that allows my code to run 10% faster, it will get used. I also don't think it's ideal to have multiple N-D arrays running around out there, but if they all have the same interface then it doesn't really matter.

The two major problems I see with Numarray replacing Numeric are:

1) How is UFunc support? Can you create ufuncs in C easily (with a single function call or something similar)?

2) Speed for small arrays (array creation is the big one). It is actually quite a common thing to have a loop during which many small arrays get created and destroyed. Yes, you can usually make such code faster by "vectorizing" (if you can figure out how). But the average scientist just wants to (and should be able to) just write a loop.

Regarding speed issues: actually, there are situations where I am very unsatisfied with Numeric's speed performance, and so the goal for Numarray should not be to achieve some percentage of Numeric's performance but to beat it. Frankly, I don't see how you can get the speed I'm talking about while carrying around a lot of extras like byte-swapping support, memory-mapping support, and record-array support.

*Question*: Is there some way to turn on a flag in Numarray so that all of the extra stuff is ignored (i.e.
create a small-array that looks on a binary level just like a Numeric array)? It would seem to me that this is the only way that the speed issue will go away.

Given that 1) Numeric already works and all of my code depends on it, 2) Numarray doesn't seem to have support for general purpose ufunctions (can the scipy.special package be ported to numarray?), 3) Numarray is slower for the common tasks I end up using SciPy for, and 4) I actually understand the Numeric code base quite well, I have a hard time justifying switching over to Numarray.

Thanks again for the comments.

-Travis O.
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Travis Oliphant writes:
> The two major problems I see with Numarray replacing Numeric are
> 1) How is UFunc support? Can you create ufuncs in C easily (with a single function call or something similar).
Different, but I don't think it is difficult to add ufuncs (and it is probably easier if many types must be supported, though I doubt that is much of an issue for most mathematical functions, which are generally needed only for the float types and perhaps complex).
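Neither message shows what "creating a ufunc" looks like. As a hedged sketch, using the modern numpy package (a descendant of both Numeric and numarray, used here only so the example runs today): a plain Python function can be wrapped into a broadcasting ufunc in one call, and C-level ufuncs are registered through a similarly compact API (`PyUFunc_FromFuncAndData`, one inner loop per supported type).

```python
import math
import numpy as np

def clipped_log(x):
    # Ordinary scalar Python function: log, saturating at 0 for x <= 0.
    return math.log(x) if x > 0 else 0.0

# Wrap it as a ufunc (1 input, 1 output); it now broadcasts over arrays.
ulog = np.frompyfunc(clipped_log, 1, 1)

# frompyfunc yields object-dtype results, so cast back to float.
out = ulog(np.array([1.0, math.e, -5.0])).astype(float)
```

The C route is more work per type, but it is still essentially one registration call plus the inner loops.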
> 2) Speed for small arrays (array creation is the big one).
This is the much harder issue. I do wonder if it is possible to make numarray any faster than Numeric on this point (or, as others mention later, whether the complexity it introduces is worth it).
> It is actually quite a common thing to have a loop during which many small arrays get created and destroyed. Yes, you can usually make such code faster by "vectorizing" (if you can figure out how). But the average scientist just wants to (and should be able to) just write a loop.
I'll pick a small bone here. Well, yes, and I could say that a scientist should be able to write loops that iterate over all array elements and expect them to run as fast. But they can't. After all, using an array language within an interpreted language implies that users must cast their problems into array manipulations for it to work efficiently. By using Numeric or numarray they *must* buy into vectorizing at some level. Having said that, it certainly is true that there are problems with small arrays that cannot be easily vectorized by combining them into higher-dimension arrays (I think the two most common cases are variable-sized small arrays, and iterative algorithms on small arrays that must be repeated many times; some of these problems can be cast into larger vectors, but often not really easily).
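The trade-off described above can be made concrete with a small sketch (again using modern numpy as a runnable stand-in): the same computation written as the "average scientist" loop and as the vectorized array expression the package rewards.

```python
import numpy as np

x = np.arange(10000, dtype=float)

# Explicit Python loop: easy to write, but every iteration pays
# interpreter overhead.
total_loop = 0.0
for v in x:
    total_loop += v * v

# Vectorized form: one array expression, the loop runs in compiled code.
total_vec = float(np.sum(x * x))
```

Both compute the same sum of squares; only the second exploits the array language. The variable-sized and iterative cases mentioned above are exactly those where no such one-line rewrite exists.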
> Regarding speed issues. Actually, there are situations where I am very unsatisfied with Numeric's speed performance and so the goal for Numarray should not be to achieve some percentage of Numeric's performance but to beat it.
> Frankly, I don't see how you can get speed that I'm talking about by carrying around a lot of extras like byte-swapping support, memory-mapping support, record-array support.
You may be right. But then I would argue that if one wants to speed up small array performance, one should really go for big improvements. To do that suggests taking a significantly different approach than either Numeric or numarray. But that's a different topic ;-)

To me, factors of a few are not necessarily worth the trouble (and I wonder how much of the phase space of problems they really help move into feasibility). Yes, if you've written a bunch of programs that use small arrays and are marginally fast enough, then a factor of two slower is painful. But there are many other small array problems that were too slow already and couldn't be done anyway. The ones that weren't marginal will likely still be acceptable. Those that live in the grey zone now are the ones that are most sensitive to the issue. All the rest don't care. I don't have a good feel for how many live in the grey zone. I know some do.

Perry Greenfield
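The "grey zone" here is about per-array overhead rather than element throughput. A minimal, hedged way to see it (with modern numpy as a stand-in; absolute timings are machine-dependent, only the ratio matters):

```python
import time
import numpy as np

def many_small(n=20000):
    # n tiny arrays: per-call creation overhead dominates the cost.
    t0 = time.perf_counter()
    for _ in range(n):
        np.array([1.0, 2.0, 3.0])
    return time.perf_counter() - t0

def one_large(n=20000):
    # Same total element count in a single allocation.
    t0 = time.perf_counter()
    np.zeros(3 * n)
    return time.perf_counter() - t0

small_t, large_t = many_small(), one_large()
```

Comparing the many-small figure across two array packages is essentially the "factor of 4 in small array creation" measurement cited earlier in the thread.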
Joe Harrington writes:
> This is a necessarily long post about the path to an open-source replacement for IDL and Matlab. While I have tried to be fair to those who have contributed much more than I have, I have also tried to be direct about what I see as some fairly fundamental problems in the way we're going about this. I've given it some section titles so you can navigate, but I hope that you will read the whole thing before posting a reply. I fear that this will offend some people, but please know that I value all your efforts, and offense is not my intent.
No offense taken. [...]
> THE PROBLEM
> We are not following the open-source development model. Rather, we pay lip service to it. Open source's development mantra is "release early, release often". This means release to the public, for use, a package that has core capability and reasonably-defined interfaces. Release it in a way that as many people as possible will get it, install it, use it for real work, and contribute to it. Make the main focus of the core development team the evaluation and inclusion of contributions from others. Develop a common vision for the program, and use that vision to make decisions and keep efforts focused. Include contributing developers in decision making, but do make decisions and move on from them.
> Instead, there are no packages for general distribution. The basic interfaces are unstable, and not even being publicly debated to decide among them (save for the past 3 days). The core developers seem to spend most of their time developing, mostly out of view of the potential user base. I am asked probably twice a week by different fellow astronomers when an open-source replacement for IDL will be available. They are mostly unaware that this effort even exists. However, this indicates that there are at least hundreds of potential contributors of application code in astronomy alone, as I don't nearly know everyone. The current efforts look rather more like the GNU project than Linux. I'm sorry if that hurts, but it is true.
I'd both agree with this and disagree. Agree in the sense that many agree these are desirable traits of an open source project. Disagree in the sense that many projects don't meet all of these traits, and yet may be useful to some degree. Even Python is not released often, nor is it generally packaged by the core group. You will find packaging by special interest groups that may or may not be up to date for various platforms. There is a whole spectrum of other, useful open source projects that don't satisfy these requirements. I don't mean that in a defensive way; it's certainly fair to ask what is going wrong in the Python numeric world, but doing the above alone doesn't necessarily guarantee that you will be successful in attracting feedback and contributions; there are other factors as well that influence how a project develops. We have had experience with the packaging issue for PyRAF, and it isn't quite so simple: the packaged binary approach didn't always make life simpler for the user (arguably, we have found the source distribution approach more trouble-free than our original release). Having one's own version of Python packaged as a binary raises issues with LD_LIBRARY_PATH that there are just no good solutions to.
> I know that Perry's group at STScI and the fine folks at Enthought will say they have to work on what they are being paid to work on. Both groups should consider the long term cost, in dollars, of spending those development dollars 100% on coding, rather than 50% on coding and 50% on outreach and intake. Linus himself has written only a small fraction of the Linux kernel, and almost none of the applications, yet in much less than 7 years Linux became a viable operating system, something much bigger than what we are attempting here. He couldn't have done that himself, for any amount of money. We all know this.
I'd say we have tried our best to solicit input (and accept contributed code as well). You have to remember that how easily contributions come depends on what the critical mass is for usefulness. For something like numarray or Numeric, that critical mass is quite large. Few are interested in contributing when it can do very little and an older package exists that can do more. By the time it has comparable functionality, it is already quite large. A lot of projects like that start with a small group before more join in. There are others where the critical mass is low and many join in when functionality is still relatively low.
> THE PATH
> Here is what I suggest:
> 1. We should identify the remaining open interface questions. Not, "why is numeric faster than numarray", but "what should the syntax of creating an array be, and of doing different basic operations". If numeric and numarray are in agreement on these issues, then we can move on, and debate performance and features later.
Well, there are, and continue to be, those that can't come to an agreement on even the interface. These issues have been raised many times in the past. Often consensus was hard to achieve. We tended to lean towards backward compatibility unless the change seemed really necessary. For type coercion and error handling, we thought it was. But I don't think we have tried to shield the decision-making process from the community. I do think the difficulty in achieving a sense of consensus is a problem. Perhaps we are going about the process in the wrong way; I'd welcome suggestions as to how to improve that.
> 2. We should identify what we need out of the core plotting capability. Again, not "chaco vs. pyxis", but the list of requirements (as an astronomer, I very much like Perry's list).
> 3. We should collect or implement a very minimal version of the featureset, and document it well enough that others like us can do simple but real tasks to try it out, without reading source code. That documentation should include lists of things that still need to be done.
> 4. We should release a stand-alone version of the whole thing in the formats most likely to be installed by users on the four most popular OSs: Linux, Windows, Mac, and Solaris. For Linux, this means .rpm and .deb files for Fedora Core 1 and Debian 3.0r2. Tarballs and CVS checkouts are right out. We have seen that nobody in the real world installs them. To be most portable and robust, it would make sense to include the Python interpreter, named such that it does not stomp on versions of Python in the released operating systems. Static linking likewise solves a host of problems and greatly reduces the number of package variants we will have to maintain.
Static linking also introduces other problems. And we have gone this route in the past so we have some knowledge of what it entails.
> 5. We should advertise and advocate the result at conferences and elsewhere, being sure to label it what it is: a first-cut effort designed to do a few things well and serve as a platform for building on. We should also solicit and encourage people either to work on the included TODO lists or to contribute applications. One item on the TODO list should be code converters from IDL and Matlab to Python, and compatibility libraries.
> 6. We should then all continue to participate in the discussions and development efforts that appeal to us. We should keep in mind that evaluating and incorporating code that comes in is in the long run much more efficient than writing the universe ourselves.
> 7. We should cut and package new releases frequently, at least once every six months. It is better to delay a wanted feature by one release than to hold a release for a wanted feature. The mountain is climbed in small steps.
> The open source model is successful because it follows closely something that has worked for a long time: the scientific method, with its community contributions, peer review, open discussion, and progress mainly in small steps. Once basic capability is out there, we can twiddle with how to improve things behind the scenes.
In general, I can't disagree much with most of these. I'm happy for others to smack us when we are going away from this sort of process. Please do; it may be the only way we (and others) learn how to really do it. But we have released fairly frequently, if not with rpms. We do provide pretty good support as well. We have incorporated most of the code sent to us, and considered and implemented many feature requests and performance fixes. But the numarray core is not something one would casually change without spending some time understanding how it works; I suspect that is the biggest inhibitor to changes to the core. We are happy to work with others on it if they have the time to do so. If anyone feels we have discouraged people from contributing, please let me know (privately if you wish).
IS SCIPY THE WAY?
The recipe above sounds a lot like SciPy. SciPy began as a way to integrate the necessary add-ons to numeric for real work. It was supposed to test, document, and distribute everything together. I am aware that there are people who use it, but the numbers are small and they seem to be tightly connected to Enthought for support and application development. Enthought's focus seems to be on servicing its paying customers rather than on moving SciPy development along, and I fear they are building an installed customer base on interfaces that were not intended to be stable.
I don't feel this is fair to Enthought. It is not my impression that they have made any money off of the scipy distribution directly (Chaco is a different issue). As far as I can tell, the only benefit they've generally gotten from it is from the visibility of sponsoring it, and perhaps from their own use of a few of the tools they have included as part of it. I doubt that their own clients have driven its development in any significant way. I'd guess they have sunk far more money into scipy than they have gotten out of it. I don't want others to get the impression that it is the other way around. In fact, on a number of occasions I have heard users complain about the documentation; the standard response is "please help us improve it", with very little offered in return. They have gone the extra mile in soliciting contributions and help maintaining it. Perhaps it is part of my open source blind spot, but I have trouble seeing what else they could be doing to encourage others to contribute to scipy (besides paying them, which they have done as well!). The only thing I can think of is that because they are doing it, others feel that they don't need to. Perhaps there is a similar issue with numarray. I don't know.
So, I will raise the question: is SciPy the way? Rather than forking the plotting and numerical efforts from what SciPy is doing, should we not be creating a new effort to do what SciPy has so far not delivered? These are not rhetorical or leading questions. I don't know enough about the motivations, intentions, and resources of the folks at Enthought (and elsewhere) to know the answer. I do think that such a fork will occur unless SciPy's approach changes substantially. The way to decide is for us all to discuss the question openly on these lists, and for those willing to participate and contribute effort to declare so openly. I think all that is needed, either to help SciPy or replace it, is some leadership in the direction outlined above. I would be interested in hearing, perhaps from the folks at Enthought, alternative points of view. Why are there no packages for popular OSs for SciPy 0.2? Why are releases so infrequent? If the folks running the show at scipy.org disagree with many others on these lists, then perhaps those others would like to roll their own. Or, perhaps stable/testing/unstable releases of the whole package are in order.
I think the answer is simple. Supporting distributions of the software they have pulled into scipy is a hell of a lot of work; work that nobody is paying them for. It gives me the shivers to think of our taking on all they have for scipy.
HOW TO CONTRIBUTE?
Judging by the number of PhDs in sigs, there are a lot of researchers on this list. I'm one, and I know that our time for doing core development or providing the aforementioned leadership is very limited, if not zero. Later we will be in a much better position to contribute application software. However, there is a way we can contribute to the core effort even if we are not paid, and that is to put budget items in grant and project proposals to support the work of others. Those others could be either our own employees or subcontractors at places like Enthought or STScI. A handful of contributors would be all we'd need to support someone to produce OS packages and tutorial documentation (the stuff core developers find boring) for two releases a year.
By all means, if there is a groundswell of support for development, please let us know. Perry Greenfield
On Wednesday 21 January 2004 22:28, Perry Greenfield wrote:
contributed code as well). You have to remember that how easily contributions come depends on what the critical mass is for usefulness. For something like numarray or Numeric, that critical mass is quite large. Few are interested in contributing when it can do very little and an older package exists that can do more.
I also find it difficult in practice to move code from Numeric to Numarray. While the two packages coexist peacefully, any C module that depends on the C API must be compiled for one or the other. Having both available for comparative testing thus means having two separate Python installations. And even with two installations, there is only one PYTHONPATH setting, which makes development under these conditions quite a pain. If someone has found a way out of that, please tell me!
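For what it's worth, one partial workaround for the single-PYTHONPATH problem is to install each extension build into its own site directory and set PYTHONPATH per process rather than globally, so one interpreter can be pointed at either package a command at a time. Here is a minimal sketch; the directory names and the tiny `flavor.py` module are stand-ins for illustration, where a real setup would put Numeric and numarray builds in the two directories:

```python
# Sketch: select between two incompatible extension installs per-process by
# setting PYTHONPATH in the child environment, not in the login environment.
import os
import subprocess
import sys
import tempfile

root = tempfile.mkdtemp()
for name in ("numeric-site", "numarray-site"):
    site_dir = os.path.join(root, name)
    os.makedirs(site_dir)
    # Stand-in module; in practice these directories would hold the two
    # array packages compiled against the same interpreter.
    with open(os.path.join(site_dir, "flavor.py"), "w") as f:
        f.write(f'FLAVOR = "{name}"\n')

for name in ("numeric-site", "numarray-site"):
    env = dict(os.environ, PYTHONPATH=os.path.join(root, name))
    result = subprocess.run(
        [sys.executable, "-c", "import flavor; print(flavor.FLAVOR)"],
        env=env, capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip())
```

Each child process sees only the one site directory named in its environment, so comparative testing does not require two full Python installations, only two install prefixes for the extensions.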
many times in the past. Often consensus was hard to achieve. We tended to lean towards backward compatibility unless the change seemed really necessary. For type coercion and error handling, we thought it was. But I don't think we have tried to shield the decision-making process from the community. I do think the difficulty in achieving a sense of consensus is a problem.
I think you did well on this - but then, I happen to share your general philosophy ;-) Konrad. -- ------------------------------------------------------------------------------- Konrad Hinsen | E-Mail: hinsen@cnrs-orleans.fr Centre de Biophysique Moleculaire (CNRS) | Tel.: +33-2.38.25.56.24 Rue Charles Sadron | Fax: +33-2.38.63.15.17 45071 Orleans Cedex 2 | Deutsch/Esperanto/English/ France | Nederlands/Francais -------------------------------------------------------------------------------
Good thing Duke is beating Maryland as I read, otherwise, mail like this can make you grumpy. :-) Joe Harrington wrote:
This is a necessarily long post about the path to an open-source replacement for IDL and Matlab. While I have tried to be fair to those who have contributed much more than I have, I have also tried to be direct about what I see as some fairly fundamental problems in the way we're going about this. I've given it some section titles so you can navigate, but I hope that you will read the whole thing before posting a reply. I fear that this will offend some people, but please know that I value all your efforts, and offense is not my intent.
THE PAST VS. NOW
While there is significant and dedicated effort going into numeric/numarray/scipy, it's becoming clear that we are not progressing quickly toward a replacement for IDL and Matlab. I have great respect for all those contributing to the code base, but I think the present discussion indicates some deep problems. If we don't identify those problems (easy) and solve them (harder, but not impossible), we will continue not to have the solution so many people want. To be convinced that we are doing something wrong at a fundamental level, consider that Python was the clear choice for a replacement in 1996, when Paul Barrett and I ran a BoF at ADASS VI on interactive data analysis environments. That was over 7 years ago.
The effort has fallen short of the mark you set. I also wish the community was more efficient at pursuing this goal. There are fundamental issues. (1) The effort required is large. (2) Free time is in short supply. (3) Financial support is difficult to come by for library development. Other potential problems would be a lack of interest and a lack of competence. I do not think many of us suffer from the first. As for competence, the development team beyond the walls of Enthought self-selects in open source projects, so we're stuck with what we've got. I know most of the people and happen to think they are a talented bunch, so I'll consider us no worse than the average group of PhDs (some consider that a pretty low bar ...). I believe the tasks that go undone (multi-platform support, bi-yearly releases, documentation, etc.) are more due to (2) and (3) above than to some other deep (or shallow) issue. I guess another possibility is organization. This can be improved upon. Thanks to the gracious help of Cal Tech (CACR) and NCBR, the community has gathered at a low-cost SciPy workshop at Cal Tech the last couple of years. I believe this is a positive step. Adding this to the newsgroups and mailing lists provides us with a solid framework within which to operate. I still have confidence that we will reach the IDL/Matlab replacement point. We don't have the resources that those products have behind them. We do have a superior language, but without a lot of sweat and toiling at hours of grunt work, we don't stand a chance. As for Enthought's efforts, our success in building applications (scientific and otherwise) has diverted our developers (myself included) away from SciPy as the primary focus. We do continue to develop it and provide significant (for us) financial support to maintain it. I am lucky enough to work with a fine set of software engineers, and I am itching for us to get more time devoted to SciPy.
I do believe that we will get the opportunity in the future -- it is just a matter of time. Call me an optimist.
replace IDL or Matlab", the answer was clearly "stable interfaces to basic numerics and plotting; then we can build it from there following the open-source model". Work on both these problems was already well underway then. Now, both the numerical and plotting development efforts have branched. There is still no stable base upon which to build. There aren't even packages for popular OSs that people can install and play with. The problem is not that we don't know how to do numerics or graphics; if anything, we know these things too well. In 1996, if anyone had told us that in 2004 there would be no ready-to-go replacement system because of a factor of 4 in small array creation overhead (on computers that ran 100x as fast as those then available) or the lack of interactive editing of plots at video speeds, the response would not have been pretty. How would you have felt?
THE PROBLEM
We are not following the open-source development model. Rather, we pay lip service to it. Open source's development mantra is "release early, release often". This means release to the public, for use, a package that has core capability and reasonably-defined interfaces.
Release it in a way that as many people as possible will get it, install it, use it for real work, and contribute to it. Make the main focus of the core development team the evaluation and inclusion of contributions from others. Develop a common vision for the program, and use that vision to make decisions and keep efforts focused. Include contributing developers in decision making, but do make decisions and move on from them.
Instead, there are no packages for general distribution. The basic interfaces are unstable, and not even being publicly debated to decide among them (save for the past 3 days). The core developers seem to spend most of their time developing, mostly out of view of the potential user base. I am asked probably twice a week by different fellow astronomers when an open-source replacement for IDL will be available. They are mostly unaware that this effort even exists. However, this indicates that there are at least hundreds of potential contributors of application code in astronomy alone, as I don't know nearly everyone. The current efforts look rather more like the GNU project than Linux. I'm sorry if that hurts, but it is true.
Speaking from the standpoint of SciPy, all I can say is we've tried to do what you outline here. The effort of releasing the huge load of Fortran/C/C++/Python code across multiple platforms is difficult and takes many hours. I would venture that 90% of the effort on SciPy is with the build system. This means that the exact part of the process that you are discussing is the majority of the effort. We keep a version for Windows up to date because that is what our current clients use. In all the other categories, we do the best we can and ask others to fill the gaps. It is also worth saying that SciPy works quite well for most purposes once built -- we and others use it daily on commercial projects.
I know that Perry's group at STScI and the fine folks at Enthought will say they have to work on what they are being paid to work on. Both groups should consider the long term cost, in dollars, of spending those development dollars 100% on coding, rather than 50% on coding and 50% on outreach and intake. Linus himself has written only a small fraction of the Linux kernel, and almost none of the applications, yet in much less than 7 years Linux became a viable operating system, something much bigger than what we are attempting here. He couldn't have done that himself, for any amount of money. We all know this.
Elaborate on the outreach idea for me. We at Enthought (spend money to) provide funding to core developers outside of our company (Travis and Pearu), we (spend money to) give talks at many conferences a year, we (spend a little money to) co-sponsor a 70-person workshop on scientific computing every year, we have an open mailing list, we release most of the general software that we write, and in the past I practically begged people to have CVS write access when they provided a patch to SciPy. We even spent a lot of time early on trying to set up the scipy.org site as a collaborative Zope-based environment -- an effort that was largely a failure. Still, we have a functioning, largely static site, the mailing list, and CVS. As far as tools go, that should be sufficient. It is impossible to argue with the results, though. Linus pulled off the OS model, and Enthought and the SciPy community have, thus far, been less successful. If there are suggestions beyond "spend more *time* answering email," I am all ears. Time is the most precious commodity of all these days. Also, SciPy has only been around for 3+ years, so I guess we still have some rope left. I continue to believe it'll happen -- this seems like the perfect project for open source contributions.
THE PATH
Here is what I suggest:
1. We should identify the remaining open interface questions. Not, "why is numeric faster than numarray", but "what should the syntax of creating an array be, and of doing different basic operations". If numeric and numarray are in agreement on these issues, then we can move on, and debate performance and features later.
?? I don't get this one. This interface (at least for numarray) is largely decided. We have argued the points, and Perry et al. at STScI made the decisions. I didn't like some of them, and I'm sure everyone else had at least one thing they wished was changed, but that is the way this open stuff works. It is not the interface but the implementation that started this furor. Travis O.'s suggestion was to back port (much of) the numarray interface to the Numeric code base so that those stuck supporting large codebases (like SciPy) and needing fast small arrays could benefit from the interface enhancements. One or two of them had backward compatibility issues with Numeric, so he asked how it should be handled. Unless some magic porting fairy shows up, SciPy will be a Numeric-only tool for the next year or so. This means that users of SciPy either have to forgo some of these features or back port. On speed: <excerpt from private mail to Perry> Numeric is already too slow -- we've had to recode a number of routines in C that I don't think we should have in a recent project. For us, the goal is not to approach Numeric's speed but to significantly beat it for all array sizes. That has to be a possibility for any replacement. Otherwise, our needs (with the exception of a few features) are already better met by Numeric. I have some worries about all of the endianness and memory-mapped support that is built into Numarray imposing too much overhead for speed-ups on small arrays to be possible (this echoes Travis O.'s thoughts -- we will happily be proven wrong). None of our current work needs these features, and paying a price for them is hard to do with an alternative already there. It is fairly easy to improve its performance on mathematical operations by just changing the way the ufunc operations are coded. With some reasonably simple changes, Numeric should be comparable (or at least closer) to Numarray speed for large arrays.
Numeric also has a large number of other optimizations that can be made (memory is zeroed twice in zeros(), asarray was recently improved significantly for the typical case, etc.). Making these changes would help our selling of Python and, since we have at least a year's worth of applications that will be on the SciPy/Numeric platform, it will also help the quality of these applications. Oh yeah, I have also been surprised at how much of our code uses alltrue(), take(), isnan(), etc. The speed of these array manipulation methods is really important for us.
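Since the argument above turns on per-call overhead for small arrays, here is a sketch of how that overhead can be measured with the stdlib timeit module. A plain list and the stdlib `array` module stand in for Numeric and numarray (which may not be installed), so the absolute numbers are illustrative only, not a claim about either package:

```python
# Measure per-call creation cost for a "small array" of 10 doubles.
import timeit
from array import array

N = 50_000
values = [0.0] * 10  # source data for each creation call

# Time only the creation call itself, the quantity at issue for small arrays.
t_list = timeit.timeit(lambda: list(values), number=N)
t_arr = timeit.timeit(lambda: array('d', values), number=N)

print(f"list(values)       : {t_list / N * 1e6:.3f} us/call")
print(f"array('d', values) : {t_arr / N * 1e6:.3f} us/call")
print(f"ratio              : {t_arr / t_list:.2f}x")
```

Run on the two real packages, a harness like this is what would substantiate (or refute) a "factor of 4" creation-overhead claim on a given machine.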
2. We should identify what we need out of the core plotting capability. Again, not "chaco vs. pyxis", but the list of requirements (as an astronomer, I very much like Perry's list).
Yep, we obviously missed on this one. Chaco (and the related libraries) is extremely advanced in some areas but lags in ease-of-use. It is primarily written by a talented and experienced computer scientist (Dave Morrill) who likely does not have the perspective of an astronomer. It is clear that areas of the library need to be re-examined, simplified, and improved. Unfortunately, there is not time for us to do that right now, and the internals have proven too complex for others to contribute to in a meaningful way. I do not know when this will be addressed. The sad thing here is that STScI won't be using it. That pains me to no end, and Perry and I have tried to figure out some way to make it work for them. But it sounds like, at least in the short term, there will be two new additions to the plotting stable. We will work hard, though, to make the future Chaco solve STScI's problems (and everyone else's) better than it currently does. By the way, there is a lot of Chaco bashing going on. It is worth saying that we use Chaco every day in commercial applications that require complex graphics and heavy interactivity, with great success. But we also have mixed teams of scientists and computer scientists, along with the "U Manual" ("If I have a question, I ask you" -- "you" being Dave) to answer any questions. I continue to believe Chaco's Traits-based approach is the only one currently out there that has a chance of improving on Matlab and the other plotting packages available. And, while SciPy is moving slowly, Chaco is moving at a frantic development pace and gets new capabilities daily (which is part of the complaints about it). I feel certain in saying that it has more resources tied to its development than the other plotting options out there -- it is just currently being exercised in GUI environments instead of as a day-to-day plotting tool. My advice is dig in, learn Traits, and learn Chaco.
3. We should collect or implement a very minimal version of the featureset, and document it well enough that others like us can do simple but real tasks to try it out, without reading source code. That documentation should include lists of things that still need to be done.
4. We should release a stand-alone version of the whole thing in the formats most likely to be installed by users on the four most popular OSs: Linux, Windows, Mac, and Solaris. For Linux, this means .rpm and .deb files for Fedora Core 1 and Debian 3.0r2. Tarballs and CVS checkouts are right out. We have seen that nobody in the real world installs them. To be most portable and robust, it would make sense to include the Python interpreter, named such that it does not stomp on versions of Python in the released operating systems. Static linking likewise solves a host of problems and greatly reduces the number of package variants we will have to maintain.
5. We should advertise and advocate the result at conferences and elsewhere, being sure to label it what it is: a first-cut effort designed to do a few things well and serve as a platform for building on. We should also solicit and encourage people either to work on the included TODO lists or to contribute applications. One item on the TODO list should be code converters from IDL and Matlab to Python, and compatibility libraries.
6. We should then all continue to participate in the discussions and development efforts that appeal to us. We should keep in mind that evaluating and incorporating code that comes in is in the long run much more efficient than writing the universe ourselves.
7. We should cut and package new releases frequently, at least once every six months. It is better to delay a wanted feature by one release than to hold a release for a wanted feature. The mountain is climbed in small steps.
The open source model is successful because it follows closely something that has worked for a long time: the scientific method, with its community contributions, peer review, open discussion, and progress mainly in small steps. Once basic capability is out there, we can twiddle with how to improve things behind the scenes.
Everything here is great -- it is the implementation part that is hard. I am all for it happening though.
IS SCIPY THE WAY?
The recipe above sounds a lot like SciPy. SciPy began as a way to integrate the necessary add-ons to numeric for real work. It was supposed to test, document, and distribute everything together. I am aware that there are people who use it, but the numbers are small and they seem to be tightly connected to Enthought for support and application development.
Not so. The user base is not huge, but I would conservatively venture to say it is in the hundreds to thousands. We are a company of 12 without a single support contract for SciPy.
Enthought's focus seems to be on servicing its paying customers rather than on moving SciPy development along,
Continuing to move SciPy along at the pace we initially were would have ended Enthought -- something had to change. It is surprising how important paying customers are to a company.
and I fear they are building an installed customer base on interfaces that were not intended to be stable.
Not sure what you mean here, but I'm all for stable interfaces. Huge portions of SciPy's interface haven't changed, and I doubt they will change. I do indeed feel, though, that SciPy is still at a 0.2 release level, so some of the interfaces can change. It would be irresponsible to say otherwise. This is not "intentionally unstable" though...
So, I will raise the question: is SciPy the way? Rather than forking the plotting and numerical efforts from what SciPy is doing, should we not be creating a new effort to do what SciPy has so far not delivered? These are not rhetorical or leading questions. I don't know enough about the motivations, intentions,
Man, this sounds like an interview (or interrogation) question. Well, we're a company, so we do wish to make money -- otherwise, we'll have to do something else. We also care deeply about science and are passionate about scientific computing. Let's see, what else. We have made most of the things we do open source because we believe in it, in principle and as a good development philosophy. And, even though we all wish SciPy was moving faster, SciPy wouldn't be anywhere close to where it is without Travis Oliphant and Pearu Peterson -- neither of whom would have worked on it had it not been openly available. That alone validates the decision to make it open. I'm not sure what we have done to make someone question our "motivations and intentions" (sounds like a date interrogation), but it is hard to think of malicious ones when you are making the fruits of your labors and dollars freely available.
and resources of the
Well, we have 12 people, and Pearu and Travis O. work with us quite a bit also. The developers here are very good (if I do say so myself), but unfortunately primarily working on other projects at the moment. Besides scientists/computer scientists, we have a technical writer and a human-computer-interface specialist on staff.
folks at Enthought (and elsewhere) to know the answer. I do think that such a fork will occur unless SciPy's approach changes substantially.
Enthought has more commitments than we used to. SciPy remains important and core to what we do; it just has to share time with other things. Luckily, Pearu and Travis have kept their ears to the ground, helping out people on the mailing lists as well as working on the codebase. I'm not sure what our approach has been that would force a fork... It isn't like someone has come and asked to be release manager, offered to keep the web pages up to date, provided peer review of code, etc., and we have turned them away. Almost from the beginning, most of the effort has been provided by a small team (fairly standard for OS stuff). We have repeatedly pointed out areas where we need help, at the conference and in mail -- code reviews, build help, release help, etc. In fact, I double dare ya to ask to manage the next release or the documentation effort. Okay... triple dare ya. Some people have philosophical differences (like Konrad, I believe) with how SciPy is packaged and believe it should be 12 smaller packages instead of one large one. This has its own set of problems obviously, but forking based on this kind of principle would make at least a modicum of sense. Forking because you don't like the pace of the project makes zero sense. Pitch in and solve the problem. The social barriers are very small. The code barriers (build, etc.) are what need to be solved.
The way to decide is for us all to discuss the question openly on these lists, and for those willing to participate and contribute effort to declare so openly. I think all that is needed, either to help SciPy or replace it, is some leadership in the direction outlined above. I would be interested in hearing, perhaps from the folks at Enthought, alternative points of view. Why are there no packages for popular OSs for SciPy 0.2?
Please build them, ask for web credentials, and upload them. Then answer the questions people have about them on the mailing list. It is as simple as that. There is no magic here -- just work.
Why are releases so infrequent?
Ditto.
If the folks running the show at scipy.org disagree with many others on these lists, then perhaps those others would like to roll their own. Or, perhaps stable/testing/unstable releases of the whole package are in order.
HOW TO CONTRIBUTE?
Judging by the number of PhDs in sigs, there are a lot of researchers on this list. I'm one, and I know that our time for doing core development or providing the aforementioned leadership is very limited, if not zero.
Surprisingly, commercial developers have about the same amount of free time.
Later we will be in a much better position to contribute application software. However, there is a way we can contribute to the core effort even if we are not paid, and that is to put budget items in grant and project proposals to support the work of others.
For the academics, supporting a *dedicated* student to maintain SciPy would be a much more cost-effective use of your dollars. Unfortunately, it is hard to get a PhD for supporting SciPy... <begin shameless plugs that somehow seem appropriate here> For companies, national laboratories, etc., supporting development on SciPy (or numarray) directly is a great idea. Projects that we work on in other areas also indirectly support SciPy, Chaco, etc., so get us involved with the development efforts at your company/lab. Other options? Government (NASA, military, NIH, etc.) and national lab people can get SciPy/numarray/Python-related SBIR (http://www.acq.osd.mil/sadbu/sbir/) topics that would impact their research/development put on the solicitation list this summer. Email me if you have any questions on this. ASCI people can propose PathForward projects. There are probably numerous other ways to do this. We will have a GSA schedule soon, so government contracting will also work.
subcontractors at places like Enthought or STScI. A handful of contributors would be all we'd need to support someone to produce OS packages and tutorial documentation (the stuff core developers find boring) for two releases a year.
Joe, as you say, things haven't gone as fast as any of us would wish, but it hasn't been for lack of trying. Many of us have put zillions of hours into this. The results are actually quite stable tools. Many people use Numeric/Numarray/SciPy in daily work without problems. But, like Linux in the early years, they still require "geeks" willing to do some amount of meddling to use them. Huge resources (developer and financial) have been pumped into Linux to get it to the point it's at today. Anything we can do to increase participation in building tools and financially supporting those who do build tools, I am all for... I'd love to see releases on 10 platforms and full documentation for the libraries as much as the next person. Whew, and Duke managed to hang on and win. my .01 worth, eric
--jh--
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
As a relative newcomer to this discussion, I would like to respond on a couple of points. eric jones wrote:
Good thing Duke is beating Maryland as I read, otherwise, mail like this can make you grumpy. :-)
Joe Harrington wrote:
[snip]
THE PATH
Here is what I suggest:
1. We should identify the remaining open interface questions. Not, "why is numeric faster than numarray", but "what should the syntax of creating an array be, and of doing different basic operations". If numeric and numarray are in agreement on these issues, then we can move on, and debate performance and features later.
?? I don't get this one. This interface (at least for numarray) is largely decided. We have argued the points, and Perry et al. at STScI made the decisions. I didn't like some of them, and I'm sure everyone else had at least one thing they wished was changed, but that is the way this open stuff works.
I have wondered whether the desire to be compatible with Numeric has been an inhibitory factor for numarray. It might be interesting to see the list of decisions which Eric Jones doesn't like.
It is not the interface but the implementation that started this furor. Travis O.'s suggestion was to back-port (much of) the numarray interface to the Numeric code base, so that those stuck supporting large codebases (like SciPy) and needing fast small arrays could benefit from the interface enhancements. One or two of them had backward-compatibility issues with Numeric, so he asked how it should be handled. Unless some magic porting fairy shows up, SciPy will be a Numeric-only tool for the next year or so. This means that users of SciPy either have to forgo some of these features or back-port.
Back porting would appear, to this outsider, to be a regression. Is there no way of changing numarray so that it has the desired speed for small arrays?
On speed: <excerpt from private mail to Perry> Numeric is already too slow -- we've had to recode a number of routines in C that I don't think we should have in a recent project. For us, the goal is not to approach Numeric's speed but to significantly beat it for all array sizes. That has to be a possibility for any replacement. Otherwise, our needs (with the exception of a few features) are already better met by Numeric. I have some worries about all of the endianness and memory-mapped support that are built into numarray imposing too much overhead for speed-ups on small arrays to be possible (this echoes Travis O.'s thoughts -- we will happily be proven wrong). None of our current work needs these features, and paying a price for them is hard to do with an alternative already there. It is fairly easy to improve Numeric's performance on mathematical operations by just changing the way the ufunc operations are coded. With some reasonably simple changes, Numeric should be comparable (or at least closer) to numarray speed for large arrays. Numeric also has a large number of other optimizations that can be made (memory is zeroed twice in zeros(), asarray was recently improved significantly for the typical case, etc.). Making these changes would help our selling of Python and, since we have at least a year's worth of applications that will be on the SciPy/Numeric platform, it will also help the quality of these applications.
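The small-array overhead being argued about here can be made concrete with a rough timing sketch. This uses `timeit` and modern NumPy purely as a stand-in for the Numeric/numarray of this thread; the exact numbers are machine- and library-dependent, but the pattern (per-call overhead dominating tiny arrays, memory fill dominating large ones) is the point:

```python
import timeit
import numpy as np  # stand-in here for Numeric/numarray, which this thread predates

# Per-call setup cost dominates tiny-array creation, while raw memory
# fill dominates large-array creation -- so a library that adds fixed
# overhead per operation hurts small-array workloads most.
small = timeit.timeit(lambda: np.zeros(3), number=100_000)
large = timeit.timeit(lambda: np.zeros(1_000_000), number=100)
print(f"100k tiny zeros(3) calls:   {small:.3f}s")
print(f"100 large zeros(1e6) calls: {large:.3f}s")
```

Comparing the same loop under two array libraries is how a "factor of 4 in small array creation overhead" claim would be measured.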
Oh yeah, I have also been surprised at how much of our code uses alltrue(), take(), isnan(), etc. The speed of these array manipulation methods is really important for us.
I am surprised that alltrue() performance is a concern, but it should be easy to implement short-circuit evaluation so that False responses are, on average, handled more quickly. If Boolean arrays are significant in terms of the amount of computer time taken, should they be stored as bit arrays? Would there be a pay-off for the added complexity?
[snip]
3. We should collect or implement a very minimal version of the featureset, and document it well enough that others like us can do simple but real tasks to try it out, without reading source code. That documentation should include lists of things that still need to be done.
Does numarray not provide the basics?
[snip] The open source model is successful because it follows closely something that has worked for a long time: the scientific method, with its community contributions, peer review, open discussion, and progress mainly in small steps. Once basic capability is out there, we can twiddle with how to improve things behind the scenes.
Colin W.
Colin J. Williams writes:
I have wondered whether the desire to be compatible with Numeric has been an inhibitory factor for numarray. It might be interesting to see the list of decisions which Eric Jones doesn't like.
There weren't that many. The ones that I remember (and if Eric has time he can fill in the rest) were:

1) Default axis for operations. Some use the last and some use the first depending on context. Eric and Travis wanted a consistent rule (I believe always the last axis). I believe that scipy wraps Numeric so that it does just that (remember, the behavior of Numeric within scipy is not quite the same as that of the distributed Numeric -- correct me if I'm wrong).

2) Allowing complex comparisons. Since Python no longer allows these (and it is reasonable to question whether this was right, since complex numbers can now no longer be part of a generic Python sort), many felt that numarray should be consistent with Python. This isn't a big issue, since I had argued that those who wanted to do generic comparisons simply needed to cast as x.real, where the .real attribute is available for all types of arrays; using that would always work regardless of the type.

3) Having single-element indexing return a rank-0 array rather than a Python scalar. Numeric is quite inconsistent in this regard now. We decided to have numarray always return Python scalars (exceptions may be made if Float128 is supported). The argument for rank-0 arrays was that they would support generic programming, so that one didn't need to test for the kind of value in many functions (i.e., scalar or array). But the issue of contention was that Eric argued that len(rank-0) == 1 and that (rank-0)[0] give the value, neither of which is correct according to the strict definition of rank-0. We argued that rank-1, length-1 arrays were really what was needed for that kind of programming. It turned out that the most common need was for the result of reduction operations, so we provided a version of reduce (areduce) which always returns an array result even if the array is 1-d (the result being a length-1, rank-1 array).

There are others, but I don't recall them immediately.
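The rank-0-versus-scalar dispute in point 3 can be illustrated with modern NumPy, used here as a stand-in (numarray's actual choices differed in detail, as Perry describes):

```python
import numpy as np  # illustrating with modern NumPy; numarray's behavior differed

x = np.arange(5)
elem = x[2]               # single-element indexing yields a scalar-like value
r0 = np.array(10)         # an explicit rank-0 array
print(r0.ndim, r0.shape)  # 0 ()

# A strict rank-0 array has no length -- the crux of the len(rank-0) dispute:
try:
    len(r0)
except TypeError:
    print("len() of a rank-0 array is undefined")

# The rank-1, length-1 alternative Perry describes (cf. areduce):
s = np.sum(x, keepdims=True)
print(s.shape)            # (1,) -- an array result even for a full reduction
print(len(s), s[0])       # 1 10 -- len and [0] both work, as Eric wanted
```

The `keepdims` idiom here plays the role of numarray's proposed `areduce`: reductions that always hand back an array, so generic code never has to branch on scalar-versus-array.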
It is not the interface but the implementation that started this furor. Travis O.'s suggestion was to back-port (much of) the numarray interface to the Numeric code base, so that those stuck supporting large codebases (like SciPy) and needing fast small arrays could benefit from the interface enhancements. One or two of them had backward-compatibility issues with Numeric, so he asked how it should be handled. Unless some magic porting fairy shows up, SciPy will be a Numeric-only tool for the next year or so. This means that users of SciPy either have to forgo some of these features or back-port.
Back porting would appear, to this outsider, to be a regression. Is there no way of changing numarray so that it has the desired speed for small arrays?
If it must be faster than Numeric, I do wonder if that is easily done without greatly complicating the code.
I am surprised that alltrue() performance is a concern, but it should be easy to implement short-circuit evaluation so that False responses are, on average, handled more quickly. If Boolean arrays are significant in terms of the amount of computer time taken, should they be stored as bit arrays? Would there be a pay-off for the added complexity?
Making alltrue fast in numarray would not be hard. Just some work writing a special-purpose function to short-circuit. I doubt very much bit arrays would be much faster. They would also greatly complicate the code base. It is possible to add them, but I've always felt the reason would be to save memory, not increase speed. They haven't been high priority for us.
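The short-circuit idea Perry sketches can be written as a small chunked helper. This is a hypothetical illustration, not numarray's API, using modern NumPy as a stand-in; the same approach applies to any array library:

```python
import numpy as np  # modern stand-in; the same idea applies to Numeric/numarray

def alltrue_shortcircuit(a, chunk=4096):
    """Hypothetical chunked alltrue: a False early in the array returns
    without scanning the remaining elements, so the average cost for
    mostly-False data is far below a full O(n) pass."""
    flat = np.ravel(a)
    for start in range(0, flat.size, chunk):
        if not flat[start:start + chunk].all():
            return False
    return True

a = np.ones(1_000_000, dtype=bool)
a[10] = False  # early False: only the first chunk is ever examined
print(alltrue_shortcircuit(a))                       # → False
print(alltrue_shortcircuit(np.ones(8, dtype=bool)))  # → True
```

The chunk size trades per-chunk Python overhead against wasted scanning past the first False; a C implementation inside the library could short-circuit per element with no such tradeoff.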
Perry Greenfield
On Thu, 22 Jan 2004, eric jones wrote:
The effort has fallen short of the mark you set. I also wish the community was more efficient at pursuing this goal. There are fundamental issues. (1) The effort required is large. (2) Free time is in short supply. (3) Financial support is difficult to come by for library development.
(4) There is no itch to scratch. Matlab is somewhere around $20,000 (base plus a couple of toolboxes) per year for corporations, and something like $500 (or less) for registered students. All of the signal-processing packages and such are written for Matlab. The time cost of learning a new tool (Python + SciPy + Numeric/numarray) far exceeds the base prices for the average company or person. However, some companies have to deliver an end product with Matlab embedded. This is *extremely* undesirable; consequently, they are likely to create add-ons and extend the Python interface. However, the progress will likely be slow.
Speaking from the standpoint of SciPy, all I can say is we've tried to do what you outline here. The effort of releasing the huge load of Fortran/C/C++/Python code across multiple platforms is difficult and takes many hours.
And since SciPy is mostly Windows, the users expect that one click installs the universe. Good for customer experience. Bad for maintainability, which would really like to have independently maintained packages with hard APIs surrounding them.
On speed: <excerpt from private mail to Perry> Numeric is already too slow -- we've had to recode a number of routines in C that I don't think we should have in a recent project.
Then the idea of optimizing numarray is DOA. The best you are going to get is a constant factor speedup in return for vastly complicating maintainability. That's not a good tradeoff for a multi-year open-source project.
Oh yeah, I have also been surprised at how much of out code uses alltrue(), take(), isnan(), etc. The speed of these array manipulation methods is really important for us.
That seems ... odd. Scanning an array rather than handling a NaN trap seems like an awful tradeoff (i.e., an O(n) operation repeated every time, rather than an O(1) operation activated only on NaN generation -- a rare occurrence normally).
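The two approaches Andrew contrasts can both be demonstrated with modern NumPy (again only a stand-in for the Numeric-era tools under discussion): scanning after the fact with isnan(), versus raising at the moment an invalid operation produces the NaN:

```python
import numpy as np  # modern stand-in for the Numeric-era tools under discussion

x = np.array([1.0, np.nan, 3.0])

# The O(n) post-hoc scan being criticized: every call pays full cost,
# NaN present or not.
print(np.isnan(x).any())  # → True

# The trap alternative: the invalid operation raises where the NaN is
# generated, so well-behaved data pays nothing per check.
with np.errstate(invalid='raise'):
    try:
        np.array(0.0) / np.array(0.0)  # 0/0 is invalid → FloatingPointError
    except FloatingPointError:
        print("NaN trapped at generation time")
```

The trap relies on the platform's floating-point exception machinery, which is why (as the thread's era suggests) it was not uniformly available across the systems these libraries targeted.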
-- code reviews, build help, release help, etc. In fact, I double dare ya to ask to manage the next release or the documentation effort. okay... triple dare ya.
Shades of, "Take my wife ... please!" ;) -a
Andrew P. Lentvorski, Jr. wrote:
On Thu, 22 Jan 2004, eric jones wrote:
Speaking from the standpoint of SciPy, all I can say is we've tried to do what you outline here. The effort of releasing the huge load of Fortran/C/C++/Python code across multiple platforms is difficult and takes many hours.
And since SciPy is mostly Windows, the users expect that one click installs the universe. Good for customer experience. Bad for maintainability, which would really like to have independently maintained packages with hard APIs surrounding them.
What in the world does this mean? SciPy is "mostly Windows"? Yes, there is only a binary installer for Windows available currently. But how does that make this statement true? For me, SciPy has always been used almost exclusively on Linux. In fact, the best plotting support for SciPy (in my mind) is xplt (pygist-based), and it works best on Linux. -Travis
On Thu, 22 Jan 2004, Travis E. Oliphant wrote:
What in the world does this mean? SciPy is "mostly Windows"? Yes, there is only a binary installer for Windows available currently. But how does that make this statement true?
For me SciPy has always been used almost exclusively on Linux. In fact, the best plotting support for SciPy (in my mind) is xplt (pygist-based) and it works best on Linux.
I was referring to the installers, but I apparently did a thinko and omitted the reference. My apologies. I did not mean to imply that SciPy runs only on Windows, especially since I run it on FreeBSD. My intent was to comment on Win32 having a "one big lump" installer philosophy vs. the Linux "discrete packages" philosophy, and the impact of each on maintainability; i.e., the fact that releases suck up so much energy because of the need to integrate large chunks of code outside of SciPy itself. -a
participants (7)
- Andrew P. Lentvorski, Jr.
- Colin J. Williams
- eric jones
- Joe Harrington
- Konrad Hinsen
- Perry Greenfield
- Travis E. Oliphant