Re: [Numpy-discussion] Speeding up numarray -- questions on its design
Travis Oliphant wrote:
I have some comments based on perusing its source. I don't want to seem overly critical, so please take my comments with the understanding that I appreciate the extensive work that has gone into Numarray. I do think that Numarray has made some great strides. I would really like to see a unification of Numeric and Numarray.
1) Are there plans to move the nd array entirely into C? -- I would like to see the nd array become purely a c-type. I would be willing to help here. I can see that part of the work has been done.
I don't know that I would say they are definite, but I think that at some point we thought that would be necessary. We haven't yet, since doing so makes it harder to change, so it would be one of the last changes to the core that we would want to do. Our current priorities are towards making all the major libraries and packages available under it first and then finishing optimization issues (another issue that has to be tackled soon is handling 64-bit addressing; apparently the work to make Python sequences use 64-bit addresses is nearing completion so we want to be able to handle that). I expect we would want to make sure we find a way of handling that before we turn it all into C but maybe it is just as easy doing them in the opposite order.
2) Why is the ND array C-structure so large? Why are the dimensions and strides array static? Why can't the extra stuff that the fancy arrays need be another structure and the numarray C structure just extended with a pointer to the extra stuff?
When Todd moved NDArray into C, he tried to keep it simple. As such, it has no "moving parts." We think making dimensions and strides malloc'ed rather than static would be fairly easy. Making the "extra stuff" variable is something we can look at. The bottom line is that adding the variability adds complexity and we're not sure we understand the storage economics of why we would be doing it. Numarray was designed, first and foremost, for large arrays. For that case, the array struct size is irrelevant whereas additional complexity is not. I guess we would like to see some good practical examples where the array struct size matters. Do you have code with hundreds of thousands of small arrays existing simultaneously?
3) There seem to be too many files to define the array. The mixture of Python and C makes trying to understand the source very difficult. I thought one of the reasons for the re-write was to simplify the source code.
I think this reflects the transitional nature of going from mostly Python to a hybrid. We agree that the current state is more convoluted than it ought to be. If NDarray were all C, much of this would end (though in some respects, being all in C will make it larger, harder to understand as well). The original hope was that most of the array setup computation could be kept in Python but that is what made it slow for small arrays (but it did allow us to implement it reasonably quickly with big array performance so that we could start using it for our own projects without a long development effort). Unfortunately, the simplification in the rewrite is offset by handling the more complex cases (byte-swapping, etc.) and extra array indexing capabilities.
4) Object arrays must be supported. This was a bad oversight and an important feature of Numeric arrays.
The current implementation does support them (though in a different way, and generally not as efficiently, though Todd is more up on the details here). What aspect of object arrays are you finding lacking? C-api?
5) The ufunc code interface needs to continue to be improved. I do see that some effort into understanding the old ufunc interface has taken place which is a good sign.
You are probably referring to work underway to integrate with scipy (I'm assuming you are looking at the version in CVS).
Again, thanks to the work that has been done. I'm really interested to see if some of these modifications can be done as in my mind it will help the process of unifying the two camps.
I'm glad to see that you are taking a look at it and welcome the comments and any offers of help in improving speed. Perry
Hi all, just some comments from the sidelines, while I applaud the fact that we are moving towards a successful numeric/numarray integration. Perry Greenfield wrote:
the array struct size matters. Do you have code with hundreds of thousands of small arrays existing simultaneously?
I do have code with perhaps ~100k 'small' arrays (12x12x12 or so) in existence simultaneously, plus a few million created temporarily as part of the calculations. Needless to say, this uses Numeric :) What's been so nice about Numeric is that even with my innermost loops (carefully) coded in python, I get very acceptable performance for real-world problems. Perry and I had this conversation over at scipy'04, so this is just a reminder.
The Blitz++ project has faced similar problems of performance for their very flexible arrays classes, and their approach has been to have separate TinyVector/TinyMatrix classes. These offer almost none of the fancier features of the default Blitz Arrays, but they keep the same syntactic behavior and identical semantics where applicable. What they give up in flexibility, they gain in performance.
I realize this requires a substantial amount of work, but perhaps it will be worthwhile in the long run. It would be great to have a numarray small_array() object which would not allow byteswapping, memory-mapping, or any of the extra features which make them memory and time consuming, but which would maintain compatibility with the regular arrays as far as arithmetic operators and ufunc application (including obviously lapack/blas/f2py usage).
I know I am talking from 50,000 feet up, so I'm sure once you get down to the details this will probably not be easy (I can already see difficulties with the size of the underlying C structures for C API compatibility). But in the end, I think something like this might be the only way to satisfy all the disparate usage cases for numerical arrays in scientific computing. Besides the advanced features disparity, a simple set of guidelines for the crossover points in terms of performance would allow users to choose in their own codes what to use.
At any rate, I'm extremely happy to see scipy/numarray integration moving forward. My thanks to all those who are actually doing the hard work.
Regards, f
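As a rough back-of-the-envelope check on the "storage economics" question, the sketch below estimates what fraction of total memory the per-array struct would take for the workload described in this message. The 8-byte element size (double precision) is an assumption, and the 4-byte and 200-byte overheads are the figures quoted elsewhere in this thread, not values measured from numarray or Numeric. The function name is made up for illustration.

```python
def struct_overhead_fraction(n_arrays, shape, itemsize, struct_bytes):
    """Return the fraction of total memory spent on per-array struct overhead."""
    n_elements = 1
    for extent in shape:
        n_elements *= extent
    data_bytes = n_arrays * n_elements * itemsize
    overhead_bytes = n_arrays * struct_bytes
    return overhead_bytes / (data_bytes + overhead_bytes)


# Assumed workload: 100,000 arrays of 12x12x12 doubles (8 bytes per element).
for struct_bytes in (4, 200):
    frac = struct_overhead_fraction(100_000, (12, 12, 12), 8, struct_bytes)
    print(f"{struct_bytes:>3} extra bytes per array: {frac:.2%} of total memory")
```

This only covers memory. It says nothing about array-creation time, which other posters in the thread name as the main cost for small arrays, and the fraction grows quickly as the arrays shrink (try a shape of `(3,)`).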
Hi all, This discussion has brought up a question I have had for a while: Can anyone provide a one-paragraph description of what numarray does that gives it better large-array performance than Numeric? By the way, for what it's worth, what's kept me from switching is the small array performance, and/or the array-creation performance. I don't use very large arrays, but I do use small ones all the time.
thanks, -Chris
--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT         (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
On Mon, 2005-01-17 at 11:12, Perry Greenfield wrote:
Travis Oliphant wrote:
3) There seem to be too many files to define the array. The mixture of Python and C makes trying to understand the source very difficult. I thought one of the reasons for the re-write was to simplify the source code.
I think this reflects the transitional nature of going from mostly Python to a hybrid. We agree that the current state is more convoluted than it ought to be. If NDarray were all C, much of this would end (though in some respects, being all in C will make it larger, harder to understand as well). The original hope was that most of the array setup computation could be kept in Python but that is what made it slow for small arrays (but it did allow us to implement it reasonably quickly with big array performance so that we could start using it for our own projects without a long development effort). Unfortunately, the simplification in the rewrite is offset by handling the more complex cases (byte-swapping, etc.) and extra array indexing capabilities.
I took a cursory look at the C API the other day and learned about this capability to process byte-swapped data. I am wondering why this is a good thing to have. Wouldn't it be enough and much easier to drop this feature and instead equip numarray IO routines with the capability to convert to and from a foreign endian to the host endian encoding? ralf
Ralf Juengling wrote:
I took a cursory look at the C API the other day and learned about this capability to process byte-swapped data. I am wondering why this is a good thing to have. Wouldn't it be enough and much easier to drop this feature and instead equip numarray IO routines with the capability to convert to and from a foreign endian to the host endian encoding?
Basically this feature was to allow use of memory mapped data that didn't use the native representation of the processor (also related to supporting record arrays). The details are given in a paper a couple years ago: http://www.stsci.edu/resources/software_hardware/numarray/papers/pycon2003.pdf
Perry
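To make the memory-mapping argument concrete, here is a minimal sketch using only the Python standard library (not numarray's implementation). It maps a file of big-endian 32-bit integers and decodes elements on access, so the file is never rewritten into host byte order. The helper names and the sample values are made up for illustration.

```python
import mmap
import struct
import tempfile


def read_be_int32(buf, index):
    """Decode the index-th big-endian int32 directly from a buffer."""
    return struct.unpack_from(">i", buf, index * 4)[0]


def demo():
    values = [1, -2, 300000]
    with tempfile.TemporaryFile() as f:
        # Write the data in big-endian order, as a "foreign" file might be.
        f.write(struct.pack(">3i", *values))
        f.flush()
        # Map the file read-only; no up-front conversion pass is needed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [read_be_int32(mm, i) for i in range(len(values))]


print(demo())
```

Converting in the IO routines, as suggested in the question, would instead require reading and rewriting the whole buffer before use, which defeats the purpose of mapping a large file.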
Thanks for the comments that have been made. One of my reasons for commenting is to get an understanding of which design issues of Numarray are felt to be important and which can change. There seems to be this idea that small arrays are not worth supporting. I hope this is just due to time-constraints and not some fundamental idea that small arrays should never be considered with Numarray. Otherwise, there will always be two different array implementations developing at their own pace. I really want to gauge how willing developers of numarray are to changing things. Perry Greenfield wrote:
1) Are there plans to move the nd array entirely into C? -- I would like to see the nd array become purely a c-type. I would be willing to help here. I can see that part of the work has been done.
I don't know that I would say they are definite, but I think that at some point we thought that would be necessary. We haven't yet, since doing so makes it harder to change, so it would be one of the last changes to the core that we would want to do. Our current priorities are towards making all the major libraries and packages available under it first and then finishing optimization issues (another issue that has to be tackled soon is handling 64-bit addressing; apparently the work to make Python sequences use 64-bit addresses is nearing completion so we want to be able to handle that). I expect we would want to make sure we find a way of handling that before we turn it all into C but maybe it is just as easy doing them in the opposite order.
I do not think it would be difficult at this point to move it all to C and then make future changes there (you can always call pure Python code from C). With the structure in place and some experience behind you, now seems like as good a time as any. Especially, because now is a better time for me than any... I like what numarray is doing by not always defaulting to ints with the maybelong type. It is a good idea.
2) Why is the ND array C-structure so large? Why are the dimensions and strides array static? Why can't the extra stuff that the fancy arrays need be another structure and the numarray C structure just extended with a pointer to the extra stuff?
When Todd moved NDArray into C, he tried to keep it simple. As such, it has no "moving parts." We think making dimensions and strides malloc'ed rather than static would be fairly easy. Making the "extra stuff" variable is something we can look at.
But allocating dimensions and strides when needed is not difficult and it reduces the overhead of the ndarray object. Currently, that overhead seems extreme. I could be over-reacting here, but it just seems like it would have made more sense to expand the array object as little as possible to handle the complexity that you were searching for. It seems like more modifications were needed in the ufunc than in the arrayobject.
The bottom line is that adding the variability adds complexity and we're not sure we understand the storage economics of why we would be doing it. Numarray was designed, first and foremost, for large arrays.
Small arrays are never going to disappear (Fernando Perez has an excellent example) and there are others. A design where a single pointer not being NULL is all that is needed to distinguish "simple" Numeric-like arrays from "fancy" numarray-like arrays seems like a great way to make sure that
For that case, the array struct size is irrelevant whereas additional complexity is not. I guess we would like to see some good practical examples where the array struct size matters. Do you have code with hundreds of thousands of small arrays existing simultaneously?
As mentioned before, such code exists especially when arrays become a basic datatype that you use all the time. How much complexity is really generated by offloading the extra struct material to a bigarray structure, thereby only increasing the Numeric array structure by 4 bytes instead of 200+? On another fundamental note, numarray is being sold as a replacement for Numeric. But, then, on closer inspection many things that Numeric does well, numarray is ignoring or not doing very well. I think this presents a certain amount of false advertising to new users, who don't understand the history. Most of them would probably never need the fanciness that numarray provides and would be quite satisfied with Numeric. They just want to know what others are using. I think it is a disservice to call numarray a replacement for Numeric until it actually is. It should currently be called an "alternative implementation" focused on large arrays. This (unintentional) sleight of hand that has been occurring over the past year has been my biggest complaint with numarray. Making numarray a replacement for Numeric means that it has to support small arrays, object arrays, and ufuncs at least as well as but preferably better than Numeric. It should also be faster than Numeric whenever possible, because Numeric has lots of potential optimizations that have never been applied. If numarray does not do these things, then in my mind it cannot be a replacement for Numeric and should stop being called that on the numpy web site.
3) There seem to be too many files to define the array. The mixture of Python and C makes trying to understand the source very difficult. I thought one of the reasons for the re-write was to simplify the source code.
I think this reflects the transitional nature of going from mostly Python to a hybrid. We agree that the current state is more convoluted than it ought to be. If NDarray were all C, much of this would end (though in some respects, being all in C will make it larger, harder to understand as well). The original hope was that most of the array setup computation could be kept in Python but that is what made it slow for small arrays (but it did allow us to implement it reasonably quickly with big array performance so that we could start using it for our own projects without a long development effort). Unfortunately, the simplification in the rewrite is offset by handling the more complex cases (byte-swapping, etc.) and extra array indexing capabilities.
I never really understood the "code is too complicated" argument anyway. I was just wondering if there is some support for reducing the number of source code files, or reorganizing them a bit.
4) Object arrays must be supported. This was a bad oversight and an important feature of Numeric arrays.
The current implementation does support them (though in a different way, and generally not as efficiently, though Todd is more up on the details here). What aspect of object arrays are you finding lacking? C-api?
I did not see such support when I looked at it, but given the previous comment, I could easily have missed where that support is provided. I'm mainly following up on Konrad's comment that his Automatic differentiation does not work with Numarray because of the missing support for object arrays. There are other applications for object arrays as well. Most of the support needs to come from the ufunc side.
5) The ufunc code interface needs to continue to be improved. I do see that some effort into understanding the old ufunc interface has taken place which is a good sign.
You are probably referring to work underway to integrate with scipy (I'm assuming you are looking at the version in CVS).
Yes, I'm looking at the CVS version.
Again, thanks to the work that has been done. I'm really interested to see if some of these modifications can be done as in my mind it will help the process of unifying the two camps.
I'm glad to see that you are taking a look at it and welcome the comments and any offers of help in improving speed.
I would be interested in helping if there is support for really making numarray a real replacement for Numeric, by addressing the concerns that I've outlined. As stated at the beginning, I'm really just looking for how receptive numarray developers would be to the kinds of changes I'm talking about: (1) reducing the size of the array structure, (2) moving the ndarray entirely into C, (3) improving support for object arrays, (4) improving ufunc API support. I care less about array and ufunc C-API names being the same than the underlying capabilities being available. Best regards, -Travis Oliphant
I haven't followed this discussion in detail but with respect to space for 'descriptors', it would simply be foolish to malloc space for these. The cost is ridiculous. You simply have to decide how big a number of dimensions to allow, make it a clearly findable definition in the sources, and dimension everything that big. Originally when we discussed this we considered 7, since that had been (and for all I know still is) the maximum array dimension in Fortran. But Jim Hugunin needed 11 or something like it for his imaging. I've seen 40 in the numarray sources I think. It seems to me that an application that would care about this space (it being, after all, per array object) would be unusual indeed. If I've misunderstood what you're talking about, never mind. (:-> My advice is to make flexibility secondary to performance. It is always possible to layer on flexibility for those who want it.
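The space side of this trade-off can be illustrated with a small ctypes sketch. The struct layouts below are hypothetical (field names and 64-bit integer widths are assumptions, not numarray's or Numeric's actual definitions). The sketch compares statically dimensioned `dimensions`/`strides` arrays at the limits mentioned in the thread (7, 11, 40) with a layout that holds pointers to separately allocated arrays.

```python
import ctypes


def make_array_struct(maxdim):
    """Build a struct type with statically sized dimensions/strides arrays."""

    class ArrayStruct(ctypes.Structure):
        _fields_ = [
            ("nd", ctypes.c_int64),
            ("dimensions", ctypes.c_int64 * maxdim),
            ("strides", ctypes.c_int64 * maxdim),
        ]

    return ArrayStruct


class SlimArrayStruct(ctypes.Structure):
    """Alternative layout: dimensions and strides are allocated separately."""

    _fields_ = [
        ("nd", ctypes.c_int64),
        ("dimensions", ctypes.POINTER(ctypes.c_int64)),
        ("strides", ctypes.POINTER(ctypes.c_int64)),
    ]


for maxdim in (7, 11, 40):
    print("static, maxdim =", maxdim, "->", ctypes.sizeof(make_array_struct(maxdim)), "bytes")
print("pointer-based ->", ctypes.sizeof(SlimArrayStruct), "bytes (plus the separate allocations)")
```

The pointer-based struct is smaller, but each array then needs extra allocations at creation time, which is the cost this message warns about.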
Paul Dubois wrote:
I haven't followed this discussion in detail but with respect to space for 'descriptors', it would simply be foolish to malloc space for these. The cost is ridiculous. You simply have to decide how big a number of dimensions to allow, make it a clearly findable definition in the sources, and dimension everything that big.
Thanks for this comment. I can see now that it makes sense as it would presumably speed up small array creation. Why was this not done in the original sources?
Originally when we discussed this we considered 7, since that had been (and for all I know still is) the maximum array dimension in Fortran. But Jim Hugunin needed 11 or something like it for his imaging. I've seen 40 in the numarray sources I think.
It seems to me that an application that would care about this space (it being, after all, per array object) would be unusual indeed.
If I've misunderstood what you're talking about, never mind. (:->
I think you've understood this part of it and have given good advice.
My advice is to make flexibility secondary to performance. It is always possible to layer on flexibility for those who want it.
I like this attitude. -Travis
Paul Dubois wrote:
I haven't followed this discussion in detail but with respect to space for 'descriptors', it would simply be foolish to malloc space for these. The cost is ridiculous. You simply have to decide how big a number of dimensions to allow, make it a clearly findable definition in the sources, and dimension everything that big.
Originally when we discussed this we considered 7, since that had been (and for all I know still is) the maximum array dimension in Fortran. But Jim Hugunin needed 11 or something like it for his imaging. I've seen 40 in the numarray sources I think.
Actually, 40 came from Numeric. It may have been reduced to 11, but I'm sure it was 40 at one point. Jim even had a comment in the code to the effect that if someone needed more than 40, he wanted to see the problem that needed that. If people think it is too high, I'd be very happy to reduce it. Perry
Travis Oliphant wrote:
4) Object arrays must be supported. This was a bad oversight and an important feature of Numeric arrays.
The current implementation does support them (though in a different way, and generally not as efficiently, though Todd is more up on the details here). What aspect of object arrays are you finding lacking? C-api?
I did not see such support when I looked at it, but given the previous comment, I could easily have missed where that support is provided. I'm mainly following up on Konrad's comment that his Automatic differentiation does not work with Numarray because of the missing support for object arrays. There are other applications for object arrays as well. Most of the support needs to come from the ufunc side.
It's tucked away in numarray.objects. Unfortunately for Konrad's application, numarray ufuncs don't recognize that it's being passed an object with the special methods defined, and they won't automatically create 0-D object "arrays". 0-D object arrays will work just fine when using operators (x+y works), but not when explicitly calling the ufuncs (add(x,y) does not work). Both methods work fine for 0-D numerical arrays. -- Robert Kern rkern@ucsd.edu "In the fields of hell where the grass grows high Are the graves of dreams allowed to die." -- Richard Harter
Robert Kern <rkern@ucsd.edu> writes:
Travis Oliphant wrote:
4) Object arrays must be supported. This was a bad oversight and an important feature of Numeric arrays.
The current implementation does support them (though in a different way, and generally not as efficiently, though Todd is more up on the details here). What aspect of object arrays are you finding lacking? C-api?
I did not see such support when I looked at it, but given the previous comment, I could easily have missed where that support is provided. I'm mainly following up on Konrad's comment that his Automatic differentiation does not work with Numarray because of the missing support for object arrays. There are other applications for object arrays as well. Most of the support needs to come from the ufunc side.
It's tucked away in numarray.objects. Unfortunately for Konrad's application, numarray ufuncs don't recognize that it's being passed an object with the special methods defined, and they won't automatically create 0-D object "arrays". 0-D object arrays will work just fine when using operators (x+y works), but not when explicitly calling the ufuncs (add(x,y) does not work). Both methods work fine for 0-D numerical arrays.
Are the 0-D object arrays necessary for this? The behaviour that Konrad needs is this (highly abstracted):

    class A:
        def __add__(self, other):
            return 0.1
        def sin(self):
            return 0.5

Then:

    >>> a = A()
    >>> a + a
    0.10000000000000001
    >>> Numeric.add(a,a)
    0.10000000000000001
    >>> Numeric.sin(a)
    0.5

The Numeric ufuncs, if the argument isn't an array, look for a method of the right name (here, sin) on the object, and call that. You could define a delegate class that does this with something like

    class MathFunctionDelegate:
        def __init__(self, fallback=Numeric):
            self._fallback = fallback
        def add(self, a, b):
            try:
                return a + b
            except TypeError:
                return self._fallback.add(a, b)
        def sin(self, x):
            # look for a bound 'sin' method on the object itself
            sin = getattr(x, 'sin', None)
            if sin is None:
                return self._fallback.sin(x)
            else:
                return sin()
        ... etc. ...

(This could be a module, too. This just allows parameterisation.) In ScientificPython, FirstDerivatives.py has a method of the DerivVar class that looks like this:

    def sin(self):
        v = Numeric.sin(self.value)
        d = Numeric.cos(self.value)
        return DerivVar(v, map(lambda x,f=d: f*x, self.deriv))

Add something like this to the __init__:

    self._mathfuncs = MathFunctionDelegate(Numeric)

and that sin method becomes

    def sin(self):
        v = self._mathfuncs.sin(self.value)
        d = self._mathfuncs.cos(self.value)
        return DerivVar(v, map(lambda x,f=d: f*x, self.deriv))

That's not quite perfect, as the user has to use a mathfuncs object also; that's why having Numeric or numarray do the delegation automatically is nice. This would work equally well with numarray (or the math or cmath modules!) replacing Numeric. You could get fancy and be polymorphic: choose the right module to use depending on the type of the argument (Numeric arrays use Numeric, floats use math, etc.). If this was a module instead, you could have registration of types. I'll call this module numpy.
Here's a possible (low-level) usage:

    import numpy
    import Numeric, numarray, math, cmath
    from Scientific.Functions import Derivatives
    numpy.register_type(Numeric.arraytype, Numeric)
    numpy.register_type(numarray.NumArray, numarray)
    numpy.register_type(float, math)
    numpy.register_type(complex, cmath)
    numpy.register_type(Derivatives.DerivVar, Derivatives.derivate_math)
    numpy.default_constructor(numarray.array)

    a = numpy.array([1,2,3])      # makes a numarray
    b = Numeric.array([1,2,3])    # Numeric array
    print numpy.sin(a), numpy.sin(b)

Things to consider with this would be:
* how to handle a + b
* where should the registering of types be done? (Probably by the packages themselves)
* more complex predicates for registering handlers? (to handle subclasses, etc.)
etc.

Ok, I hope that's not too rambling. But the idea is that neither Numeric nor numarray need to provide the delegation ability.
--
David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/
cookedm@physics.mcmaster.ca
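The registry idea in this message can be tried out with a minimal sketch restricted to the standard `math` and `cmath` modules and written for current Python 3. The names `register_type`, `_provider_for` and `_registry` are illustrative assumptions, not an existing API, and only `sin` is wired up.

```python
import cmath
import math

# List of (type, provider) pairs, checked in registration order.
_registry = []


def register_type(typ, provider):
    """Associate a type with an object that provides math functions for it."""
    _registry.append((typ, provider))


def _provider_for(x):
    """Return the first provider whose registered type matches x."""
    for typ, provider in _registry:
        if isinstance(x, typ):
            return provider
    raise TypeError(f"no math provider registered for {type(x).__name__}")


def sin(x):
    """Dispatch sin to whichever provider is registered for x's type."""
    return _provider_for(x).sin(x)


register_type(float, math)
register_type(complex, cmath)

print(sin(0.5), sin(1j))
```

Using `isinstance` gives a simple answer to the subclass question raised above, at the price of making registration order significant. Unregistered types (including `int` here) raise `TypeError` rather than being coerced.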
On 19.01.2005, at 04:03, David M. Cooke wrote:
That's not quite perfect, as the user has to use a mathfuncs object also; that's why having Numeric or numarray do the delegation automatically is nice.
Exactly. It is an important practical feature of automatic derivatives that you can use them with nearly any existing mathematical code. If you have to import the math functions from somewhere else, then you have to adapt all that code, which in the case of code imported from some other module means rewriting it. More importantly, that approach doesn't scale to larger installations. If two different modules use it to provide generalized math functions, then the math functions of the two will not be interchangeable. In fact, it was exactly that kind of missing universality that was the motivation for the ufunc code in NumPy ("u" for "universal"). Before NumPy, we had math (for float) and cmath (for complex), but there was no simple way to write code that would accept either float or complex even though that is often useful. Ufuncs would work on float, complex, arrays of either type, and "anything else" through the method call mechanism.
If this was a module instead, you could have registration of types. I'll call this module numpy. Here's a possible (low-level) usage:
Yes, a universal module with a registry would be another viable solution. But the whole community would have to agree on one such module to make it useful.
Things to consider with this would be: * how to handle a + b
a + b is just operator.add(a, b). The same mechanism would work.
* where should the registering of types be done? (Probably by the packages themselves)
Probably. The method call approach has an advantage here: no registry is required. In fact, if we could start all over again, I would argue for a math function module to be part of core Python that does nothing but convert function calls into method calls. After all, math functions are just syntactic sugar for what functionally *is* a method call.
Konrad.
--
Konrad Hinsen
Laboratoire Leon Brillouin, CEA Saclay, 91191 Gif-sur-Yvette Cedex, France
Tel.: +33-1 69 08 79 25
Fax: +33-1 69 08 82 61
E-Mail: hinsen@llb.saclay.cea.fr
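A minimal sketch of the method-call approach described in this message, written for current Python 3: module-level functions that only turn `sin(x)` into `x.sin()`, with `math` as the fallback for plain numbers. The `Dual` class is a toy value-plus-derivative type invented for the example. It is not ScientificPython's `DerivVar`, and `_dispatch` is a hypothetical helper.

```python
import math


def _dispatch(name, x):
    """Call x.<name>() if the object defines it, else fall back to math.<name>(x)."""
    method = getattr(x, name, None)
    if method is not None:
        return method()
    return getattr(math, name)(x)


def sin(x):
    return _dispatch("sin", x)


def cos(x):
    return _dispatch("cos", x)


class Dual:
    """Toy number carrying a value and its first derivative."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def sin(self):
        # Chain rule: d/dt sin(u) = cos(u) * du/dt
        return Dual(sin(self.value), cos(self.value) * self.deriv)

    def cos(self):
        # Chain rule: d/dt cos(u) = -sin(u) * du/dt
        return Dual(cos(self.value), -sin(self.value) * self.deriv)


r = sin(Dual(0.0, 1.0))
print(r.value, r.deriv)
```

No registry is needed: any class that defines a `sin` method takes part automatically, which is the advantage claimed above.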
Travis Oliphant wrote:
Thanks for the comments that have been made. One of my reasons for commenting is to get an understanding of which design issues of Numarray are felt to be important and which can change. There seems to be this idea that small arrays are not worth supporting. I hope this is just due to time-constraints and not some fundamental idea that small arrays should never be considered with Numarray. Otherwise, there will always be two different array implementations developing at their own pace.
I wouldn't say that we are "hostile" to small arrays. We do only have limited resources and can't do everything we would like. More on this below though.
I really want to gauge how willing developers of numarray are to changing things.
Without going into all the details below, I think I can address this point. I suppose it all depends on what you mean by "how willing developers of numarray are to changing things." If you mean, are we open to changes to numarray that speed up small arrays (and address other noted shortcomings)? Yes, certainly (so long as they don't hurt the large array issues significantly). If it means we will drop everything and address all these issues immediately ourselves, no: we have other things to do regarding numarray that have higher priority before we can address these things. I would have a very hard time justifying the effort when there are other things needed by STScI more. We would love it if others could address them sooner though. More on related issues below.
1) Are there plans to move the nd array entirely into C? [...] I do not think it would be difficult at this point to move it all to C and then make future changes there (you can always call pure Python code from C). With the structure in place and some experience behind you, now seems like as good a time as any. Especially, because now is a better time for me than any... I like what numarray is doing by not always defaulting to ints with the maybelong type. It is a good idea.
I hope that is true, but we've found moving things to C a bigger effort than we would like. I'd like to be proved wrong by someone who can tackle it sooner than we can.
2) Why is the ND array C-structure so large? Why are the dimensions and strides array static? Why can't the extra stuff that the fancy arrays need be another structure and the numarray C structure just extended with a pointer to the extra stuff?
When Todd moved NDArray into C, he tried to keep it simple. As such, it has no "moving parts." We think making dimensions and strides malloc'ed rather than static would be fairly easy. Making the "extra stuff" variable is something we can look at.
But allocating dimensions and strides when needed is not difficult and it reduces the overhead of the ndarray object. Currently, that overhead seems extreme. I could be over-reacting here, but it just seems like it would have made more sense to expand the array object as little as possible to handle the complexity that you were searching for. It seems like more modifications were needed in the ufunc than in the arrayobject.
I'm not convinced that this is a big issue, but we have no objection to someone making this change. But it falls well below small array performance in priority for us.
The bottom line is that adding the variability adds complexity and we're not sure we understand the storage economics of why we would be doing it. Numarray was designed, first and foremost, for large arrays.
Small arrays are never going to disappear (Fernando Perez has an excellent example) and there are others. A design where a single pointer not being NULL is all that is needed to distinguish "simple" Numeric-like arrays from "fancy" numarray-like arrays seems like a great way to make sure that
I won't quarrel with that (but I'm not sure what you are suggesting in the bigger picture).
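To make the layout question concrete, here is a minimal sketch of the two designs being contrasted. It uses Python's ctypes only to describe and measure C-style structs. Every field name, the `FancyExtras` contents, and the `MAXDIM` value of 40 are assumptions for illustration and are not taken from numarray's or Numeric's headers.

```python
import ctypes

MAXDIM = 40  # assumed maximum rank, for illustration only


class StaticArrayHeader(ctypes.Structure):
    """Hypothetical header with dimensions, strides and extras stored inline."""
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int),
        ("dimensions", ctypes.c_long * MAXDIM),  # fixed-size, always present
        ("strides", ctypes.c_long * MAXDIM),     # fixed-size, always present
        ("byteoffset", ctypes.c_long),           # "fancy" fields every array carries
        ("byteswapped", ctypes.c_int),
    ]


class FancyExtras(ctypes.Structure):
    """Hypothetical side structure holding what only "fancy" arrays need."""
    _fields_ = [
        ("byteoffset", ctypes.c_long),
        ("byteswapped", ctypes.c_int),
    ]


class CompactArrayHeader(ctypes.Structure):
    """Hypothetical header with allocated dimensions/strides and optional extras."""
    _fields_ = [
        ("data", ctypes.c_void_p),
        ("nd", ctypes.c_int),
        ("dimensions", ctypes.POINTER(ctypes.c_long)),
        ("strides", ctypes.POINTER(ctypes.c_long)),
        ("extras", ctypes.POINTER(FancyExtras)),  # NULL for simple arrays
    ]


def make_compact(shape, itemsize):
    """Build a compact header for a C-contiguous array of the given shape."""
    nd = len(shape)
    strides = []
    step = itemsize
    for extent in reversed(shape):  # last axis varies fastest
        strides.append(step)
        step *= extent
    strides.reverse()
    header = CompactArrayHeader()
    header.nd = nd
    # Arrays sized to the actual rank; ctypes keeps them alive with the header.
    header.dimensions = (ctypes.c_long * nd)(*shape)
    header.strides = (ctypes.c_long * nd)(*strides)
    return header


def attach_extras(header, byteoffset, byteswapped):
    """Turn a simple array into a "fancy" one by filling in the extras pointer."""
    header.extras = ctypes.pointer(FancyExtras(byteoffset, byteswapped))


def is_simple(header):
    """A NULL extras pointer marks a plain, Numeric-like array."""
    return not header.extras


if __name__ == "__main__":
    print("static header bytes: ", ctypes.sizeof(StaticArrayHeader))
    print("compact header bytes:", ctypes.sizeof(CompactArrayHeader))
```

The exact byte counts depend on the platform, so none are quoted here. The point of the sketch is only that the per-array cost of the compact form scales with the actual rank, and that a single NULL test separates the two kinds of array.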
On another fundamental note, numarray is being sold as a replacement for Numeric. But, then, on closer inspection many things that Numeric does well, numarray is ignoring or not doing very well. I think this presents a certain amount of false advertising to new users, who don't understand the history. Most of them would probably never need the fanciness that numarray provides and would be quite satisfied with Numeric. They just want to know what others are using. I think it is a disservice to call numarray a replacement for Numeric until it actually is. It should currently be called an "alternative implementation" focused on large arrays. This (unintentional) sleight of hand that has been occurring over the past year has been my biggest complaint with numarray. Making numarray a replacement for Numeric means that it has to support small arrays, object arrays, and ufuncs at least as well as but preferably better than Numeric. It should also be faster than Numeric whenever possible, because Numeric has lots of potential optimizations that have never been applied. If numarray does not do these things, then in my mind it cannot be a replacement for Numeric and should stop being called that on the numpy web site.
It distresses me to be accused of false advertising. We were pretty up front at the beginning of the process of writing numarray that the approach we would be taking would likely mean slower small array performance. There were those (like you and Eric) who expressed concern about that, but it wasn't at all clear what the consensus was regarding how much it could change and be acceptable. (I recall at one point when IDL was ported from Fortran to C which resulted in a factor of 2 overall slowdown in speed. People didn't accuse RSI of providing something that wasn't a replacement for IDL.) The fact was that at the time we started, several thought that backward compatibility wasn't that important. We didn't even try at the beginning to make the C-API the same. At the start, there was no claim that numarray would be an exact replacement for Numeric. (And I didn't hear huge objections at the time on the point, and some actually encouraged a break with how Numeric did things.) Many of the attempts to provide backward compatibility have come well after the first implementations. We have striven to provide the full functionality of what Numeric had as we went to version 1.0. Sure, there are some holes for object arrays. So the issue of whether numarray is a replacement or not seems to be arguing over what the intent of the project was. Paul Dubois wrote the numpy page that makes that reference, and sure, I didn't object to it. (But why didn't you at the time? It's been there a long time, and the goals and direction of numarray have been quite visible for a long time. This wasn't some dark, secret project. Many of the things you are complaining about have been true for some time.) If people want to call numarray an alternative implementation, I'm fine with that. It was a replacement in our case. If we didn't develop it, we likely wouldn't be using Python in the full sense that we are now. Numeric wasn't really an option.
At the time, many supported the idea of a reimplementation so it seemed like a good opportunity to add what we needed and do that. Obviously, we misread the importance of small array performance for a significant part of the community. (But I keep saying, if small array performance is really that important, it would seem to me that much bigger wins are available, as Fernando mentioned.) It's been clear for the better part of a year that it would be a long time before there was any sort of unification between the two. That distressed me as I'm sure it did you. So some sort of useful sharing of libraries and packages seemed like the obvious way to go. In more specialized areas, there would be some divergence (e.g., we have dependencies on record arrays that we just can't provide in Numeric). I can no longer justify sinking many more months of work into numarray for issues of no value to STScI (other than the hope that it would convince others to switch, which isn't clear at all that it would). We need to move towards providing a lot of the tools that are available for Numeric. I can justify that work. The current situation is far from ideal (Paul called it "insane" at scipy if you prefer more colorful language). What we have are two camps that cannot afford to give up the capabilities that are unique to each version. But with most of the C-API compatible, and a way of coding most libraries (except for Ufuncs) to be compatible with both, we certainly can improve the situation. If you can help remove the biggest obstacle, small array performance, so that we could unify the two I would be thrilled, but most of the effort can't come from us, at least not in the near term (next year). We can help at some level. [...]
I never really understood the "code is too complicated" argument
You lost me on this one. You mean the complaint that it was too complicated in Numeric way back?
anyway. I was just wondering if there is some support for reducing the number of source code files, or reorganizing them a bit.
Yes, I'd say that this has relatively high priority. It would be nice to have feedback and advice on how to do this best.
4) Object arrays must be supported. This was a bad oversight and an important feature of Numeric arrays.
The current implementation does support them (though in a different way, and generally not as efficiently, though Todd is more up on the details here). What aspect of object arrays are you finding lacking? C-api?
I did not see such support when I looked at it, but given the previous comment, I could easily have missed where that support is provided. I'm mainly following up on Konrad's comment that his Automatic differentiation does not work with Numarray because of the missing support for object arrays. There are other applications for object arrays as well. Most of the support needs to come from the ufunc side.
I think Robert Kern pointed to the issue in a subsequent message.
Again, thanks to the work that has been done. I'm really interested to see if some of these modifications can be done as in my mind it will help the process of unifying the two camps.
I'm glad to see that you are taking a look at it and welcome the comments and any offers of help in improving speed.
I would be interested in helping if there is support for really making numarray a real replacement for Numeric, by addressing the concerns that I've outlined. As stated at the beginning, I'm really just looking for how receptive numarray developers would be to the kinds of changes I'm talking about: (1) reducing the size of the array structure, (2) moving the ndarray entirely into C, (3) improving support for object arrays, (4) improving ufunc API support.
I'm not exactly sure what you mean by 4). If you mean having a compatible API to Numeric, that seems like a lot of work since the way ufuncs work in numarray is quite different. But you may mean something else.
Perry
On 19.01.2005, at 02:56, Perry Greenfield wrote:
It distresses me to be accused of false advertising. We were pretty up front at the beginning of the process of writing numarray that the
It's not you, or the numarray team in general, that is being accused. Actually I doubt that any single person is responsible for the current state of misinformation. Those are the wonders of the OpenSource world. I saw Travis' post more as a request for clarification than an accusation against anyone in particular. As you describe very well, there is a gap between past intents and what has actually happened.
concern about that), but it wasn't at all clear what the consensus was regarding how much it could change and be acceptable. (I recall
It's probably still not clear. Perhaps there is no consensus at all.
The current situation is far from ideal (Paul called it "insane" at scipy if you prefer more colorful language). What we have are two camps that cannot afford to give up the capabilities that are unique to each version. But with most of the C-API compatible, and a way of coding most libraries (except for Ufuncs) to be compatible with both, we certainly can improve the situation.
I am not sure that compatibility is really the main issue. In the typical scientific computing installation, NumPy and numarray are building blocks. Some people use them without even being aware of them, indirectly through other libraries. In a building-block world, two bricks should be either equivalent or be able to coexist. The original intention was to make NumPy and numarray equivalent, but this is not what they are at the moment. But they do not coexist very well either. While it is easy to install both of them, every library that builds on them uses one or the other (and to make it worse, it is not always easy to figure out which one is used if both are available). Sooner or later, anyone who uses multiple libraries that are array clients is going to have a compatibility issue, which will probably be hard to understand because both sides' arrays look so very similar.
Konrad.
--
Konrad Hinsen
Laboratoire Leon Brillouin, CEA Saclay,
91191 Gif-sur-Yvette Cedex, France
Tel.: +33-1 69 08 79 25
Fax: +33-1 69 08 82 61
E-Mail: hinsen@llb.saclay.cea.fr
On Jan 19, 2005, at 7:31 AM, konrad.hinsen@laposte.net wrote:
The current situation is far from ideal (Paul called it "insane" at scipy if you prefer more colorful language). What we have are two camps that cannot afford to give up the capabilities that are unique to each version. But with most of the C-API compatible, and a way of coding most libraries (except for Ufuncs) to be compatible with both, we certainly can improve the situation.
I am not sure that compatibility is really the main issue. In the typical scientific computing installation, NumPy and numarray are building blocks. Some people use them without even being aware of them, indirectly through other libraries.
In a building-block world, two bricks should be either equivalent or be able to coexist. The original intention was to make NumPy and numarray equivalent, but this is not what they are at the
Just to clarify, the intention to make them equivalent was not originally true (and some encouraged the idea that there be a break with Numpy compatibility). But that has grown to be a much bigger goal over time.
moment. But they do not coexist very well either. While it is easy to install both of them, every library that builds on them uses one or the other (and to make it worse, it is not always easy to figure out which one is used if both are available). Sooner or later, anyone who uses multiple libraries that are array clients is going to have a compatibility issue, which will probably be hard to understand because both sides' arrays look so very similar.
No doubt that supporting both introduces more work, but for the most part, I think that with the exception of some parts (namely the ufunc C-API), it should be possible to write a library that supports both with little conditional code. That does mean not using some features of numarray, or depending on some of the different behaviors of Numeric (e.g., scalar coercion rules), so that requires understanding the subsets to use. And that does cost. But one doesn't need to have two separate libraries. In such cases I'm hoping there is no need to mix different flavors of arrays. You either use Numeric arrays consistently or numarrays consistently. And if the two can be unified, then this will just be an intermediate solution.
Perry
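As a rough illustration of "little conditional code", a library could confine the choice of array package to one small loader and use only the shared subset everywhere else. The sketch below is an assumption about how that might look, not how any existing library does it; the helper name `load_array_backend` and the preference order are hypothetical, while `numarray` and `Numeric` are the packages' import names.

```python
import importlib


def load_array_backend(candidates=("numarray", "Numeric")):
    """Return the first importable array package from `candidates`.

    The preference order is an arbitrary choice made for this sketch.
    """
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # package not installed, try the next one
    raise ImportError(
        "none of these array packages could be imported: " + ", ".join(candidates)
    )


# Intended use in library code (kept as a comment because neither package
# is assumed to be installed where this sketch is read):
#
#     backend = load_array_backend()
#     a = backend.array([1, 2, 3])   # only calls both packages provide
```

The loader keeps one flavor of array in use throughout a given run, which matches the advice above to use Numeric arrays consistently or numarrays consistently. It does nothing for the case Konrad raises, where two independent libraries have each made a different choice.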
On 19.01.2005, at 17:43, Perry Greenfield wrote:
Just to clarify, the intention to make them equivalent was not originally true (and some encouraged the idea that there be a break with Numpy compatibility). But that has grown to be a much bigger goal over time.
If my memory serves me well, the original intention was to have a new implementation that could replace the old one feature- and performancewise but without promising API compatibility. What we have now is the opposite.
No doubt that supporting both introduces more work, but for the most part, I think that with the exception of some parts (namely the ufunc C-API), it should be possible to write a library that supports both with little conditional code.
Yes, certainly. But not everybody is going to do it, for whatever reasons, if only lack of time or dependencies on exclusive features. So one day, there will be library A that requires NumPy and library B that requires numarray (that day may already have arrived). If I want to use both A and B in my code, I can expect to run into problems and unpleasant debugging sessions. Konrad.
I'd like to clarify our position on this a bit in case previous messages have given a wrong or incomplete impression.
1) We don't deny that small array performance is important to many users. We understand that. But it generally isn't important for our projects, and in the list of things to do for numarray, we can't give it high priority this year. We have devoted resources to this issue in the past couple of years (but without sufficient success to persuade many to switch for that reason alone), and it is hard to continue to put many more resources into this not knowing whether it will be enough of an improvement to satisfy those that really need it.
2) This doesn't mean that we think it isn't important to add as soon as it can be done. That is, we aren't trying to prevent such improvements from being made.
3) We hope that there are people out there for whom this is important who would like to see a numarray/Numeric unification, have some experience with the internals of one or the other (or are willing to learn), and are willing to devote the time to help make numarray faster (if you can rewrite everything from scratch and satisfy both worlds, that would make us just as happy :-).
4) We are willing to help in the near term as far as helping explain how things currently work, where possible improvements can be made, helping in design discussions, reviewing proposed or actual changes, and doing the testing and integration of such changes.
5) But the onus of doing the actual implementation can't be on us for reasons I've already given. But besides those I think it is important that whoever does this should have a strong stake in the success of this (i.e., the performance improvements are important for their projects).
Perry
On a different note, we will update the numarray home page to better reflect the current situation with regard to Numeric, particularly to clarify that there is no official consensus regarding it as a replacement for Numeric (but also to spell out what the differences are so that people wondering about which to use will have a better idea to base their choice on, and to give an idea of what our development plans and priorities are). We're fairly busy at the moment so it may take a few days for such updates to the web page to happen. I'll post a message when that happens so that those interested can look at them and provide comments if they feel they are not accurate. I'll also contact Paul Dubois about updating the numpy page.
Perry
On 18.01.2005, at 19:26, Travis Oliphant wrote:
On another fundamental note, numarray is being sold as a replacement for Numeric. But, then, on closer inspection many things that Numeric does well, numarray is ignoring or not doing very well. I think this presents a certain amount of false advertising to new users, who don't understand the history. Most of them would probably never need the fanciness that
I agree with that. I regularly get questions from people who download my code and then wonder why it "still" uses NumPy instead of the "newer" numarray. The reason is that my code has nothing to gain from numarray, as it uses many small and few if any very large arrays. I have no problem explaining that, but the fact that the question arises shows that there is a wrong perception by many newcomers of the relation between NumPy and numarray.
comment, I could easily have missed where that support is provided. I'm mainly following up on Konrad's comment that his Automatic differentiation does not work with Numarray because of the missing support for object arrays. There are other applications for object arrays as well. Most of the
While I agree that object arrays are useful, they have nothing to do with the missing feature that I mentioned recently. That one concerns only ufuncs. In NumPy, they use a method call when presented with an object type they cannot handle directly. In numarray, they just produce an error message in that case. Returning to object arrays, I have used them occasionally but never in any of my public code, because there have been lots of minor bugs concerning them in all versions of NumPy. It would be nice if numarray could do a better job there.
Konrad.
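The difference in ufunc behavior described above can be sketched in plain Python. This is a toy model under stated assumptions, not the code of either package: `apply_unary` is a hypothetical stand-in for a ufunc, and `Dual` is a hypothetical value-plus-derivative class of the kind an automatic differentiation package might define.

```python
import math


class Dual:
    """Hypothetical number carrying a value and its derivative."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def sin(self):
        # Chain rule: d/dx sin(u) = cos(u) * u'
        return Dual(math.sin(self.value), math.cos(self.value) * self.deriv)


def apply_unary(name, items, fallback=True):
    """Apply the math function `name` elementwise.

    With fallback=True, an element of an unknown type is handled by calling
    its own method of the same name (the behavior attributed to NumPy above).
    With fallback=False, such an element raises an error instead (the
    behavior attributed to numarray above).
    """
    fast_path = getattr(math, name)
    result = []
    for item in items:
        if isinstance(item, (int, float)):
            result.append(fast_path(item))
        elif fallback and hasattr(item, name):
            result.append(getattr(item, name)())
        else:
            raise TypeError(
                "%s() cannot handle %s" % (name, type(item).__name__)
            )
    return result


if __name__ == "__main__":
    mixed = [0.0, Dual(0.0, 1.0)]
    out = apply_unary("sin", mixed)
    print(out[0], out[1].value, out[1].deriv)
```

With the fallback in place, user-defined numeric types can pass through code written for ordinary arrays without that code knowing about them, which is what the automatic differentiation use case relies on.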
Konrad Hinsen wrote:
comment, I could easily have missed where that support is provided. I'm mainly following up on Konrad's comment that his Automatic differentiation does not work with Numarray because of the missing support for object arrays. There are other applications for object arrays as well. Most of the
While I agree that object arrays are useful, they have nothing to do with the missing feature that I mentioned recently. That one concerns only ufuncs. In NumPy, they use a method call when presented with an object type they cannot handle directly. In numarray, they just produce an error message in that case.
Returning to object arrays, I have used them occasionally but never in any of my public code, because there have been lots of minor bugs concerning them in all versions of NumPy. It would be nice if numarray could do a better job there.
This is a good point. In fact, when we started thinking about implementing object arrays, it looked trickier than it first appeared. One needs to ensure that all the objects referenced in the arrays have their reference counts appropriately adjusted with all operations. At that time it was quite easy to segfault Numeric using object arrays, I'm guessing for this reason. Perhaps those problems have since been fixed. I don't recall the exact manipulations that caused the segfaults, but they were simple operations, and I don't know if the same problems remain.
Perry
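For readers unfamiliar with the bookkeeping involved, the sketch below shows the invariant an object array has to maintain, using a plain Python list as a stand-in for the array's slots. It assumes CPython, where `sys.getrefcount` reports reference counts; the `Tracked` class and the function name are hypothetical. A C implementation has to make each of these adjustments by hand, and a missed one can leave a slot pointing at freed memory.

```python
import sys


class Tracked:
    """Placeholder element type; any Python object would do."""


def refcount_trace():
    """Report how an element's reference count moves as slots change.

    Returns the count relative to its starting value after storing the
    element in three slots, after overwriting one slot, and after the
    container is released.
    """
    element = Tracked()
    base = sys.getrefcount(element)

    slots = [element, element, element]  # three slots now reference it
    after_store = sys.getrefcount(element) - base

    slots[0] = None                      # overwriting a slot drops one reference
    after_overwrite = sys.getrefcount(element) - base

    del slots                            # freeing the container drops the rest
    after_free = sys.getrefcount(element) - base

    return after_store, after_overwrite, after_free


if __name__ == "__main__":
    print(refcount_trace())
```

Python's own containers do this automatically, which is why the list version cannot go wrong. The difficulty described above arises when array operations copy or overwrite raw object pointers in C without the matching increments and decrements.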
participants (9)
- Chris Barker
- cookedm@physics.mcmaster.ca
- Fernando Perez
- konrad.hinsen@laposte.net
- Paul Dubois
- Perry Greenfield
- Ralf Juengling
- Robert Kern
- Travis Oliphant