
Hi Magnus,
> I'm looking at various ways of implementing graphs in Python (beyond simple dict-based stuff -- more performance is needed). kjbuckets looks like a nice alternative, as does the Boost Graph Library (not sure how easy it is to use with Boost.Python), but if numarray is to become a part of the standard library, it could be beneficial to use that...
> For dense graphs, it makes sense to use an adjacency matrix directly in numarray, I should think. (I haven't implemented many graph algorithms with ufuncs yet, but it seems doable...) For sparse graphs I guess some sort of sparse array implementation would be useful, although the archives indicate that creating such a thing isn't a core part of the numarray project.
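A minimal sketch of the dense-matrix idea: Floyd-Warshall all-pairs shortest paths written as whole-array operations rather than inner Python loops. Modern NumPy is used here purely as a stand-in for numarray's similar Numeric-style interface, so the exact function names are an assumption, not numarray's API:

    import numpy as np

    def floyd_warshall(adj):
        # adj[i, j]: weight of edge i -> j, np.inf where there is no edge,
        # 0.0 on the diagonal.
        dist = adj.astype(float)
        n = dist.shape[0]
        for k in range(n):
            # One whole-matrix update per intermediate vertex k: the
            # candidate path i -> k -> j is formed by broadcasting a
            # column against a row (an "outer" addition), then taking
            # the elementwise minimum.
            dist = np.minimum(dist, dist[:, k, None] + dist[None, k, :])
        return dist

    INF = np.inf
    adj = np.array([[0.0, 3.0, INF],
                    [INF, 0.0, 1.0],
                    [2.0, INF, 0.0]])
    print(floyd_warshall(adj))   # shortest-path matrix, e.g. 0 -> 2 costs 4.0

The point is only that each pass over an intermediate vertex is a single array expression, which is the kind of formulation that lets the per-element work happen in C.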
First of all, it may make sense, but I should say a few words about what scale sizes make sense. Currently numarray is implemented mostly in Python (excepting the very low-level, very simple C functions that do the computational and indexing loops). This means it currently has a pretty sizable overhead to set up an array operation (I'm guessing an order of magnitude slower than Numeric). Once set up, it generally is pretty fast. So it is pretty good for very large data sets, and very lousy for very small ones.

We haven't measured efficiency lately (we are deferring optimization until we have all the major functionality in place), but I wouldn't be at all surprised to find that the setup time can equal the time to actually process ~10,000-20,000 elements (i.e., the time spent per element for a 10K array is roughly half that for much larger arrays). So if you are working with arrays much smaller than 10K, you won't see total execution time decrease much (it was already spending half its time in setup, which doesn't change).

We would like to reduce this size threshold in the future, either by optimizing the Python code or by moving some of it into C. That optimization won't come for at least a couple more months; we have more urgent features to deal with. I doubt that we will ever surpass the current Numeric in its performance on small arrays (though who knows, perhaps we can come close).
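For concreteness, a toy cost model of that crossover; the constants below are invented purely to illustrate the point, not measured from numarray:

    # total time ~= setup + n * per_element.  With these made-up
    # constants the setup cost is half the total at n = 10,000, i.e.
    # the per-element time there is twice the large-array limit.
    def model_time(n, setup=1.0e-4, per_element=1.0e-8):
        return setup + n * per_element

    for n in (100, 10000, 1000000):
        t = model_time(n)
        print("n=%8d  total=%.2e s  per element=%.2e s" % (n, t, t / n))

Below roughly 10K elements the fixed setup term dominates, which is why shrinking the array doesn't shrink the total time much.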
> What do you think -- is it reasonable to use numarray for graph algorithms? Perhaps an additional module with standard graph algorithms would be interesting? (I'm sure I could contribute some if there is any interest...)
Before I go further, I need to find out if the preceding has made you gasp in horror or if the timescale is too slow for you to accept. (This particular issue also makes me wonder if numarray would ever be a suitable substitute for the existing array module.) What size graphs are you most concerned about as far as speed goes?
> And -- is there any chance of getting sparse matrices in numarray?
Since talk is cheap, yes :-). But I doubt it would be in the "core", and some thought would have to be given to how best to represent them. In one sense, since the underlying storage is different from what numarray assumes for all its arrays, sparse arrays don't really share the same underlying C machinery very well. While it certainly would be possible to devise a class with the same interface as numarray objects, the implementation may have to be completely different. On the other hand, since numarray has much better support for index arrays (i.e., an array of indices that may be used to index another array of values), an index-array/value-array pair may itself serve as a storage model for sparse arrays. One would still need to implement ufuncs and other functions (including simple things like indexing) using different machinery. It is something that would be nice to have, but I can't say when we would get around to it, and I don't want to raise hopes about how quickly it would appear.

Perry
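A minimal sketch of the index-array/value-array storage model described above, in coordinate (row/column/value) form. Modern NumPy is used as a stand-in for the index-array support being discussed; np.add.at in particular is a NumPy call, not something numarray is claimed to provide:

    import numpy as np

    # Parallel index arrays plus a value array: entry k lives at
    # (rows[k], cols[k]) with value vals[k].
    rows = np.array([0, 0, 2, 3])
    cols = np.array([1, 3, 2, 0])
    vals = np.array([5.0, 1.0, 7.0, 2.0])

    # Scattering into / gathering from a dense array is just fancy
    # indexing with the index arrays.
    dense = np.zeros((4, 4))
    dense[rows, cols] = vals
    assert np.all(dense[rows, cols] == vals)

    # A sparse matrix-vector product written only against the
    # index/value triplets: accumulate vals[k] * x[cols[k]] into
    # y[rows[k]].
    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.zeros(4)
    np.add.at(y, rows, vals * x[cols])
    print(y)            # [14.  0. 21.  2.]

Indexing, ufunc application, and reductions would all need this kind of dedicated machinery layered over the index/value pair, which is why it would not simply reuse the dense-array C code.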