I'm looking at various ways of implementing graphs in Python (beyond simple dict-based stuff -- more performance is needed). kjbuckets looks like a nice alternative, as does the Boost Graph Library (not sure how easy it is to use with Boost.Python), but if numarray is to become a part of the standard library, it could be beneficial to use that...

For dense graphs, it makes sense to use an adjacency matrix directly in numarray, I should think. (I haven't implemented many graph algorithms with ufuncs yet, but it seems doable...) For sparse graphs I guess some sort of sparse array implementation would be useful, although the archives indicate that creating such a thing isn't a core part of the numarray project.

What do you think -- is it reasonable to use numarray for graph algorithms? Perhaps an additional module with standard graph algorithms would be interesting? (I'm sure I could contribute some if there is any interest...) And -- is there any chance of getting sparse matrices in numarray?

-- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
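A minimal sketch of the dense-graph idea above, using modern NumPy as a stand-in for numarray (the numarray API differs in detail, but the adjacency-matrix approach is the same; the function name and example edges are made up for illustration):

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Dense boolean adjacency matrix for a directed graph on n nodes."""
    adj = np.zeros((n, n), dtype=bool)
    for u, v in edges:
        adj[u, v] = True
    return adj

adj = adjacency_matrix(4, [(0, 1), (0, 2), (1, 2), (2, 3)])

# Whole-array operations replace Python loops over nodes:
out_degree = adj.sum(axis=1)                  # edges leaving each node
paths2 = adj.astype(int) @ adj.astype(int)    # counts of length-2 paths
```

Matrix powers of the adjacency matrix counting paths is exactly the kind of operation that maps onto array arithmetic rather than explicit loops.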
Hi Magnus,
I'm looking at various ways of implementing graphs in Python (beyond simple dict-based stuff -- more performance is needed). kjbuckets looks like a nice alternative, as does the Boost Graph Library (not sure how easy it is to use with Boost.Python) but if numarray is to become a part of the standard library, it could be beneficial to use that...
For dense graphs, it makes sense to use an adjacency matrix directly in numarray, I should think. (I haven't implemented many graph algorithms with ufuncs yet, but it seems doable...) For sparse graphs I guess some sort of sparse array implementation would be useful, although the archives indicate that creating such a thing isn't a core part of the numarray project.
First of all, it may make sense, but I should say a few words about what scale sizes make sense. Currently numarray is implemented mostly in Python (excepting the very low-level, very simple C functions that do the computational and indexing loops). This means it currently has a pretty sizable overhead to set up an array operation (I'm guessing an order of magnitude slower than Numeric). Once set up, it generally is pretty fast. So it is pretty good for very large data sets and very lousy for very small ones. We haven't measured efficiency lately (we are deferring optimization until we have all the major functionality present first), but I wouldn't be at all surprised to find that the set-up time can be equal to the time to actually process ~10,000-20,000 elements (i.e., for a 10K array, roughly half the total time goes to set-up). So if you are working with arrays much smaller than 10K, you won't see total execution time decrease much (it was already spending half its time in setup, which doesn't change). We would like to reduce this size threshold in the future, either by optimizing the Python code or by moving some of it into C. This optimization wouldn't happen for at least a couple more months; we have more urgent features to deal with. I doubt that we will ever surpass the current Numeric in its performance on small arrays (though who knows, perhaps we can come close).
What do you think -- is it reasonable to use numarray for graph algorithms? Perhaps an additional module with standard graph algorithms would be interesting? (I'm sure I could contribute some if there is any interest...)
Before I go further, I need to find out if the preceding has made you gasp in horror or if the timescale is too slow for you to accept. (This particular issue also makes me wonder if numarray would ever be a suitable substitute for the existing array module.) What size graphs are you most concerned about as far as speed goes?
And -- is there any chance of getting sparse matrices in numarray?
Since talk is cheap, yes :-). But I doubt it would be in the "core", and some thought would have to be given to how best to represent them. In one sense, since the underlying storage is different from what numarray assumes for all its arrays, sparse arrays don't really share the same underlying C machinery very well. While it certainly would be possible to devise a class with the same interface as numarray objects, the implementation may have to be completely different. On the other hand, since numarray has much better support for index arrays -- i.e., an array of indices that may be used to index another array of values -- an (index array, value array) pair may itself serve as a storage model for sparse arrays. One still needs to implement ufuncs and other functions (including simple things like indexing) using different machinery. It is something that would be nice to have, but I can't say when we would get around to it and don't want to raise hopes about how quickly it would appear. Perry
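The (index array, value array) pairing Perry describes can be made concrete. A hypothetical sketch of such coordinate-style storage, written with modern NumPy rather than numarray (the variable names and example data are made up):

```python
import numpy as np

# Parallel arrays: where the nonzero entries live, and what they are.
rows = np.array([0, 0, 2, 3])
cols = np.array([1, 3, 2, 0])
vals = np.array([5.0, 1.0, 2.0, 7.0])

# Index arrays scatter the values into a dense array on demand:
dense = np.zeros((4, 4))
dense[rows, cols] = vals

# Ufuncs applied to `vals` alone touch only the stored (nonzero) entries:
scaled = vals * 2.0
```

This is essentially the COO (coordinate) layout later adopted by sparse-matrix libraries: storage is proportional to the number of nonzeros, not n**2.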
Perry Greenfield <perry@stsci.edu>: [snip]
First of all, it may make sense, but I should say a few words about what scale sizes make sense. [snip] So if you are working with much smaller arrays than 10K, you won't see total execution time decrease much
In relation to what? Using dictionaries etc? Using the array module? [snip]
Before I go further, I need to find out if the preceding has made you gasp in horror or if the timescale is too slow for you to accept.
Hm. If you need 10000 elements before numarray pays off, I'm starting to wonder if I can use it for anything at all. :I
(This particular issue also makes me wonder if numarray would ever be a suitable substitute for the existing array module).
Indeed.
What size graphs are you most concerned about as far as speed goes?
I'm not sure. A wide range, I should imagine. But with only 100 nodes, I'll get 10000 entries in the adjacency matrix, so perhaps it's worthwhile anyway?
And -- is there any chance of getting sparse matrices in numarray?
Since talk is cheap, yes :-). But I doubt it would be in the "core" and some thought would have to be given to how best to represent them. In one sense, since the underlying storage is different than numarray assumes for all its arrays, sparse arrays don't really share the same underlying C machinery very well. While it certainly would be possible to devise a class with the same interface as numarray objects, the implementation may have to be completely different.
Yes, I realise that.
On the other hand, since numarray has much better support for index arrays -- i.e., an array of indices that may be used to index another array of values -- an (index array, value array) pair may itself serve as a storage model for sparse arrays.
That's an interesting idea, although I don't quite see how it would help in the case of adjacency matrices. (You'd still need at least one n**2 size matrix for n nodes, wouldn't you -- i.e. the index array... Right?)
One still needs to implement ufuncs and other functions (including simple things like indexing) using different machinery. It is something that would be nice to have, but I can't say when we would get around to it and don't want to raise hopes about how quickly it would appear.
No - no problem. Basically, I'm looking for a platform to implement graph algorithms that doesn't necessitate too many installed packages etc. numarray seemed promising since it's a candidate for inclusion in the standard library. I guess I'll just have to do some timing experiments...
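One way to run the timing experiments Magnus mentions, sketched with the standard-library timeit and modern NumPy standing in for numarray; the graph size, edge density, and repeat count here are arbitrary assumptions:

```python
import timeit
import numpy as np

n = 1000
rng = np.random.default_rng(0)
adj = rng.random((n, n)) < 0.01                 # random directed graph
adj_dict = {u: set(np.flatnonzero(adj[u])) for u in range(n)}

# Same task both ways: count every edge in the graph.
t_array = timeit.timeit(lambda: int(adj.sum()), number=100)
t_dict = timeit.timeit(lambda: sum(len(s) for s in adj_dict.values()),
                       number=100)
print(f"array: {t_array:.4f}s  dict: {t_dict:.4f}s")
```

Varying `n` in such a script is the easiest way to find the crossover point where the array representation starts paying off over a dict-based one.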
Perry
-- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
Perry Greenfield <perry@stsci.edu>: [snip]
First of all, it may make sense, but I should say a few words about what scale sizes make sense. [snip] So if you are working with much smaller arrays than 10K, you won't see total execution time decrease much
In relation to what? Using dictionaries etc? Using the array module?
No, in relation to operations on a 10K array. Basically, if an operation on a 10K array spends half its time on set-up, operations on a 10-element array may only be twice as fast. I'm not making any claims about speed in relation to any other data structure (other than Numeric).
[snip]
Before I go further, I need to find out if the preceding has made you gasp in horror or if the timescale is too slow for you to accept.
Hm. If you need 10000 elements before numarray pays off, I'm starting to wonder if I can use it for anything at all. :I
I didn't make clear that this threshold may improve (decrease) in the future. The corresponding threshold for Numeric is probably around 1000 to 2000 elements. (Likewise, operations on 10-element Numeric arrays are only about twice as fast as for 1K arrays.) We may be able to eventually improve numarray performance to something in that neighborhood (if we are lucky), but I would be surprised to do much better (though if we use caching techniques, perhaps repeated cases of arrays of identical shape, strides, type, etc. may run much faster on subsequent operations). As usual, performance issues can be complicated. You have to keep in mind that Numeric and numarray provide much richer indexing and conversion handling features than something like the array module, and that comes at some price in performance for small arrays.
(This particular issue also makes me wonder if numarray would ever be a suitable substitute for the existing array module).
Indeed.
What size graphs are you most concerned about as far as speed goes?
I'm not sure. A wide range, I should imagine. But with only 100 nodes, I'll get 10000 entries in the adjacency matrix, so perhaps it's worthwhile anyway?
That's right; 100 nodes is about where performance becomes competitive, and if you are mostly worried about cases larger than that, then it isn't a problem. But if you are operating mostly on small graphs, then it may not be appropriate. The corresponding threshold for Numeric would be on the order of 30 nodes.
On the other hand, since numarray has much better support for index arrays -- i.e., an array of indices that may be used to index another array of values -- an (index array, value array) pair may itself serve as a storage model for sparse arrays.
That's an interesting idea, although I don't quite see how it would help in the case of adjacency matrices. (You'd still need at least one n**2 size matrix for n nodes, wouldn't you -- i.e. the index array... Right?)
Right.
Perry Greenfield <perry@stsci.edu>:
[snip]
In relation to what? Using dictionaries etc? Using the array module?
No, in relation to operations on a 10K array. Basically, if an operation on a 10K array spends half its time on set-up, operations on a 10-element array may only be twice as fast. I'm not making any claims about speed in relation to any other data structure (other than Numeric).
Aaah! Sorry to be so dense. :) But the speedup in numarray between different sizes isn't as important to me as the speedup compared to other solutions (such as a dict-based one), of course... If a 10-element array is only twice as fast as a 10K array, that's no problem if it's still faster than an alternative solution (though I'm sure it might not be...). The same goes for 10K-element graphs -- the interesting point has to be whether it's faster than various alternatives (which I'm sure it is).
[snip]
Before I go further, I need to find out if the preceding has made you gasp in horror or if the timescale is too slow for you to accept.
Hm. If you need 10000 elements before numarray pays off, I'm starting to wonder if I can use it for anything at all. :I
I didn't make clear that this threshold may improve (decrease) in the future.
Right. Good. And -- on small graphs performance probably won't be much of a problem anyway. :)
The corresponding threshold for Numeric is probably around 1000 to 2000 elements. (Likewise, operations on 10-element Numeric arrays are only about twice as fast as for 1K arrays.) We may be able to eventually improve numarray performance to something in that neighborhood (if we are lucky), but I would be surprised to do much better (though if we use caching techniques, perhaps repeated cases of arrays of identical shape, strides, type, etc. may run much faster on subsequent operations). As usual, performance issues can be complicated. You have to keep in mind that Numeric and numarray provide much richer indexing and conversion handling features than something like the array module, and that comes at some price in performance for small arrays.
Of course. I guess an alternative (for the graph situation) could be to wrap the graphs with a common interface with various implementations, so that a solution more optimised for small graphs could be used (in a factory function) if the graph is small... (Not really an issue for me at the moment, but should be easy to do, I guess.) [snip]
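The wrapper-plus-factory idea can be sketched briefly; the class names, methods, and the size threshold below are all hypothetical choices for illustration:

```python
import numpy as np

class MatrixGraph:
    """Dense adjacency-matrix graph; suited to large, dense graphs."""
    def __init__(self, n):
        self.adj = np.zeros((n, n), dtype=bool)
    def add_edge(self, u, v):
        self.adj[u, v] = True
    def has_edge(self, u, v):
        return bool(self.adj[u, v])

class DictGraph:
    """Dict-of-sets graph; cheaper for small or sparse graphs."""
    def __init__(self, n):
        self.adj = {u: set() for u in range(n)}
    def add_edge(self, u, v):
        self.adj[u].add(v)
    def has_edge(self, u, v):
        return v in self.adj[u]

def make_graph(n, dense_threshold=100):
    """Factory: pick an implementation based on the node count."""
    return MatrixGraph(n) if n >= dense_threshold else DictGraph(n)
```

Callers only see the shared `add_edge`/`has_edge` interface, so the storage choice stays an internal detail of the factory.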
I'm not sure. A wide range, I should imagine. But with only 100 nodes, I'll get 10000 entries in the adjacency matrix, so perhaps it's worthwhile anyway?
That's right; 100 nodes is about where performance becomes competitive, and if you are mostly worried about cases larger than that, then it isn't a problem.
Seems probable. For smaller problems I wouldn't be thinking in terms of numarray anyway, I think. (Just using plain Python dicts or something similar.) [snip]
On the other hand, since numarray has much better support for index arrays -- i.e., an array of indices that may be used to index another array of values -- an (index array, value array) pair may itself serve as a storage model for sparse arrays.
That's an interesting idea, although I don't quite see how it would help in the case of adjacency matrices. (You'd still need at least one n**2 size matrix for n nodes, wouldn't you -- i.e. the index array... Right?)
Right.
I might as well use a full adjacency matrix, then... So, the conclusion for now is that numarray may well be suited for working with relatively large (100+ nodes), relatively dense graphs. Now, the next interesting question is how many of the standard graph algorithms can be implemented with ufuncs and array operations (which I guess is the key to performance) and not straight for-loops... After all, some of them are quite sequential.

-- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
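One standard algorithm that does cast well into array operations is all-pairs shortest paths (Floyd-Warshall): the two inner loops collapse into a broadcasted minimum, leaving only the loop over intermediate nodes. A sketch using modern NumPy in place of numarray:

```python
import numpy as np

def floyd_warshall(weights):
    """All-pairs shortest path distances from a dense weight matrix.

    weights[i, j] is the edge weight, np.inf where no edge exists.
    Only the outer loop over intermediate nodes stays in Python;
    the two inner loops become one broadcasted minimum per step.
    """
    dist = weights.copy()
    n = dist.shape[0]
    for k in range(n):
        # dist[i, j] = min(dist[i, j], dist[i, k] + dist[k, j])
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist

inf = np.inf
w = np.array([[0.0, 3.0, inf],
              [inf, 0.0, 1.0],
              [inf, inf, 0.0]])
d = floyd_warshall(w)   # d[0, 2] is 4.0, via the path 0 -> 1 -> 2
```

Genuinely sequential algorithms (say, depth-first search) vectorize far less cleanly, which bears out the caveat above.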
I'm sorry I missed the original post, but the topic is important for me. I use the lightweight 3D volume renderer Animabob for most everything. The interface code is in all of the FDTD programs on my website. You just unwind a 3D array, scale it to +/- 128, turn it into characters, and you have the input file. I wish Animabob could somehow be turned into a Python package, as on Windows you need Cygwin to run it. I've tried other 3D packages like OpenDX, and they seem to be huge albatrosses. -- ----------------------------- The Numeric Python EM Project www.pythonemproject.com
rob wrote:
I'm sorry I missed the original post, but the topic is important for me. I use the lightweight 3D volume renderer Animabob for most everything. The interface code is in all of the FDTD programs on my website. You just unwind a 3D array, scale it to +/- 128, turn it into characters, and you have the input file. I wish Animabob could somehow be turned into a Python package, as on Windows you need Cygwin to run it. I've tried other 3D packages like OpenDX, and they seem to be huge albatrosses.
It sounds like you are trying to do something different from Magnus, but if what you are looking to do is scale floating-point or int data to byte size and apply some character mapping, numarray (or Numeric) should be able to do that very well. If that is all you want done, you might find either to be overkill, though (if you already wrote a C extension to do so). Perry
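The flatten-and-rescale step rob describes might look like the following hypothetical sketch (modern NumPy; the actual Animabob input format isn't specified here, and since a signed byte spans -128..127, 127 is used as the peak value):

```python
import numpy as np

def to_signed_bytes(volume):
    """Flatten a 3-D float array and rescale it into signed bytes.

    Hypothetical sketch of the preprocessing described above, not
    a real Animabob writer: normalize by the peak magnitude, map
    into [-127, 127], and emit one byte per sample.
    """
    flat = volume.ravel().astype(np.float64)
    peak = np.abs(flat).max()
    if peak == 0:
        peak = 1.0                      # avoid division by zero
    scaled = np.round(flat / peak * 127).astype(np.int8)
    return scaled.tobytes()

vol = np.linspace(-1.0, 1.0, 8).reshape(2, 2, 2)
data = to_signed_bytes(vol)             # 8 bytes, values in [-127, 127]
```

The whole conversion is array arithmetic, so it runs at C speed regardless of volume size -- the case where Numeric/numarray shine.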
Now, the next interesting question is how many of the standard graph algorithms can be implemented with ufuncs and array operations (which I guess is the key to performance) and not straight for-loops... After all, some of them are quite sequential.
I'm not sure about that (not being very familiar with graph algorithms). If you can give me some examples (perhaps off the mailing list) I could say whether they are easily cast into ufunc or library calls. Perry
simple dict-based stuff -- more performance is needed). kjbuckets looks like a nice alternative, as does the Boost Graph Library (not
kjbuckets is *very* nice indeed. It is a compact and very fast implementation. I don't see why you'd want to wrap this functionality into NumPy, which has a very well-defined scope and an efficient implementation. It would be a shame to bloat it with something which is distinctly different. I have modified kjbuckets so that it compiles and works with Python 2.x. You can pick it up at ftp://xray.imsb.au.dk/pub/birdwash/packages/Python2.1/SRPMS/python-kjbuckets-2.2-7.src.rpm -- just do "rpm --rebuild" on it. I sent the patch to the original author, but it appears he is no longer maintaining it. Never mind, it works great. /Morten -- Morten Kjeldgaard <mok@imsb.au.dk> | Phone : +45 89 42 50 26 Institute of Molecular and Structural Biology | Fax : +45 86 12 31 78 Aarhus University | Home : +45 86 18 81 80 Gustav Wieds Vej 10 C, DK-8000 Aarhus C, Denmark | http://imsb.au.dk/~mok
Morten Kjeldgaard <mok@imsb.au.dk>:
simple dict-based stuff -- more performance is needed). kjbuckets looks like a nice alternative, as does the Boost Graph Library (not
Kjbuckets is *very* nice indeed.
Yes, I guess it is. But the project doesn't seem very active...
It is a compact and very fast implementation. I don't see why you'd want to wrap this functionality into NumPy, which has a very well-defined scope and an efficient implementation. It would be a shame to bloat it with something which is distinctly different.
Yes, I guess you're right. There is no point in adding this sort of thing to numarray. My motivation for using numarray in my implementations was simply that it would mean that the necessary tools would be (or might be in the future ;) available in the standard distribution.
I have modified kjbuckets so that it compiles and works with Python 2.x. You can pick it up at
ftp://xray.imsb.au.dk/pub/birdwash/packages/Python2.1/SRPMS/python-kjbuckets-2.2-7.src.rpm
Just do "rpm --rebuild" on it.
I sent the patch to the original author, but it appears he is no longer maintaining it. Never mind, it works great.
Well... I do sort of mind... I'm a bit wary of using unmaintained software. Not that I would never do it or anything... But I think it would be a bonus to use stuff that is being actively maintained and developed. But I guess I'll take another look at it. (Any idea where the "kj" prefix comes from, by the way?)
/Morten
-- Magnus Lie Hetland The Anygui Project http://hetland.org http://anygui.org
participants (5)
- Konrad Hinsen
- Magnus Lie Hetland
- Morten Kjeldgaard
- Perry Greenfield
- rob