<br><br><div class="gmail_quote">On Fri, Dec 11, 2009 at 3:12 AM, Wolodja Wentland <span dir="ltr"><<a href="mailto:wentland@cl.uni-heidelberg.de">wentland@cl.uni-heidelberg.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi all,<br>
<br>
I am writing a library for accessing Wikipedia data. It includes a module<br>
that generates graphs from the link structure between articles and other<br>
pages (like categories).<br>
<br>
These graphs could easily contain a few million nodes that are heavily<br>
interlinked. The graphs I am building right now have around 300,000 nodes<br>
with an average in/out degree of - say - 4 and already need around 1-2 GB of<br>
memory. I use networkx to model the graphs and serialise them to files on<br>
disk (using adjacency list format, pickle and/or GraphML).<br>
<br>
The recent thread on including a graph library in the stdlib spurred my<br>
interest and introduced me to a number of libraries I have not seen<br>
before. I would like to reevaluate my choice of networkx and need some<br>
help in doing so.<br>
<br>
I really like the API of networkx but have no problem switching to<br>
another one (right now). I have the impression that graph-tool might<br>
be faster and have a smaller memory footprint than networkx, but am<br>
unsure about that.<br>
<br>
Which library would you choose? This decision is quite important for me<br>
as the choice will influence my library's external interface. Or is<br>
there something like WSGI for graph libraries?<br>
<br>
kind regards<br></blockquote><div><br>I once computed the PageRank of the English Wikipedia. I ended up using the Boost Graph Library, which has a parallel implementation that runs on clusters. I tried to do it in Python first but failed because the memory requirements were too large. Both Boost and the parallel version have Python interfaces.<br>
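<br>To illustrate the memory point, here is a rough sketch of PageRank by power iteration that keeps the graph in flat typed arrays rather than per-node Python objects. It is not the Boost implementation and not what I actually ran; the function name pagerank, the damping value 0.85 and the fixed iteration count are assumptions for illustration, and nodes are assumed to be numbered 0 to num_nodes - 1.<br>

```python
from array import array


def pagerank(num_nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over an iterable of (source, target) pairs.

    Illustrative sketch only: node ids are assumed to be integers in
    range(num_nodes); damping and iterations are assumed defaults.
    """
    # Keep the edge list in two flat integer arrays instead of
    # dictionaries of dictionaries, so each edge costs a few bytes.
    sources = array("l")
    targets = array("l")
    out_degree = array("l", [0]) * num_nodes
    for s, t in edges:
        sources.append(s)
        targets.append(t)
        out_degree[s] += 1

    # Start from a uniform distribution over all nodes.
    rank = array("d", [1.0 / num_nodes]) * num_nodes
    for _ in range(iterations):
        # Rank sitting on nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in range(num_nodes) if out_degree[v] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = array("d", [base]) * num_nodes
        # Each node passes an equal share of its rank along every out-link.
        for s, t in zip(sources, targets):
            new_rank[t] += damping * rank[s] / out_degree[s]
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Tiny hypothetical graph: three pages linking in a cycle.
    print(list(pagerank(3, [(0, 1), (1, 2), (2, 0)])))
```

The pure-Python loop is slow for millions of edges, which is one reason a compiled library is the more practical choice at Wikipedia scale; the sketch only shows how the data layout affects memory use.<br>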
</div></div><br>