<br><br><div class="gmail_quote">On Fri, Dec 11, 2009 at 3:12 AM, Wolodja Wentland <span dir="ltr"><<a href="mailto:wentland@cl.uni-heidelberg.de">wentland@cl.uni-heidelberg.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Hi all,<br>

<br>

I am writing a library for accessing Wikipedia data and include a module<br>

that generates graphs from the link structure between articles and other<br>

pages (like categories).<br>

<br>

These graphs could easily contain some million nodes which are frequently<br>

linked. The graphs I am building right now have around 300.000 nodes<br>

with an average in/out degree of - say - 4 and already need around 1-2GB of<br>

memory. I use networkx to model the graphs and serialise them to files on<br>

the disk (using adjacency list format, pickle and/or GraphML).<br>

<br>

The recent thread on including a graph library in the stdlib spurred my<br>

interest and introduced me to a number of libraries I have not seen<br>

before. I would like to reevaluate my choice of networkx and need some<br>

help in doing so.<br>

<br>

I really like the API of networkx but have no problem in switching to<br>

another one (right now) .... I have the impression that graph-tool might<br>

be faster and have a smaller memory footprint than networkx, but am<br>

unsure about that.<br>

<br>

Which library would you choose? This decision is quite important for me<br>

as the choice will influence my library's external interface. Or is<br>

there something like WSGI for graph libraries?<br>

<br>

kind regards<br></blockquote><div><br>I once computed the PageRank of the English Wikipedia. I ended up using the Boost Graph Library, which has a parallel implementation that runs on clusters. I first tried to do it in Python but failed because the memory requirements were too large. Both Boost and the parallel version have Python interfaces.<br>
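<br>
As an illustration only (this is not the Boost code mentioned above), here is a minimal pure-Python sketch of PageRank by power iteration. It stores the edges in flat typed arrays from the standard library instead of per-node dicts, which keeps the per-edge memory cost to a few bytes. The function name pagerank, its parameters, the damping factor 0.85 and the iteration count are assumptions made for this sketch, and node ids are assumed to be integers from 0 to num_nodes - 1.<br>

```python
from array import array


def pagerank(num_nodes, sources, targets, damping=0.85, iterations=50):
    """Power-iteration PageRank over a flat edge list (sources[i] -> targets[i])."""
    # Count outgoing links per node.
    out_degree = array("l", [0]) * num_nodes
    for s in sources:
        out_degree[s] += 1

    # Start from a uniform distribution.
    rank = array("d", [1.0 / num_nodes]) * num_nodes
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in range(num_nodes) if out_degree[n] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = array("d", [base]) * num_nodes
        # Each node passes its rank on in equal shares along its out-links.
        for s, t in zip(sources, targets):
            new_rank[t] += damping * rank[s] / out_degree[s]
        rank = new_rank
    return rank


# Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
sources = array("l", [0, 0, 1, 2])
targets = array("l", [1, 2, 2, 0])
ranks = pagerank(3, sources, targets)
print(list(ranks))
```

This is only a sketch under the assumptions stated above. It runs a fixed number of iterations rather than testing for convergence, and a graph the size of the full English Wikipedia may still call for a compiled library such as Boost.<br>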


</div></div><br>