very large graph
hat at se-162.se.wtb.tue.nl
Tue Jun 24 12:58:36 CEST 2008
On 2008-06-24, MRAB <google at mrabarnett.plus.com> wrote:
> On Jun 24, 1:26 am, chrispoliq... at gmail.com wrote:
>> I need to represent the hyperlinks between a large number of HTML
>> files as a graph. My non-directed graph will have about 63,000 nodes
>> and and probably close to 500,000 edges.
>> I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
>> python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
>> generating a file to store the graph, and I have also looked into
>> Graphviz for visualization. I'm just not sure which modules are
>> best. I need to be able to do the following:
Afaik Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it, and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet.
You need a factor 28 more circles which leads to a circle of about 0.007".)
Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...
>> 1) The names of my nodes are not known ahead of time, so I will
>> extract the title from all the HTML files to name the nodes prior to
>> parsing the files for hyperlinks (edges).
>> 2) Every file will be parsed for links and nondirectional connections
>> will be drawn between the two nodes.
>> 3) The files might link to each other so the graph package needs to
>> be able to check to see if an edge between two nodes already exists,
>> or at least not double draw connections between the two nodes when
>> adding edges.
>> I'm relatively new to graph theory so I would greatly appreciate any
>> suggestions for filetypes. I imagine doing this as a python
>> dictionary with a list for the edges and a node:list paring is out of
>> the question for such a large graph?
> Perhaps a dictionary where the key is a node and the value is a set of
> destination nodes?
For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.
More information about the Python-list