[lxml-dev] Efficient methods to build a tree out of HTML structure?
Hi all,

I was wondering: what would be the most efficient way to access all the elements in the DOM tree, in some order, using lxml.etree? The methods I currently see in the docs return a class like ElementDepthfirstIterator or iterwalk, which have two issues:

1) The first gives a flat representation of the tree, so I lose the child/parent structure.
2) Things like iterwalk do return "start" and "end" actions, but instead of first doing an iterwalk and then parsing the results, is there a better way to construct the tree while iterwalk itself is running?

Or perhaps there is some method I've missed completely?

Quick note on what I'm trying to do: graphically represent the DOM structure of a page using a library like NetworkX.

Cheers,
Viksit
Hi, Viksit Gaur wrote:
2) Things like iterwalk do return "start" and "end" actions - but instead of first doing an iterwalk and then parsing the results, is there a better way to construct the tree when iterwalk itself is running?
I don't understand what you mean here. Are you modifying the tree during the iteration? Or are you thinking of some kind of pipelining? Stefan
Hi, Stefan Behnel wrote:
I don't understand what you mean here. Are you modifying the tree during the iteration? Or are you thinking of some kind of pipelining?
Hmm. The problem I faced was finding a way to assign a unique ID to each element on the page. Let's say I construct an iterwalk object. During this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID for each element). I'm not sure how to do this without extending the etree.so file in which iterwalk is implemented. Cheers, Viksit
Viksit Gaur wrote:
The problem I faced was finding a way to assign a unique ID to each element on the page.
Let's say I construct an iterwalk object. During this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID for each element).
I still don't understand what you mean by "build the tree". You can't construct a tree and run iterwalk at the same time. iterparse() will do that if you are parsing. Stefan
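To illustrate the iterparse() suggestion, here is a minimal sketch, not taken from the thread, of attaching an ID to every element while the tree is being parsed. The attribute name `uid`, the helper `parse_with_ids`, and the sample document are hypothetical, and the input is assumed to be well-formed XML/XHTML. The fallback import is only there so the sketch also runs where lxml is not installed, since the standard library offers the same iterparse() interface.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: the standard library exposes the same iterparse() interface.
    import xml.etree.ElementTree as etree


def parse_with_ids(source, attr="uid"):
    """Parse `source`, tag each element with an ID, and record tree edges.

    Returns (root, edges), where edges is a list of (parent_id, child_id).
    The attribute name `attr` is an arbitrary choice for this sketch.
    """
    stack = []    # IDs of the elements that are currently open
    edges = []    # (parent_id, child_id) pairs in document order
    counter = 0
    root = None
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            uid = str(counter)
            counter += 1
            elem.set(attr, uid)
            if stack:
                edges.append((stack[-1], uid))  # innermost open element is the parent
            else:
                root = elem                     # first "start" event is the root
            stack.append(uid)
        else:
            stack.pop()                         # "end": the element is closed
    return root, edges


SAMPLE = b"<html><head><title>t</title></head><body><p>a</p><p>b</p></body></html>"
root, edges = parse_with_ids(io.BytesIO(SAMPLE))
```

The stack of open elements is what preserves the parent/child structure that a flat iteration appears to lose: whatever is on top of the stack when a "start" event arrives is the new element's parent.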
On 16.05.2008 11:56, Stefan Behnel wrote:
I still don't understand what you mean by "build the tree". You can't construct a tree and run iterwalk at the same time. iterparse() will do that if you are parsing. [...]
I think he is talking about his own tree. The tree he is building to visualize the structure of the XML data. HTH, Dennis Benzinger
Hi, Dennis Benzinger wrote:
I think he is talking about his own tree. The tree he is building to visualize the structure of the XML data.
Ok, but if it's that, then I don't understand why iterating over the tree and adding an id attribute to each node won't do the job. Stefan
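The approach suggested here could look roughly like the following sketch: walk the already parsed tree once to add an ID attribute to each node, then walk it again to collect parent/child pairs for the visualization. The names `assign_ids`, `parent_child_edges`, the attribute name `uid`, and the sample markup are hypothetical. The fallback import only exists so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback: the calls used below also exist in the standard library.
    import xml.etree.ElementTree as etree


def assign_ids(root, attr="uid"):
    """Give every element a sequential ID in document order."""
    for number, elem in enumerate(root.iter("*")):
        elem.set(attr, str(number))


def parent_child_edges(root, attr="uid"):
    """Return (parent_id, child_id) pairs for every element below `root`."""
    edges = []
    for parent in root.iter("*"):
        for child in parent:
            # Skip comments and processing instructions, whose tag is not a string.
            if isinstance(child.tag, str):
                edges.append((parent.get(attr), child.get(attr)))
    return edges


SAMPLE = "<html><body><div><p>a</p><p>b</p></div></body></html>"
root = etree.fromstring(SAMPLE)
assign_ids(root)
edges = parent_child_edges(root)
```

An edge list in this shape can be passed to a graph library, for example NetworkX's `add_edges_from`, to draw the structure.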
participants (3)
- Dennis Benzinger
- Stefan Behnel
- Viksit Gaur