Mailman 3 [lxml-dev] Efficient methods to build a tree out of HTML structure? - lxml - The Python XML Toolkit

[lxml-dev] Efficient methods to build a tree out of HTML structure?

Viksit Gaur

15 May 2008

10:58 p.m.

Hi all,

I was wondering: what would be the most efficient way to access all the elements in the DOM tree, in some order, using lxml.etree? The methods I currently see in the docs return a class like ElementDepthFirstIterator or iterwalk, which have two issues:

1) The first gives a flat representation of the tree, so I lose the child/parent structure.
2) Things like iterwalk do return "start" and "end" actions, but instead of first doing an iterwalk and then parsing the results, is there a better way to construct the tree while iterwalk itself is running?

Or perhaps there is some method I've missed completely?

Quick note on what I'm trying to do: graphically represent the DOM structure of a page using a library like networkX.

Cheers,
Viksit

Stefan Behnel

16 May 2008

5:14 a.m.

Hi, Viksit Gaur wrote:

...

2) Things like iterwalk do return "start" and "end" actions - but instead of first doing an iterwalk and then parsing the results, is there a better way to construct the tree when iterwalk itself is running?

I don't understand what you mean here. Are you modifying the tree during the iteration? Or are you thinking of some kind of pipelining?

Stefan

Viksit Gaur

5:28 a.m.

Hi, Stefan Behnel wrote:

...

Hi,

Viksit Gaur wrote:

...
2) Things like iterwalk do return "start" and "end" actions - but instead of first doing an iterwalk and then parsing the results, is there a better way to construct the tree when iterwalk itself is running?

I don't understand what you mean here. Are you modifying the tree during the iteration? Or do you think of some kind of pipelining?

Hmm. The problem I faced was finding a way to assign a unique ID to each element on the page. Let's say I construct an iterwalk object. During this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID for each element). I'm not sure how to do this without extending the etree.so file in which iterwalk is implemented.

Cheers,
Viksit

Stefan Behnel

5:56 a.m.

Viksit Gaur wrote:

...

The problem I faced was finding a way to assign a unique ID to each element on the page.

Let's say I construct an iterwalk object. During this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID for each element).

I still don't understand what you mean by "build the tree". You can't construct a tree and run iterwalk at the same time. iterparse() will do that if you are parsing.

Stefan
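
As a rough illustration of the iterparse() suggestion above, the following sketch numbers each element while the document is still being parsed and records parent/child pairs along the way. The sample markup, the function name edges_while_parsing and the (number, tag) node format are assumptions made for this example, and the input is parsed as well-formed XML rather than real-world HTML.

```python
from io import BytesIO

from lxml import etree

# Hypothetical sample document (well-formed, so it parses as XML).
markup = b"<html><body><div><p>one</p><p>two</p></div></body></html>"


def edges_while_parsing(source):
    """Collect (parent, child) pairs while iterparse() builds the tree."""
    edges = []
    stack = []    # currently open elements, innermost last
    counter = 0   # running number used as a unique ID
    for event, element in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            node = (counter, element.tag)
            counter += 1
            if stack:
                # The innermost open element is the parent of the new one.
                edges.append((stack[-1], node))
            stack.append(node)
        else:
            # "end" event: the element is complete, so close it.
            stack.pop()
    return edges


edges = edges_while_parsing(BytesIO(markup))
for parent, child in edges:
    print(parent, "->", child)
```

The stack mirrors the currently open elements, so the parent/child structure is kept even though the events themselves arrive as a flat sequence.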

Dennis Benzinger

6:28 a.m.

On 16.05.2008 11:56, Stefan Behnel wrote:

...

Viksit Gaur wrote:

...
The problem I faced was finding a way to assign a unique ID to each element on the page.

Let's say I construct an iterwalk object. During this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID for each element).

I still don't understand what you mean by "build the tree". You can't construct a tree and run iterwalk at the same time. iterparse() will do that if you are parsing. [...]

I think he is talking about his own tree: the tree he is building to visualize the structure of the XML data.

HTH,
Dennis Benzinger
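
If the goal is indeed a separate structure for visualization, one possible approach is to walk an already parsed tree in document order and look up each element's parent with getparent(). The sketch below is only an assumption about what such a structure might look like: the sample markup and the names node_ids and graph_edges are invented for the example.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring(
    "<html><body><div><p>one</p><p>two</p></div></body></html>"
)

node_ids = {}     # element -> integer ID, assigned in document order
graph_edges = []  # (parent ID, child ID) pairs

for element in root.iter():
    if not isinstance(element.tag, str):
        # Skip comments and processing instructions.
        continue
    node_ids[element] = len(node_ids)
    parent = element.getparent()
    if parent is not None:
        # Document order guarantees the parent already has an ID.
        graph_edges.append((node_ids[parent], node_ids[element]))

for element, number in node_ids.items():
    print(number, element.tag)
print(graph_edges)
```

An edge list of this kind could presumably be handed to a graph library such as networkX, which the original question mentions; that step is left out here.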

Stefan Behnel

6:46 a.m.

Hi, Dennis Benzinger wrote:

...

On 16.05.2008 11:56, Stefan Behnel wrote:

...
Viksit Gaur wrote:

...
The problem I faced was finding a way to assign a unique ID to each element on the page.

Let's say I construct an iterwalk object. During this phase, I would like to not only build the tree, but also add some of my own information to each node (such as a unique ID for each element).

I still don't understand what you mean by "build the tree". You can't construct a tree and run iterwalk at the same time. iterparse() will do that if you are parsing. [...]

I think he is talking about his own tree: the tree he is building to visualize the structure of the XML data.

OK, but if that's the case, then I don't understand why iterating over the tree and adding an id attribute to each node won't do the job.

Stefan
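
A minimal sketch of the approach described above: iterate over the parsed tree and store a unique value in an attribute on each element. The attribute name data-node-id and the node-N value format are hypothetical choices for this example; a separate attribute is used here only so that id attributes already present in a page are not overwritten.

```python
from lxml import etree

# Hypothetical sample document.
root = etree.fromstring(
    "<html><body><div><p>one</p><p>two</p></div></body></html>"
)

counter = 0
for element in root.iter():
    if not isinstance(element.tag, str):
        # Comments and processing instructions carry no attributes.
        continue
    element.set("data-node-id", f"node-{counter}")
    counter += 1

# The IDs are now part of the tree and survive serialization.
ids = [element.get("data-node-id") for element in root.iter()]
print(etree.tostring(root, pretty_print=True).decode())
```

Because the value lives on the element itself, later code can read it back with element.get() and reach the parent through getparent() without keeping a second data structure.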
