getiterator vs xpath question

I am misunderstanding the difference between these two code blocks. I thought they would have the same result. I want to find every element in the tree that has an 'id' or a 'name' attribute so I can store that attribute. (I'm link-checking a static html site).
starting out:
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.parse('ugdet17.htm', parser=parser)
Then using getiterator(),
>>> for elem in tree.getiterator():
...     if elem.tag == 'div':
...         if elem.get('class') == 'section':
...             print elem.attrib
...
{'class': 'section', 'id': 'ugseldet'}
And I *thought* the xpath would return the same thing.
>>> for div in tree.xpath('//div[@class="section"]'):
...     print div.attrib
...
{'class': 'section', 'id': 'ugseldet'}
{'class': 'section', 'id': 'ugstep'}
So how do I go through the tree and get each id or name attribute value?
thanks, --Tim Arnold

Tim Arnold, 01.08.2012 00:48:
> I am misunderstanding the difference between these two code blocks. I thought they would have the same result. I want to find every element in the tree that has an 'id' or a 'name' attribute so I can store that attribute. (I'm link-checking a static html site).
Have a look at the link iterator in lxml.html. It also handles references and links in CSS, for example.
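A minimal sketch of what that link iterator looks like in use, assuming the method meant here is iterlinks() on lxml.html elements. The sample markup and file names are made up for illustration (the original 'ugdet17.htm' is not available), and the code uses Python 3 syntax rather than the Python 2 print statements in the session above.

```python
from lxml import html

# Hypothetical sample page standing in for the real document.
SAMPLE = """
<html><head><link rel="stylesheet" href="style.css"></head>
<body>
<a href="ugdet18.htm#ugstep">next</a>
<img src="fig1.png">
</body></html>
"""

doc = html.fromstring(SAMPLE)

# iterlinks() yields (element, attribute, link, pos) tuples for every
# link-like attribute it knows about (href, src, ...).
links = [(el.tag, attr, link) for el, attr, link, pos in doc.iterlinks()]

for item in links:
    print(item)
```

For a link checker, the fragment part of each link (after the '#') is what would be compared against the 'id' and 'name' values collected from the target page.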
> starting out:
> >>> from lxml import etree
> >>> parser = etree.HTMLParser()
> >>> tree = etree.parse('ugdet17.htm', parser=parser)
> Then using getiterator(),
> >>> for elem in tree.getiterator():
> ...     if elem.tag == 'div':
> ...         if elem.get('class') == 'section':
> ...             print elem.attrib
> ...
> {'class': 'section', 'id': 'ugseldet'}
> And I *thought* the xpath would return the same thing.
> >>> for div in tree.xpath('//div[@class="section"]'):
> ...     print div.attrib
> ...
> {'class': 'section', 'id': 'ugseldet'}
> {'class': 'section', 'id': 'ugstep'}
I don't see the difference, either. Could you put the HTML document on a web server somewhere so that others can try to reproduce it? Which of the two results is the one you expected? Is this taken from a single Python session, without reparsing in between?
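One way to check this without the original file is a small self-contained test, sketched below. The inline document is an assumption: it only reuses the two id values from the session above, and the real page may be structured differently. It is written in Python 3 syntax.

```python
from lxml import etree

# Hypothetical stand-in for 'ugdet17.htm' with two section divs.
SAMPLE = """
<html><body>
<div class="section" id="ugseldet"><p>first</p></div>
<div class="section" id="ugstep"><p>second</p></div>
</body></html>
"""

root = etree.fromstring(SAMPLE, etree.HTMLParser())

# Approach 1: walk the div elements and filter on the class attribute.
via_iter = [dict(el.attrib) for el in root.iter('div')
            if el.get('class') == 'section']

# Approach 2: let XPath do the filtering.
via_xpath = [dict(el.attrib) for el in root.xpath('//div[@class="section"]')]

print(via_iter)
print(via_xpath)
```

On this sample both lists contain the same two attribute dictionaries, so a differing result on the real file would point at the document itself or at the session (for example, a reparse in between) rather than at the two APIs.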
> So how do I go through the tree and get each id or name attribute value?
I'd use this code:
    for elem in tree.iter('div'):
        if elem.get('class') == 'section':
            print elem.attrib
Likely faster than any of the above.
Stefan
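For the original goal of collecting every 'id' or 'name' value in the tree, not only those on section divs, a sketch along the following lines may help. It is not from the thread: the sample markup is invented, the variable names are illustrative, and it uses Python 3 syntax.

```python
from lxml import etree

# Hypothetical sample; replace with the parsed tree of the real page.
SAMPLE = """
<html><body>
<div class="section" id="ugseldet"><a name="top"></a></div>
<div class="section" id="ugstep"><p>no anchor here</p></div>
</body></html>
"""

root = etree.fromstring(SAMPLE, etree.HTMLParser())

# Variant 1: select the attribute values directly with XPath.
anchors = set(root.xpath('//@id | //@name'))

# Variant 2: select the elements, then read the attributes in Python.
anchors_loop = set()
for el in root.xpath('//*[@id or @name]'):
    for key in ('id', 'name'):
        value = el.get(key)
        if value is not None:
            anchors_loop.add(value)

print(sorted(anchors))
```

Both variants should yield the same set. The second is the one to extend if the element itself is also needed, for example to report the tag or line number of a duplicate anchor.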