Hi,
while implementing a script that should extract a few things from
HTML pages, I've run into two problems with lxml.html. I'm not sure
if those are actually bugs or if I am doing something wrong, so I
haven't created entries in the bug tracker yet.
Problem 1: Top level comment has no parent
,----
| >>> import lxml.html
| >>> html = "<!-- comment --><html><head><title>foo</title></head></html>"
| >>> tree = lxml.html.fromstring(html)
| >>> [tag.drop_tree() for tag in tree.xpath("//comment()")]
| Traceback (most recent call last):
| File "<stdin>", line 1, in <module>
| File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 169,
| in drop_tree
| assert parent is not None
`----
This way of removing nodes works for any comment or element, except
for a comment at the top level (which unfortunately is exactly what
the pages I'm processing contain).
Is that behaviour related to the problems mentioned in the FAQ entry
"Why can't I just delete parents or clear the root node in
iterparse()"? I have tried to use tree.remove(tag) as well, but then
I get a ValueError: Element is not a child of this node.
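To illustrate what I think is going on, here is a minimal sketch. It uses lxml.etree with the XML parser and a made-up document instead of lxml.html, on the assumption that the HTML tree behaves the same way. A comment placed before the root element can be reached as a sibling of the root, but it reports no parent, which would explain the failing assertion in drop_tree():

```python
from lxml import etree

# A made-up document with a comment in front of the root element.
root = etree.fromstring("<!-- comment --><root><child/></root>")

# The top-level comment is reachable as the previous sibling of the root.
top_comment = root.getprevious()

# It has no parent element, unlike a node inside the tree.
top_comment_parent = top_comment.getparent()
child_parent = root[0].getparent()
```

If that is right, anything that works through getparent() would have no element to operate on for such a comment.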
I want to remove the comments (and some other elements) because
itertext() would otherwise return their text as well.
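A possible stopgap would be to collect the text without removing the comments first. The following is only a sketch: iter_text_skipping_comments is a hypothetical helper (not part of lxml), it is shown on a made-up XML snippet with lxml.etree, and it skips comments only, not processing instructions:

```python
from lxml import etree


def iter_text_skipping_comments(element):
    """Yield the text chunks below `element` in document order, without comment bodies."""
    if element.text:
        yield element.text
    for child in element:
        if child.tag is not etree.Comment:
            # Regular element: descend into it.
            for chunk in iter_text_skipping_comments(child):
                yield chunk
        # Text following a child (even a comment) belongs to the parent's content.
        if child.tail:
            yield child.tail


# Made-up input: the comment body is dropped, the text after it is kept.
root = etree.fromstring("<root>a<!-- hidden -->b<p>c</p>d</root>")
chunks = list(iter_text_skipping_comments(root))
```

This leaves the tree untouched, so it would not matter whether a comment sits at the top level or deeper in the document.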
Problem 2: lxml.html.clean.Cleaner removes meta content
,----
| >>> import lxml.html.clean
| >>> html = "<html><head><meta name=\"keywords\" content=\"foo\"></head></html>"
| >>> cleaner = lxml.html.clean.Cleaner()
| >>> cleaner.page_structure = False
| >>> cleaner.meta = False
| >>> tree = lxml.html.fromstring(html)
| >>> cleaner(tree)
| >>> lxml.html.tostring(tree)
| '<html><head><meta name="keywords"></head></html>'
`----
To work around the comment problem above, I tried using the cleaner
to get rid of the comments (which works fine), but it then also
strips the content attributes from the meta tags. Is that intended?
Versions used (slightly newer ones are available, but their
changelogs didn't mention anything that looked related to these
problems):
Python : sys.version_info(major=2, minor=7, micro=2, releaselevel='final', serial=0)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Thanks,
Adalbert