[lxml-dev] Question on clean_html
Hi, I would like to use lxml to remove all tags except 'a' tags. Is this possible? I don't seem to understand the arguments to the Cleaner class. What does allow_tags do? I tried this:
    c = Cleaner(allow_tags=('a',), remove_unknown_tags=False)
    print c.clean_html('<b>Hi</b>')
    <b>Hi</b>
Do I instead have to list all the tags I don't want, except for 'a', in a remove_tags keyword argument? Any hints? Thank you.
Brian Neal wrote:
> Hi,
> I would like to use lxml to remove all tags except 'a' tags. Is this possible?
> I don't seem to understand the arguments to the Cleaner class. What does allow_tags do?
> I tried this:
>
>     c = Cleaner(allow_tags=('a',), remove_unknown_tags=False)
>     print c.clean_html('<b>Hi</b>')
>     <b>Hi</b>
>
> Do I instead have to list all the tags I don't want, except for 'a', in a remove_tags keyword argument?
> Any hints? Thank you.
There's not really a way to do this with the Cleaner, I'm afraid. (Hrm... I really need to clean up the options there, as they overlap in lots of weird ways and are confusing.)

The method .drop_tag could help here, like (untested):

    for el in list(doc.iter()):
        if el.tag not in ['a']:
            el.drop_tag()

I'm not 100% sure what happens if you modify the tree in place like this, though I think list() will make it work.

-- 
Ian Bicking : ianb@colorstudy.com : http://blog.ianbicking.org
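As a rough illustration of the .drop_tag idea, here is a minimal sketch, assuming Python 3 and the lxml.html parser. The helper name `strip_tags_except`, its `keep` parameter, and the `<div>` wrapper are made up for this example and are not part of lxml. The fragment is parsed under a wrapper element and only the wrapper's descendants are visited, because drop_tag() merges an element's text and children into its parent and so needs a parent to exist.

```python
import lxml.html


def strip_tags_except(html_fragment, keep=("a",)):
    """Hypothetical helper: unwrap every element whose tag is not in `keep`."""
    # Parse under a <div> wrapper so there is always a single root element.
    root = lxml.html.fragment_fromstring(html_fragment, create_parent="div")
    # Snapshot the descendants first, since drop_tag() changes the tree.
    for el in list(root.iterdescendants()):
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        if el.tag not in keep:
            # Removes the tag but keeps its text and children in place.
            el.drop_tag()
    return root


result = strip_tags_except('<p><b>Hi</b> <a href="/x">link <i>here</i></a>!</p>')
print(lxml.html.tostring(result, encoding="unicode"))
```

The returned tree still has the wrapper `<div>` as its root, so the result is a flat run of text and links rather than a structured HTML document.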
Ian Bicking wrote:
>     for el in list(doc.iter()):
>         if el.tag not in ['a']:
>             el.drop_tag()
>
> I'm not 100% sure what happens if you modify the tree in place like this, though I think list() will make it work.
It will at least refuse to drop the root element. Running through list(root.iterdescendants()) should work, though, although the above will definitely not result in a valid HTML document.

If you are really only interested in a couple of tags without a meaningful structure, you should collect them in a list rather than cutting everything else out of the document (which is quite costly).

Stefan
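The "collect them in a list" alternative might look like the following minimal sketch, assuming Python 3 and lxml.html. The sample markup and the variable names `anchors` and `links` are invented for illustration. The tree is only read, never modified, so there is no need to snapshot the iterator or to worry about the root element.

```python
import lxml.html

# Sample input, made up for this example.
doc = lxml.html.fromstring(
    '<div><p><b>Hi</b> <a href="/one">first</a></p>'
    '<ul><li><a href="/two">second <i>link</i></a></li></ul></div>'
)

# Gather the <a> elements in document order without touching the tree.
anchors = list(doc.iter("a"))

# Reduce each anchor to the pieces of interest: its href and its plain text.
links = [(a.get("href"), a.text_content()) for a in anchors]
print(links)
```

Here text_content() flattens nested markup inside each link, which matches the goal of keeping nothing but the 'a' tags.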
participants (3)
- Brian Neal
- Ian Bicking
- Stefan Behnel