[lxml-dev] Question on clean_html

Hi, I would like to use lxml to remove all tags except 'a' tags. Is this possible? I don't seem to understand the arguments to the Cleaner class. What does allow_tags do? I tried this:
Do I instead have to list all the tags I don't want, except for 'a', in a remove_tags keyword argument? Any hints? Thank you.

Brian Neal wrote:
There's not really a way to do this with the Cleaner, I'm afraid. (Hrm... I really need to clean up the options there, as they overlap in lots of weird ways and are confusing.) The method .drop_tag could help here, like (untested):

    for el in list(doc.iter()):
        if el.tag not in ['a']:
            el.drop_tag()

I'm not 100% sure what happens if you modify the tree in place like this, though I think list() will make it work.

-- Ian Bicking : ianb@colorstudy.com : http://blog.ianbicking.org
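For reference, the drop_tag idea above can be written out as a small self-contained sketch. The function name keep_only_links and the wrapping "div" parent are assumptions made for illustration, not part of lxml or of the original post. The sketch walks descendants only, because the root element has no parent to merge its content into, and it skips comments and other nodes whose tag is not a string.

```python
import lxml.html


def keep_only_links(html_text):
    """Return a <div> holding html_text with every tag except <a> dropped.

    Hypothetical helper: the name and the wrapping <div> are assumptions.
    """
    # Wrap the fragment so there is always a root element that stays in place.
    root = lxml.html.fragment_fromstring(html_text, create_parent="div")
    # list() takes a snapshot, so modifying the tree does not disturb iteration.
    for el in list(root.iterdescendants()):
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str) and el.tag != "a":
            # drop_tag() removes the tag but keeps its text and children.
            el.drop_tag()
    return root


if __name__ == "__main__":
    cleaned = keep_only_links(
        '<p>Hello <b>bold</b> <a href="http://example.com">a <i>link</i></a> end</p>'
    )
    print(lxml.html.tostring(cleaned, encoding="unicode"))
```

The text of the dropped elements is kept and merged into the surrounding content; only the markup other than the 'a' tags goes away.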

Ian Bicking wrote:
It will at least refuse to drop the root element. Running through list(root.iterdescendants()) should work, though the above will definitely not result in a valid HTML document. If you are really only interested in a couple of tags without a meaningful structure, you should collect them in a list rather than cutting everything else out of the document (which is quite costly).

Stefan
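The "collect them in a list" suggestion might look like the following minimal sketch. The function name collect_links and the choice to return (href, text) pairs are assumptions for illustration; the tree itself is left untouched.

```python
import lxml.html


def collect_links(html_text):
    """Return (href, text) pairs for every <a> element, in document order.

    Hypothetical helper: the name and return shape are assumptions.
    """
    doc = lxml.html.fromstring(html_text)
    # iter("a") visits only the <a> elements; nothing is removed from the tree.
    return [(a.get("href"), a.text_content()) for a in doc.iter("a")]


if __name__ == "__main__":
    sample = '<p>See <a href="/one">first</a> and <a href="/two">the <b>second</b></a>.</p>'
    for href, text in collect_links(sample):
        print(href, text)
```

This only reads the tree, so it avoids the cost of restructuring the document when the links are all that is needed.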

participants (3)
- Brian Neal
- Ian Bicking
- Stefan Behnel