[lxml-dev] Question on clean_html
Hi, I would like to use lxml to remove all tags except 'a' tags. Is this possible? I don't seem to understand the arguments to the Cleaner class. What does allow_tags do? I tried this:
Do I instead have to list all the tags I don't want, except for 'a', in a remove_tags keyword argument? Any hints? Thank you.
Brian Neal wrote:
There's not really a way to do this with the Cleaner, I'm afraid. (Hrm... I really need to clean up the options there, as they overlap in lots of weird ways and are confusing.) The method .drop_tag could help here, like (untested):

    for el in list(doc.iter()):
        if el.tag not in ['a']:
            el.drop_tag()

I'm not 100% sure what happens if you modify the tree in place like this, though I think list() will make it work.

--
Ian Bicking : ianb@colorstudy.com : http://blog.ianbicking.org
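A minimal sketch of the .drop_tag idea, for illustration only. The function name keep_only_links and the sample markup are hypothetical, and the fragment is wrapped in a "div" parent (an assumption made here) so that every element being dropped has a parent to merge its text into. Only descendants are visited, since the root itself has no parent and cannot be dropped.

```python
import lxml.html


def keep_only_links(html_text):
    """Strip every tag except <a>, keeping the text of the dropped tags."""
    # Wrap the fragment in a <div> so all parsed elements have a parent.
    root = lxml.html.fragment_fromstring(html_text, create_parent="div")
    # Snapshot the descendants first: the tree is modified while looping.
    for el in list(root.iterdescendants()):
        if not isinstance(el.tag, str):
            continue  # leave comments and processing instructions alone
        if el.tag != "a":
            # drop_tag() removes the tag but merges its text, tail and
            # children into the parent.
            el.drop_tag()
    return root


sample = '<p>Hello <b>bold</b> <a href="http://example.com">link <i>here</i></a> tail</p>'
cleaned = keep_only_links(sample)
print(lxml.html.tostring(cleaned, encoding="unicode"))
```

The result is a "div" holding plain text and "a" elements, not a complete HTML document.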
Ian Bicking wrote:
It will at least refuse to drop the root element. Running through list(root.iterdescendants()) should work, though the above will definitely not result in a valid HTML document. If you are really only interested in a couple of tags without a meaningful structure, you should collect them in a list rather than cutting everything else out of the document (which is quite costly).

Stefan
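The "collect them in a list" suggestion might look roughly like the sketch below. The function name collect_links and the sample markup are hypothetical; the tree is only read, never modified, and each link is returned as an (href, text) pair.

```python
import lxml.html


def collect_links(html_text):
    """Return (href, text) pairs for every <a> element, in document order."""
    root = lxml.html.fromstring(html_text)
    # iter("a") walks the tree and yields only the <a> elements.
    return [(a.get("href"), a.text_content()) for a in root.iter("a")]


sample = '<p>See <a href="/one">first</a> and <a href="/two"><b>second</b> link</a>.</p>'
for href, text in collect_links(sample):
    print(href, text)
```

If the elements themselves are needed rather than their text, list(root.iter("a")) gives them directly.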
participants (3)
- Brian Neal
- Ian Bicking
- Stefan Behnel