
Hi,

I'm trying to clean an HTML file that contains meta tags. I want the meta tags to be preserved as-is. Unfortunately, the cleaner strips every attribute of the tag except "name". How can I prevent this behavior?

Here is some example source:

    import lxml.html.clean

    html = """<html>
    <head>
    <meta name="keywords" content="test">
    </head>
    </html>"""

    def clean_html(html):
        """Removes parts of HTML unnecessary for processing."""
        kill_tags = ["map", "base", "iframe", "select", "noscript"]
        kwargs = {"scripts": True,
                  "javascript": True,
                  "comments": True,
                  "style": True,
                  "links": True,
                  "meta": False,
                  "page_structure": False,
                  "processing_instructions": True,
                  "embedded": True,
                  "frames": False,
                  "forms": False,
                  "annoying_tags": True,
                  "kill_tags": kill_tags,
                  "whitelist_tags": ["meta"]}
        cleaner = lxml.html.clean.Cleaner(**kwargs)
        cleaned = cleaner.clean_html(unicode(html))
        return cleaned

    print clean_html(html)

On my system, I see this printed to standard output:

    <html>
    <head>
    <meta name="keywords">
    </head>
    </html>

How can I prevent the cleaner from removing the content attribute?

Cheers,
Michael
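One possible explanation, offered as an assumption rather than a confirmed diagnosis: the output above is what an attribute allowlist would produce if "name" were on the list and "content" were not. The sketch below models that mechanism with Python 3 and the standard library only. It does not use lxml, and `ALLOWED_ATTRS`, `AttrFilter`, and `filter_attrs` are hypothetical names invented for this illustration, not part of any library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only; this is NOT lxml's real list.
ALLOWED_ATTRS = frozenset(["name", "href", "title"])


class AttrFilter(HTMLParser):
    """Re-emit start tags, end tags, and text, dropping non-allowlisted attributes.

    This is a toy serializer: comments, doctypes, and self-closing syntax
    are not preserved faithfully.
    """

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Keep only the attributes whose names appear in the allowlist.
        kept = [(k, v) for k, v in attrs if k in self.allowed]
        rendered = "".join(
            ' %s="%s"' % (k, escape(v, quote=True)) if v is not None else " %s" % k
            for k, v in kept
        )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def filter_attrs(html, allowed):
    """Return `html` with every attribute outside `allowed` removed."""
    parser = AttrFilter(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


source = '<meta name="keywords" content="test">'

# With "content" missing from the allowlist, only "name" survives.
strict = filter_attrs(source, ALLOWED_ATTRS)
# Extending the allowlist keeps the tag intact.
relaxed = filter_attrs(source, ALLOWED_ATTRS | {"content"})

print(strict)
print(relaxed)
```

If the same mechanism is at work here, the thing to try would be the `safe_attrs_only` option of `lxml.html.clean.Cleaner`: adding `"safe_attrs_only": False` to `kwargs` should stop attribute filtering altogether. This has not been tested against the setup above, and turning the filter off also lets through attributes that the cleaner would otherwise strip for safety.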
participants (1): Misha Penkov