Hi,
I'm trying to clean an HTML file that contains meta tags. I want the meta
tags to be preserved as-is. Unfortunately, the cleaner removes everything
except the "name" attribute of the tag. How can I prevent this behavior?
Here is some example source:

import lxml.html.clean

html = """<html>
<head>
<meta name="keywords" content="test">
</head>
</html>"""

def clean_html(html):
    """Removes parts of HTML unnecessary for processing."""
    kill_tags = ["map", "base", "iframe", "select", "noscript"]
    kwargs = {"scripts": True,
              "javascript": True,
              "comments": True,
              "style": True,
              "links": True,
              "meta": False,
              "page_structure": False,
              "processing_instructions": True,
              "embedded": True,
              "frames": False,
              "forms": False,
              "annoying_tags": True,
              "kill_tags": kill_tags,
              "whitelist_tags": ["meta"]}
    cleaner = lxml.html.clean.Cleaner(**kwargs)
    cleaned = cleaner.clean_html(unicode(html))
    return cleaned

print clean_html(html)

On my system, I see this printed to standard output:
<html>
<head>
<meta name="keywords">
</head>
</html>
How can I prevent the cleaner from removing the content attribute?
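
A possible direction, sketched under the assumption that the attribute is dropped by the cleaner's attribute whitelist rather than by the "meta" option: Cleaner takes a safe_attrs_only flag that defaults to True, and "content" does not appear in lxml.html.defs.safe_attrs while "name" does. The helper name clean_keep_meta_attrs and the CLEANER_OPTIONS dict are made up for illustration. The import sits inside the function because recent lxml releases ship the cleaner in the separate lxml_html_clean package, so it may not be importable everywhere.

```python
# Hypothetical option set; safe_attrs_only=False is the assumed fix.
CLEANER_OPTIONS = {
    "meta": False,             # keep <meta> tags
    "page_structure": False,   # keep <html>, <head>, <title>
    "safe_attrs_only": False,  # do not strip attributes missing from the whitelist
}


def clean_keep_meta_attrs(html):
    """Clean html while leaving attributes such as "content" in place."""
    # Imported lazily: needs lxml (and lxml_html_clean on recent releases).
    import lxml.html.clean

    cleaner = lxml.html.clean.Cleaner(**CLEANER_OPTIONS)
    return cleaner.clean_html(html)
```

Note that this switches off attribute filtering for the whole document, not just for meta tags. Newer lxml versions also accept a safe_attrs argument, so passing the default set plus "content" may be a narrower alternative.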
Cheers,
Michael