Hi,

I'm trying to clean an HTML file that contains meta tags, and I want the meta tags to be preserved as-is. Unfortunately, the cleaner strips every attribute from the tag except "name". How can I prevent this behavior?

Here is some example source:

import lxml.html.clean

html = """<html>
<head>
</head>
</html>"""

def clean_html(html):
    """Removes parts of HTML unnecessary for processing."""
    kill_tags = ["map", "base", "iframe", "select", "noscript"]
    kwargs = {"scripts": True,
              "javascript": True,
              "comments": True,
              "style": True,
              "links": True,
              "meta": False,
              "page_structure": False,
              "processing_instructions": True,
              "embedded": True,
              "frames": False,
              "forms": False,
              "annoying_tags": True,
              "kill_tags": kill_tags,
              "whitelist_tags": ["meta"]}
    cleaner = lxml.html.clean.Cleaner(**kwargs)
    cleaned = cleaner.clean_html(unicode(html))
    return cleaned

print clean_html(html)

On my system, I see this printed to standard output:

<html>
<head>
</head>
</html>

How can I prevent the cleaner from removing the "content" attribute?

Cheers,

Michael