Why are meta tag attributes removed by Cleaner? - lxml - The Python XML Toolkit

Dec. 28, 2015

Hi,

I'm trying to clean an HTML file that contains meta tags, and I want the meta
tags to be preserved as-is. Unfortunately, the cleaner removes every attribute
of the tag except "name". How can I prevent this behavior?

Here is some example source:

import lxml.html.clean
html = """<html>
  <head>
    <meta name="keywords" content="test">
  </head>
</html>"""

def clean_html(html):
    """Removes parts of HTML unnecessary for processing."""
    kill_tags = ["map", "base", "iframe", "select", "noscript"]
    kwargs = {"scripts": True,
              "javascript": True,
              "comments": True,
              "style": True,
              "links": True,
              "meta": False,
              "page_structure": False,
              "processing_instructions": True,
              "embedded": True,
              "frames": False,
              "forms": False,
              "annoying_tags": True,
              "kill_tags": kill_tags,
              "whitelist_tags": ["meta"]}
    cleaner = lxml.html.clean.Cleaner(**kwargs)
    # Pass the text straight through; unicode() exists only on Python 2.
    cleaned = cleaner.clean_html(html)
    return cleaned

print(clean_html(html))

On my system, I see this printed to standard output:

<html>
  <head>
    <meta name="keywords">
  </head>
</html>

How can I prevent the cleaner from removing the content attribute?

Cheers,
Michael

Misha Penkov