Re: [lxml-dev] How to get HTML charset ?

March 30, 2010


On Sun, Mar 28, 2010 at 12:09 AM, David Shieh <mykingheaven@gmail.com> wrote:
Hi all,
I have used lxml for a long time and it works fine for me.
But now I am confused about charset handling. To get the
original charset of an HTML file, I used the code below:
    file_content = ''.join(
        [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
    )
    html = lxml.html.fromstring(file_content)
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)
Surprisingly, no <meta http-equiv="Content-Type" .. /> element is
printed. So how can I find out the original charset of this HTML?

You need to pass the keyword argument `include_meta_content_type=True` to
`tostring`; otherwise the <meta http-equiv="Content-Type" .. /> tag is
always stripped on serialization:

>>> from lxml.html import fromstring, tostring
>>> x = fromstring("""<html><head><meta http-equiv="Content-Type" content="text/html; charset=ASCII"></head></html>""")
>>> x.xpath("head/meta")
[<Element meta at 2004bb0>]
>>> [tostring(u) for u in x.xpath("head/meta")]
['']
>>> [tostring(u, include_meta_content_type=True) for u in x.xpath("head/meta")]
['<meta http-equiv="Content-Type" content="text/html; charset=ASCII">']
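
If the goal is only to learn the declared charset, it may be simpler to skip serialization and read the element's `content` attribute, then pull the `charset` parameter out of that value. The following is a minimal sketch of that second step using only the standard library; the helper name `charset_from_content_type` is hypothetical and not part of lxml, and it assumes the attribute holds an ordinary Content-Type value such as the one in the session above.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; `value` is assumed to be the
    text of a <meta http-equiv="Content-Type"> element's content attribute.
    """
    msg = Message()
    # Let the stdlib header parser split the value into its parameters.
    msg["Content-Type"] = value
    # get_param returns None when no charset parameter is present.
    return msg.get_param("charset")


print(charset_from_content_type("text/html; charset=ASCII"))
```

With lxml, the input string would typically come from something like `x.xpath("head/meta")[0].get("content")`, which reads the attribute directly and is not affected by the stripping that `tostring` performs.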


Ethan Jucovy