[lxml-dev] How to get HTML charset? - lxml - The Python XML Toolkit

March 28, 2010

Hi all,

I have used lxml for a long time and it works fine for me.
But now I am confused about charset handling. When I want to get the
original charset of an HTML file, I use the code below:

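    import lxml.html

    # response is the object returned by urllib2.urlopen().
    # Strip surrounding whitespace from every line, then join the lines.
    file_content = ''.join(
        [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
    )
    html = lxml.html.fromstring(file_content)
    # Print every <meta> element found directly under <head>.
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)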

Surprisingly, there is no output for any <meta http-equiv="Content-Type" ... />
element. So how can I find out the original charset of this HTML?
BTW, I also tried to get the charset with urllib2, using the code below:

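    import urllib2

    req = urllib2.Request(url)  # url: address of the page being fetched
    try:
        response = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print e.code
    else:
        # The Content-Type header may or may not carry a charset parameter.
        print response.headers.getheader('Content-Type')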
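
For illustration, here is a minimal sketch of pulling the charset parameter out of a Content-Type header value like the one printed above. It is written for Python 3 and uses only the standard library; the function name charset_from_content_type is hypothetical and not part of lxml or urllib2.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None.

    header_value is the raw header string, e.g. 'text/html; charset=UTF-8'.
    (Hypothetical helper, for illustration only.)
    """
    if not header_value:
        # The server sent no Content-Type header at all.
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_content_type("text/html"))
```

In Python 3 the response headers object is itself an email.message.Message subclass, so response.headers.get_content_charset() should give the same result without a helper.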

Not every site returns its charset; some sites don't return any charset
information at all.
What should I do if I really want to know the charset?
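
One possible fallback, sketched here under stated assumptions and not taken from this thread: when the header carries no charset, scan the first few kilobytes of the raw, undecoded response body for a charset declaration in a meta tag. The code is Python 3 and standard library only; the name sniff_meta_charset and the 4096-byte scan window are assumptions made for illustration, and a regular expression like this is only a rough approximation of what a real HTML parser or browser does.

```python
import re

# Matches charset=... inside a <meta> tag, covering both
#   <meta charset="utf-8">
#   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw_html, scan_bytes=4096):
    """Return the lowercased charset declared in a <meta> tag, or None.

    raw_html must be the undecoded bytes of the page.
    (Hypothetical helper, for illustration only.)
    """
    match = META_CHARSET_RE.search(raw_html[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


if __name__ == "__main__":
    page = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=gb2312"></head></html>')
    print(sniff_meta_charset(page))
```

The sketch works on bytes on purpose: the charset has to be known before the page can be decoded reliably, so the declaration is searched for in the raw data.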

Thanks, guys.

Best wishes,
David
-- 
----------------------------------------------
Attitude determines everything !
----------------------------------------------
