
Hi all,

I have used lxml for a long time and it works fine for me. But now I am confused about the charset handling. When I want to get the original charset of an HTML file, I use the code below:

    file_content = ''.join([i.rstrip('\r\n ').lstrip() for i in response.readlines()])
    html = lxml.html.fromstring(file_content)
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)

Surprisingly, there is no output for any <meta http-equiv="Content-Type" .. /> element. So how can I find out the original charset of this HTML?

BTW, I also tried to get the charset with urllib2, using the code below:

    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print e.code
    else:
        print response.headers.getheader('Content-Type')

Not every site returns its charset; some sites don't return any charset information at all. What can I do if I really want to know the charset?

Thanks, guys.

Best wishes,
David
--
Attitude determines everything !
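For the HTTP-header half of the question, here is a minimal sketch in Python 3 (the thread itself uses Python 2 and urllib2). It only shows how the charset parameter can be pulled out of a Content-Type header value using the standard library. The helper name `charset_from_content_type` is hypothetical and not part of urllib or lxml.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; header_value is the raw
    header string, e.g. what the server sent as Content-Type.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


with_charset = charset_from_content_type("text/html; charset=UTF-8")
without_charset = charset_from_content_type("text/html")
```

When the header has no charset, as in the second call, the caller still has to fall back to something else, such as the meta tag inside the document.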

On Sun, 2010-03-28 at 12:09 +0800, David Shieh wrote:
When I want to get the original charset of a html file, I used codes below:

    file_content = ''.join([i.rstrip('\r\n ').lstrip() for i in response.readlines()])
    html = lxml.html.fromstring(file_content)
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)

Try selecting the content attribute directly:

    xpath('.//meta[@http-equiv="Content-Type"]/@content')

I don't know whether this also matches a lower-case "content-type". If it does not, use a case-insensitive test:

    xpath('.//meta[re:test(@http-equiv, "^Content-Type$", "i")]',
          namespaces={"re": "http://exslt.org/regular-expressions"})
_______________________________________________ lxml-dev mailing list lxml-dev@codespeak.net http://codespeak.net/mailman/listinfo/lxml-dev
-- Sérgio M. B.
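Sérgio's case-insensitive XPath can be turned into a small function. This is a sketch in Python 3, assuming lxml is installed; the name `meta_charset` and the regular expression used to pick the charset out of the content attribute are illustrative choices, not something from the thread.

```python
import re

import lxml.html

EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}


def meta_charset(html_bytes):
    """Return the charset declared in a Content-Type meta tag, or None.

    Hypothetical helper: it reads the content attribute directly,
    so nothing has to be serialized with tostring().
    """
    doc = lxml.html.fromstring(html_bytes)
    # Case-insensitive match on http-equiv, as suggested in the reply above.
    contents = doc.xpath(
        './/meta[re:test(@http-equiv, "^Content-Type$", "i")]/@content',
        namespaces=EXSLT_RE,
    )
    for content in contents:
        match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content, re.I)
        if match:
            return match.group(1)
    return None


declared = meta_charset(
    b'<html><head><meta http-equiv="content-type" '
    b'content="text/html; charset=utf-8"></head><body><p>x</p></body></html>'
)
missing = meta_charset(
    b"<html><head><title>t</title></head><body><p>x</p></body></html>"
)
```

Passing the raw bytes of the response, rather than a decoded string, keeps the decision about the encoding with the parser.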

On Sun, Mar 28, 2010 at 12:09 AM, David Shieh <mykingheaven@gmail.com> wrote:
Surprisingly, there's no output of any <meta http-equiv="Content-Type" .. /> element. So, how can I know the original charset of this html?
You need to pass the kwarg `include_meta_content_type=True` to `tostring`, or the <meta http-equiv="Content-Type" .. /> tag will always be stripped on the way out --
    >>> from lxml.html import fromstring, tostring
    >>> x = fromstring("""<html><head><meta http-equiv="Content-Type" content="text/html; charset=ASCII"></head></html>""")
    >>> x.xpath("head/meta")
    [<Element meta at 2004bb0>]
    >>> [tostring(u) for u in x.xpath("head/meta")]
    ['']
    >>> [tostring(u, include_meta_content_type=True) for u in x.xpath("head/meta")]
    ['<meta http-equiv="Content-Type" content="text/html; charset=ASCII">']
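The same check can be written as a short Python 3 script, where `tostring` returns bytes instead of a Python 2 str. This is a sketch assuming a current lxml release; it also reads the attribute with `meta.get("content")`, which avoids serialization altogether.

```python
from lxml.html import fromstring, tostring

doc = fromstring(
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ASCII"></head><body></body></html>'
)
meta = doc.xpath("head/meta")[0]

# By default tostring() drops the Content-Type meta tag from its output.
stripped = tostring(meta)

# With the keyword argument from the reply above, the tag is kept.
kept = tostring(meta, include_meta_content_type=True)

# The element and its attributes are in the tree either way.
content = meta.get("content")
```

So the element was never missing from the parsed tree; only the serialized output leaves it out.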

2010/3/30 Ethan Jucovy <ethan.jucovy@gmail.com>
You need to pass the kwarg `include_meta_content_type=True` to `tostring`, or the <meta http-equiv="Content-Type" .. /> tag will always be stripped on the way out --
I actually managed to get the charset using Sérgio's approach, but I think your method is great too. I will add it as well, to be on the safe side. Thanks!
participants (3)
- David Shieh
- Ethan Jucovy
- Sergio Monteiro Basto