
Hi all,

I have used lxml for a long time and it has worked fine for me, but now I am confused about charset handling. To find the original charset of an HTML file, I used the code below:

    file_content = ''.join(
        [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
    )
    html = lxml.html.fromstring(file_content)
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)

Surprisingly, the output contains no <meta http-equiv="Content-Type" .. /> element. So how can I find out the original charset of this HTML?

BTW, I also tried to get the charset with urllib2, using the code below:

    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print e.code
    else:
        print response.headers.getheader('Content-Type')

Not every site returns its charset; some return no charset information at all. What should I do if I really need to know the charset?

Thanks, guys.

Best wishes,
David

--
Attitude determines everything!

On Sun, 2010-03-28 at 12:09 +0800, David Shieh wrote:
Try:

    html.xpath('.//meta[@http-equiv="Content-Type"]/@content')

I don't know whether this also matches a lower-case "content-type". If it does not, use an EXSLT regular expression with the case-insensitive flag:

    html.xpath('.//meta[re:test(@http-equiv, "^Content-Type$", "i")]',
               namespaces={"re": "http://exslt.org/regular-expressions"})

-- Sérgio M. B.
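
For reference, the two lookups discussed in this thread (the HTTP Content-Type header and the <meta> element) can be combined roughly as follows. This is only a sketch, written for Python 3 rather than the Python 2 used above. The helper names charset_from_header and charset_from_meta are made up for illustration and are not part of lxml, and it assumes the page is passed in as raw bytes, without the line-joining step from the original post.

```python
from email.message import Message

import lxml.html

# Namespace mapping that enables EXSLT regular expressions in lxml XPath.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value; None if no charset given.
    return msg.get_content_charset()


def charset_from_meta(raw_bytes):
    """Return the charset declared in a <meta> element, or None."""
    doc = lxml.html.fromstring(raw_bytes)
    # Short form: <meta charset="...">
    for value in doc.xpath(".//meta/@charset"):
        return value.strip().lower()
    # Long form: <meta http-equiv="Content-Type" content="...; charset=...">
    # The "i" flag makes the http-equiv comparison case-insensitive.
    contents = doc.xpath(
        './/meta[re:test(@http-equiv, "^content-type$", "i")]/@content',
        namespaces=EXSLT_RE,
    )
    for content in contents:
        found = charset_from_header(content)
        if found:
            return found
    return None
```

A caller could try charset_from_header() on the response header first and fall back to charset_from_meta() on the body when the server sends no charset. The sketch reads attribute values directly instead of printing lxml.html.tostring(element): tostring() takes an include_meta_content_type argument that defaults to False and removes the <meta http-equiv="Content-Type" ...> tag from its output, which may be why nothing was printed in the original loop.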

participants (3)
- David Shieh
- Ethan Jucovy
- Sergio Monteiro Basto