On Sun, Mar 28, 2010 at 12:09 AM, David Shieh <mykingheaven@gmail.com> wrote:
Hi all,
I have used lxml for a long time and it works fine for me. But now I am confused about charset handling. To find the original charset of an HTML file, I used the code below:
    file_content = ''.join(
        [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
    )
    html = lxml.html.fromstring(file_content)
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)
Surprisingly, no <meta http-equiv="Content-Type" .. /> element is printed. So how can I find out the original charset of this HTML?
You need to pass the keyword argument `include_meta_content_type=True` to `tostring`; otherwise the <meta http-equiv="Content-Type" .. /> tag is stripped from the serialized output. The element itself is still in the parsed tree, as the session below shows:
    >>> from lxml.html import fromstring, tostring
    >>> x = fromstring("""<html><head><meta http-equiv="Content-Type" content="text/html; charset=ASCII"></head></html>""")
    >>> x.xpath("head/meta")
    [<Element meta at 2004bb0>]
    >>> [tostring(u) for u in x.xpath("head/meta")]
    ['']
    >>> [tostring(u, include_meta_content_type=True) for u in x.xpath("head/meta")]
    ['<meta http-equiv="Content-Type" content="text/html; charset=ASCII">']
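If all you want is the charset value, you may not need to serialize the element at all: since the <meta> element is present in the tree, its `content` attribute can be read directly. Below is a minimal sketch of that idea, written for Python 3. The helper `charset_from_content` is a hypothetical name introduced here for illustration (it is not part of lxml), and it assumes the attribute has the usual `text/html; charset=...` shape.

```python
from lxml.html import fromstring, tostring

doc = fromstring(
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=ASCII"></head></html>'
)

# The element is in the tree even though tostring() hides it by default.
meta = doc.xpath("head/meta")[0]

# Default serialization drops the Content-Type <meta> tag ...
stripped = tostring(meta, encoding="unicode")
# ... and include_meta_content_type=True keeps it.
kept = tostring(meta, include_meta_content_type=True, encoding="unicode")


def charset_from_content(value):
    """Hypothetical helper: pull the charset parameter out of a
    Content-Type style string, or return None if there is none."""
    for part in value.split(";"):
        name, _, val = part.strip().partition("=")
        if name.strip().lower() == "charset":
            return val.strip()
    return None


# Read the attribute straight from the element instead of re-parsing
# the serialized markup.
charset = charset_from_content(meta.get("content", ""))

print(repr(stripped))
print(kept)
print(charset)
```

This only reports what the document declares about itself; the HTTP response headers or the actual bytes may disagree with the declared charset.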