Hi all,
I have been using lxml for a long time and it works fine for me.
But now I am confused about charset handling. To get the original
charset of an HTML file, I used the code below:
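import lxml.html

# `response` is the object returned by urllib2.urlopen().
# Strip surrounding whitespace from every line and join them into one string.
file_content = ''.join(
    [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
)
html = lxml.html.fromstring(file_content)
# Print every <meta> element found directly under <head>.
for i in html.xpath('head/meta'):
    print lxml.html.tostring(i)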
Surprisingly, there is no output for any <meta http-equiv="Content-Type" .. />
element. So how can I find out the original charset of this HTML?
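One thing that might be worth trying, as a rough sketch and not anything
lxml itself provides: look for the charset declaration in the raw bytes
before they are parsed at all. The function name sniff_meta_charset, the
regular expression, and the 4096-byte search limit are all assumptions made
up for illustration, and a regex like this will miss unusual markup.

```python
import re

# Matches both <meta http-equiv="Content-Type" content="...; charset=X">
# and the shorter <meta charset="X"> form. Illustrative pattern only.
_META_CHARSET_RE = re.compile(
    br'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw_bytes, limit=4096):
    """Return the charset named in a <meta> tag, lowercased, or None.

    Only the first `limit` bytes are searched (an arbitrary choice),
    since the declaration normally sits near the top of <head>.
    """
    match = _META_CHARSET_RE.search(raw_bytes[:limit])
    if match is None:
        return None
    return match.group(1).decode('ascii').lower()
```

It would be called on the undecoded body, for example
sniff_meta_charset(response.read()), before handing the content to lxml.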
BTW, I also tried to get the charset with urllib2, using the code below:
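import urllib2

req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # The server answered with an error status; show the HTTP code.
    print e.code
else:
    # Typically something like "text/html; charset=utf-8".
    print response.headers.getheader('Content-Type')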
Not every site returns its charset; some sites don't return any charset
information at all.
What should I do if I really want to know the charset?
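For the header case, here is a minimal sketch of pulling the charset
parameter out of a Content-Type value, assuming the value is available as a
plain string (or None when the header is missing). The helper name
charset_from_content_type is hypothetical; it relies only on the standard
library's email.message.Message.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value.

    Returns None when the header is missing or carries no charset.
    """
    if not value:
        return None
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset() returns None if there is no charset parameter.
    return msg.get_content_charset()
```

When this returns None, the caller still has to fall back to something
else, such as a declaration inside the document or a guessed default.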
Thanks, guys.
Best wishes,
David