<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 12pt;
font-family:Calibri
}
--></style></head>
<body class='hmmessage'><div dir='ltr'>Try this:<br><br>### get_charset.py ###<br>import re<br>import urllib2<br><br>def get_charset(url):<br>    resp = urllib2.urlopen(url)<br><br>    # Charset from the HTTP headers; [\w-]+ stops at the end of the charset name<br>    headers = ''.join(resp.headers.headers)<br>    charset_from_header_list = re.findall(r'charset=["\']?([\w-]+)', headers, re.I)<br>    charset_from_header = charset_from_header_list[-1] if charset_from_header_list else ''<br><br>    # Charset from the Content-Type meta declaration in the HTML<br>    html = resp.read()<br>    charset_from_html_list = re.findall(r'Content-Type.*?charset=["\']?([\w-]+)', html, re.I)<br>    charset_from_html = charset_from_html_list[-1] if charset_from_html_list else ''<br><br>    # Prefer the charset declared in the HTML; fall back to the header<br>    return charset_from_html if charset_from_html else charset_from_header<br><br><br><br><br><div>> Date: Sun, 9 Jun 2013 04:47:02 -0700<br>> Subject: Re: how to detect the character encoding in a web page ?<br>> From: redstone-cold@163.com<br>> To: python-list@python.org<br>> <br>> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:<br>> > how to detect the character encoding in a web page ?<br>> > <br>> > such as this page <br>> > <br>> > <br>> > <br>> > http://python.org/<br>> <br>> Finally, I found that by using PyQt’s QTextStream, QTextCodec and chardet, we can decode a web page's source more reliably, <br>> even for this bad page<br>> http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html <br>> <br>> this script <br>> http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==<br>> <br>> and this page, which declares no charset in its source code <br>> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx<br>> <br>> <br>> from PyQt4.QtCore import *<br>> from PyQt4.QtGui import *<br>> from PyQt4.QtNetwork import *<br>> import sys<br>> import chardet<br>> <br>> def slotSourceDownloaded(reply):<br>>     redirctLocation = reply.header(QNetworkRequest.LocationHeader)<br>>     redirctLocationUrl = reply.url() if not redirctLocation else redirctLocation<br>> 
# print(redirctLocationUrl, reply.header(QNetworkRequest.ContentTypeHeader))<br>> <br>>     if reply.error() != QNetworkReply.NoError:<br>>         print('error:', reply.errorString())<br>>         # clean up and stop the event loop on failure too, so the script does not hang<br>>         reply.deleteLater()<br>>         QCoreApplication.quit()<br>>         return<br>> <br>>     pageCode = reply.readAll()<br>>     charCodecInfo = chardet.detect(pageCode.data())<br>> <br>>     # chardet may report no encoding (None); fall back to UTF-8 as the default codec<br>>     defaultCodec = QTextCodec.codecForName(charCodecInfo['encoding'] or 'utf-8')<br>>     textStream = QTextStream(pageCode)<br>>     # codecForHtml uses the charset declared in the page and otherwise the default codec<br>>     codec = QTextCodec.codecForHtml(pageCode, defaultCodec)<br>>     textStream.setCodec(codec)<br>>     content = textStream.readAll()<br>>     print(content)<br>> <br>>     if content == '':<br>>         print('---------', 'cannot find any resource !')<br>> <br>>     reply.deleteLater()<br>>     QCoreApplication.quit()<br>> <br>> <br>> if __name__ == '__main__':<br>>     app = QCoreApplication(sys.argv)<br>>     manager = QNetworkAccessManager()<br>>     # connect the handler before issuing the request<br>>     manager.finished.connect(slotSourceDownloaded)<br>>     url = input('input url :')<br>>     request = QNetworkRequest(QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))<br>>     request.setRawHeader("User-Agent", 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')<br>>     manager.get(request)<br>>     sys.exit(app.exec_())<br>> -- <br>> http://mail.python.org/mailman/listinfo/python-list<br></div> </div></body>
</html>