how to detect the character encoding in a web page ?
Chris Angelico
rosuav at gmail.com
Wed Jun 5 13:55:11 EDT 2013
On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-cold at 163.com> wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding in a web page ?
>>
>> such as this page
>>
>> http://python.org/
>
> By the way, we cannot get the character encoding programmatically from the meta data without knowing the character encoding ahead of time!
The rules for web pages are (massively oversimplified):
1) HTTP header
2) ASCII-compatible encoding and meta tag
The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and
start parsing. Once you find a meta tag, you stop parsing and go back
to the top, decoding in the new way. "ASCII-compatible" covers a huge
number of encodings, so it's not actually much of a problem to do
this.
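
As a rough illustration of those two rules, here is a minimal Python
sketch. The function names (detect_encoding, decode_page), the regular
expressions, and the 1024-byte scan window are all assumptions made for
this example, not part of any library, and real browsers follow a far
more detailed algorithm.

```python
import re

# Hypothetical helpers for illustration only; the patterns are simplified.
_CHARSET_IN_HEADER = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)
_CHARSET_IN_META = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)',
                              re.IGNORECASE)


def detect_encoding(content_type, body, default="ascii"):
    """Return the declared encoding name, checking the HTTP header first."""
    # 1) HTTP header: out of band, so nothing has to be decoded to read it.
    if content_type:
        match = _CHARSET_IN_HEADER.search(content_type)
        if match:
            return match.group(1).lower()
    # 2) Meta tag: scan the raw bytes as if they were ASCII. This works for
    #    any ASCII-compatible encoding, since the tag itself is plain ASCII.
    #    The 1024-byte window is an arbitrary choice for this sketch.
    match = _CHARSET_IN_META.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def decode_page(content_type, body):
    """Decode the page bytes using whatever encoding was declared."""
    encoding = detect_encoding(content_type, body)
    try:
        return body.decode(encoding, errors="replace")
    except LookupError:
        # Unknown encoding name: fall back to ASCII with replacement.
        return body.decode("ascii", errors="replace")
```

Pass in the Content-Type header value (or None) and the raw response
bytes, e.g. decode_page("text/html", raw_bytes). The header is checked
first because it does not depend on the body's encoding at all.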
ChrisA