how to detect the character encoding in a web page?
alister.ware at ntlworld.com
Mon Dec 24 17:27:03 CET 2012
On Mon, 24 Dec 2012 13:50:39 +0000, Steven D'Aprano wrote:
> On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
>> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
>> <kurt.alfred.mueller at gmail.com> wrote:
>>> $ wget -q -O - http://python.org/ | chardetect.py
>>> stdin: ISO-8859-2 with confidence 0.803579722043
>>> $
>> And it sucks, because it uses magic instead of reading the HTML tags.
>> The RIGHT thing to do for websites is to detect the meta charset
>> definition, which takes one of two forms:
>> <meta http-equiv="content-type" content="text/html; charset=utf-8">
>> <meta charset="utf-8">
>> The second one is for HTML5 websites, and both may require case
>> conversion and stripping the useless ` /` at the end. But if somebody
>> is using HTML5, you are pretty much guaranteed to get UTF-8.
>> In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
>> because nobody in their right mind would use anything else today.
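Extracting those two forms needs nothing beyond the standard library. A
rough, untested sketch using Python 3's html.parser; it only scans the
first 1024 bytes, which is where HTML5 requires the declaration to live:

from html.parser import HTMLParser

class CharsetSniffer(HTMLParser):
    """Remember the first charset declared by a <meta> tag."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        charset = attrs.get("charset")
        if charset:                                    # HTML5 form
            self.charset = charset.lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # legacy form: content="text/html; charset=utf-8"
            content = (attrs.get("content") or "").lower()
            _, _, rest = content.partition("charset=")
            if rest:
                self.charset = rest.split(";")[0].strip()

def sniff_charset(raw):
    # HTML5 requires the declaration within the first 1024 bytes;
    # Latin-1 decodes any byte sequence, so it is safe for scanning.
    sniffer = CharsetSniffer()
    sniffer.feed(raw[:1024].decode("latin-1"))
    return sniffer.charset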
> Alas, there are many, many, many, MANY websites that are created by
> people who are *not* in their right mind. To say nothing of 15-year-old
> websites that use a legacy encoding. And to support those, you may need
> to guess the encoding, and for that, chardetect.py is the solution.
Indeed, due to the poor quality of most websites, it is not possible to be
100% accurate for all sites.
Personally, I would start by checking the doctype & then the meta data, as
these should be quick & correct; I would then use chardetect only if those
fail to provide any result.
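In code, that order of attack might look something like this. A sketch
only: it assumes the third-party chardet package (the library behind
chardetect.py) and a sniff_charset() helper like the one above, and it
skips the doctype check:

import chardet

def guess_encoding(raw):
    declared = sniff_charset(raw)        # meta tags first: quick & usually right
    if declared:
        return declared
    guess = chardet.detect(raw)          # statistical guessing as a fallback
    if guess["encoding"] and guess["confidence"] >= 0.5:
        return guess["encoding"]
    return "utf-8"                       # last resort: today's sanest default

With that ordering, a page that declares its charset never reaches the
guessing stage, and chardet's sometimes-wrong magic only runs on the
genuinely undeclared pages.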
I have found little that is good about human beings. In my experience
most of them are trash.
-- Sigmund Freud