How to detect the character encoding in a web page?
steve+comp.lang.python at pearwood.info
Mon Dec 24 14:50:39 CET 2012
On Mon, 24 Dec 2012 13:16:16 +0100, Kwpolska wrote:
> On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
> <kurt.alfred.mueller at gmail.com> wrote:
>> $ wget -q -O - http://python.org/ | chardetect.py
>> stdin: ISO-8859-2 with confidence 0.803579722043
>> $
> And it sucks, because it uses magic, and not reading the HTML tags. The
> RIGHT thing to do for websites is detect the meta charset definition,
> which is
> <meta http-equiv="content-type" content="text/html; charset=utf-8">
> <meta charset="utf-8">
> The second one for HTML5 websites, and both may require case conversion
> and the useless ` /` at the end. But if somebody is using HTML5, you
> are pretty much guaranteed to get UTF-8.
> In today’s world, the proper assumption to make is “UTF-8 or GTFO”.
> Because nobody in the right mind would use something else today.
Alas, there are many, many, many, MANY websites that are created by
people who are *not* in their right mind. To say nothing of 15-year-old
websites that use a legacy encoding. And to support those, you may need
to guess the encoding, and for that, chardetect.py is the solution.
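
For what it's worth, the two approaches combine naturally: trust a declared
charset when the page has one, and guess only when it doesn't. Here is a
rough sketch under some assumptions of mine: the names declared_charset and
detect_encoding are made up for illustration, the regex is a shortcut rather
than a real HTML parser, and only the first 4096 bytes are scanned. The
guesser is passed in as a plain callable so the sketch itself needs nothing
beyond the standard library.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="content-type" content="text/html; charset=...">,
# in any letter case, with or without quotes or a trailing " /".
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw, scan_bytes=4096):
    """Return the charset named in a <meta> tag (lowercased), or None."""
    match = _META_CHARSET.search(raw[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def detect_encoding(raw, guess=None, default="utf-8"):
    """Prefer the declared charset, then a guesser, then a default.

    `guess` is a hypothetical hook: any callable that takes the raw bytes
    and returns an encoding name or None.
    """
    declared = declared_charset(raw)
    if declared:
        return declared
    if guess is not None:
        guessed = guess(raw)
        if guessed:
            return guessed
    return default
```

With the chardet package installed, something like
detect_encoding(raw, guess=lambda b: chardet.detect(b)["encoding"]) would
plug the guesser in. Note that a declared charset can itself be wrong, so
decoding with errors="replace" is a reasonable precaution.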