how to detect the character encoding in a web page ?

Kwpolska kwpolska at
Mon Dec 24 13:16:16 CET 2012

On Mon, Dec 24, 2012 at 9:34 AM, Kurt Mueller
<kurt.alfred.mueller at> wrote:
> $ wget -q -O - |
> stdin: ISO-8859-2 with confidence 0.803579722043
> $

And it sucks, because it relies on magic (guessing from the raw bytes)
instead of reading the HTML tags.  The RIGHT thing to do for websites
is to detect the meta charset definition, which is one of:

    <meta http-equiv="content-type" content="text/html; charset=utf-8">


    <meta charset="utf-8">

The second one is for HTML5 websites, and both may require case
conversion and handling of the useless ` /` at the end.  But if
somebody is using HTML5, you are pretty much guaranteed to get UTF-8.
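
A rough sketch of that lookup, using only the standard library's
html.parser.  The names MetaCharsetParser and detect_meta_charset, the
2048-byte window and the UTF-8 fallback are my own assumptions for
illustration, not an established recipe:

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared in a <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, and also routes
        # self-closing tags (<meta ... />) through this method.
        if tag != "meta" or self.charset is not None:
            return
        attrs = {name: (value or "") for name, value in attrs}
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].strip().lower()
        elif attrs.get("http-equiv", "").lower() == "content-type":
            # Older form: content="text/html; charset=utf-8"
            for part in attrs.get("content", "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def detect_meta_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    parser = MetaCharsetParser()
    # The declaration itself is ASCII, so a lossy ASCII decode of the
    # first couple of kilobytes (an assumed window) is enough to find it.
    parser.feed(raw[:2048].decode("ascii", errors="replace"))
    return parser.charset or default
```

Feed it the raw bytes of the page, then decode the page with whatever
it returns.  The parser takes care of the case conversion and the
trailing ` /` mentioned above.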

In today’s world, the proper assumption to make is “UTF-8 or GTFO”,
because nobody in their right mind would use something else today.

Kwpolska <>
stop html mail      | always bottom-post |
