how to detect the character encoding in a web page ?

Albert van der Horst albert at
Mon Jan 14 13:50:23 CET 2013

In article <roy-DF05DA.11460324122012 at>,
Roy Smith  <roy at> wrote:
>In article <rn%Bs.693798$nB6.605938 at fx21.am4>,
> Alister <alister.ware at> wrote:
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>> personally I would start by checking the doc type & then the meta data as
>> these should be quick & correct, I then use chardet only if these
>> fail to provide any result.
>I agree that checking the metadata is the right thing to do.  But, I
>wouldn't go so far as to assume it will always be correct.  There's a
>lot of crap out there with perfectly formed metadata which just happens
>to be wrong.
>Although it pains me greatly to quote Ronald Reagan as a source of
>wisdom, I have to admit he got it right with "Trust, but verify".  It's

Not surprisingly, as an actor, Reagan was as good as his script.
This one he got from Stalin.

>the only way to survive in the unicode world.  Write defensive code.
>Wrap try blocks around calls that might raise exceptions if the external
>data is borked w/r/t what the metadata claims it should be.

The way to go, of course.
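
For what it's worth, here is a minimal sketch of that "trust, but verify"
idea, using only the standard library. The function name decode_page and
the fallback order (declared encoding, then UTF-8, then Latin-1) are my own
assumptions for illustration, not something anybody in this thread
prescribed.

```python
def decode_page(raw, declared=None):
    """Return (text, encoding_used) for the raw bytes of a fetched page.

    `declared` is whatever the HTTP header or meta tag claimed, or None.
    Hypothetical helper: the name and the fallback order are assumptions.
    """
    candidates = []
    if declared:
        candidates.append(declared)   # trust the metadata first...
    candidates.append("utf-8")        # ...then try the most common encoding

    for name in candidates:
        try:
            # ...but verify: strict decoding raises if the bytes disagree.
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes do not match the claim.
            # LookupError: the claimed codec name is unknown to Python.
            continue

    # Latin-1 maps every byte to a code point, so this decode cannot fail.
    # The result may still be the wrong text; it is only a last resort.
    return raw.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    # The page claims ASCII but actually contains UTF-8 bytes.
    print(decode_page("café".encode("utf-8"), "ascii"))
```

Note that this only catches metadata that fails outright. A wrong claim
that happens to decode without error (Latin-1 for a UTF-8 page, say) still
slips through, and that is where a statistical detector such as chardet
could be tried before the last resort.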

Groetjes Albert
Economic growth -- being exponential -- ultimately falters.
albert at spe&ar& &=n
