how to detect the character encoding in a web page?
Albert van der Horst
albert at spenarnc.xs4all.nl
Mon Jan 14 13:50:23 CET 2013
In article <roy-DF05DA.11460324122012 at news.panix.com>,
Roy Smith <roy at panix.com> wrote:
>In article <rn%Bs.693798$nB6.605938 at fx21.am4>,
> Alister <alister.ware at ntlworld.com> wrote:
>> Indeed due to the poor quality of most websites it is not possible to be
>> 100% accurate for all sites.
>> Personally I would start by checking the doc type & then the metadata as
>> these should be quick & correct; I then use chardet only if these
>> fail to provide any result.
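The order suggested above can be sketched roughly as follows: trust the declared charset first, and fall back to detection only when nothing is declared. The helper names are illustrative, `chardet` is a third-party package, and the 1024-byte prescan mirrors what browsers do; this is a sketch, not a full HTML encoding sniffer.

```python
import re

def declared_encoding(raw):
    """Look for a charset declaration in the first 1024 bytes of HTML.

    Matches both <meta charset="..."> and the older
    <meta http-equiv="Content-Type" content="...; charset=..."> form.
    Returns the encoding name as a string, or None if none is declared.
    """
    head = raw[:1024]
    m = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return m.group(1).decode('ascii') if m else None

def sniff_encoding(raw):
    """Declared charset first; chardet only as an optional fallback."""
    enc = declared_encoding(raw)
    if enc:
        return enc
    try:
        import chardet  # third-party; may not be installed
        return chardet.detect(raw)['encoding']
    except ImportError:
        return None
```

For example, `sniff_encoding(b'<meta charset="koi8-r">')` returns `'koi8-r'` without ever invoking chardet.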
>I agree that checking the metadata is the right thing to do. But, I
>wouldn't go so far as to assume it will always be correct. There's a
>lot of crap out there with perfectly formed metadata which just happens
>to be wrong.
>Although it pains me greatly to quote Ronald Reagan as a source of
>wisdom, I have to admit he got it right with "Trust, but verify". It's
Not surprisingly, as an actor, Reagan was as good as his script.
This one he got from Stalin.
>the only way to survive in the unicode world. Write defensive code.
>Wrap try blocks around calls that might raise exceptions if the external
>data is borked w/r/t what the metadata claims it should be.
The way to go, of course.
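A minimal sketch of the defensive decoding recommended above: wrap the decode in a try block, attempt the encoding the metadata claims, and fall back when the bytes are borked w/r/t that claim. The function name and fallback order are my own choices, not anything from the thread.

```python
def decode_defensively(raw, claimed=None):
    """Decode bytes, trying the claimed encoding first, then fallbacks.

    Returns a (text, encoding_used) pair. LookupError is caught as well,
    since metadata sometimes names an encoding Python does not know.
    """
    for enc in (claimed, 'utf-8', 'latin-1'):
        if not enc:
            continue
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 maps every byte, so in practice we never reach this line
    return raw.decode('latin-1', 'replace'), 'latin-1'
```

If a page claims UTF-8 but contains a stray Latin-1 byte, `decode_defensively(b'caf\xe9', 'utf-8')` survives the broken claim and falls back to `('caf\xe9', 'latin-1')` instead of raising.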
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert at spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst