How to detect the character encoding in a web page?
roy at panix.com
Mon Dec 24 17:46:03 CET 2012
In article <rn%Bs.693798$nB6.605938 at fx21.am4>,
Alister <alister.ware at ntlworld.com> wrote:
> Indeed, due to the poor quality of most websites it is not possible to be
> 100% accurate for all sites.
> Personally I would start by checking the doctype & then the metadata, as
> these should be quick & correct; I then use chardet only if these
> fail to provide any result.
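To make the quoted approach concrete, here is a rough sketch of the "check the metadata first" step. The helper name and the regex are mine, not anything from a library; it only looks for the two common `<meta>` charset declarations in the raw bytes, and returns None when the page declares nothing:

```python
import re

# Hypothetical helper: pull the charset a page *claims* to use from its raw
# bytes, before any decoding.  Matches both <meta charset="..."> and the
# older <meta http-equiv="Content-Type" content="...; charset=..."> form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w.-]+)', re.IGNORECASE)

def declared_encoding(raw: bytes):
    """Return the encoding the page declares, or None if it declares nothing."""
    # Declarations are supposed to appear early in the document, so only
    # scanning the first couple of KB keeps this cheap.
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode('ascii', 'replace').lower() if match else None
```

Only if this (and the HTTP Content-Type header, not shown) comes back empty would you pay for a full chardet pass over the body.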
I agree that checking the metadata is the right thing to do. But, I
wouldn't go so far as to assume it will always be correct. There's a
lot of crap out there with perfectly formed metadata which just happens
to be wrong.
Although it pains me greatly to quote Ronald Reagan as a source of
wisdom, I have to admit he got it right with "Trust, but verify". It's
the only way to survive in the unicode world. Write defensive code.
Wrap try blocks around calls that might raise exceptions if the external
data is borked w/r/t what the metadata claims it should be.