[Python-Dev] Encoding detection in the standard library?

Tue Apr 22 05:30:18 CEST 2008

David Wolever wrote:

> IMO, encoding estimation is something that
> many web programs will have to deal with,
> so it might as well be built in; I would prefer
> the option to run `text=input.encode('guess')`
> (or something similar) than relying on an external
> dependency or worse yet using a hand-rolled
> algorithm

The (still draft) html5 spec is trying to get error-correction
standardized, so it includes all sort of "if this fails, do X".
Encoding detection will be standardized, so there will be an external
standard that we can reference.

http://dev.w3.org/html5/spec/Overview.html#determining

Note that this portion of the spec is probably not stable yet, as
there was some new analysis on which "wrong" answers provided better
results on real world web pages.

e.g.,

http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014127.html

http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-March/014190.html

There was also a recent analysis of how many characters it takes to
sniff successfully X% of the time on today's web, though I'm not
finding it at the moment.

-jJ