[Python-Dev] Encoding detection in the standard library?

Stephen J. Turnbull stephen at xemacs.org
Wed Apr 23 06:56:47 CEST 2008


"Martin v. Löwis" writes:

 > In any case, I'm very skeptical that a general "guess encoding"
 > module would do a meaningful thing when applied to incorrectly
 > encoded HTML pages.

That depends on whether you can get meaningful information about the
language from the fact that you're looking at the page.  In the
browser context, for one, 99.44% of users are monolingual, so you only
have to distinguish among the encodings for their language.  In this
context a two-stage process of determining a category of encoding
(e.g., ISO 8859, ISO 2022 7-bit, ISO 2022 8-bit multibyte, or UTF-8),
and then picking an encoding from the category according to a
user-specified configuration, has served Emacs/MULE users very well
for about 20 years.
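
A minimal sketch of such a two-stage guess, in Python.  This is not
how Emacs/MULE implements it; the category names, the PREFERENCES
table, and the functions classify() and guess_encoding() are all
hypothetical, and the byte tests are deliberately crude (8-bit
single-byte and 8-bit multibyte encodings are lumped into one "8bit"
category).

```python
def classify(data: bytes) -> str:
    """Stage 1: guess a category of encoding from byte patterns alone."""
    seven_bit = all(b < 0x80 for b in data)
    if seven_bit and b"\x1b" in data:
        # Escape sequences in 7-bit data suggest ISO 2022 7-bit.
        return "iso-2022-7bit"
    if seven_bit:
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit encoding; the bytes alone do not say which one.
        return "8bit"


# Hypothetical user configuration: for each language, the encoding to
# assume within each ambiguous category.
PREFERENCES = {
    "ja": {"iso-2022-7bit": "iso2022_jp", "8bit": "euc_jp"},
    "pl": {"8bit": "iso8859_2"},
    "de": {"8bit": "iso8859_1"},
}


def guess_encoding(data: bytes, language: str):
    """Stage 2: pick an encoding from the category using the user's language."""
    category = classify(data)
    if category in ("ascii", "utf-8"):
        return category
    # None means the configuration has no answer for this category.
    return PREFERENCES.get(language, {}).get(category)


polish = "Zażółć".encode("iso8859_2")
print(guess_encoding(polish, "pl"))
print(guess_encoding("日本語".encode("iso2022_jp"), "ja"))
```

The point of the split is that stage 1 needs no knowledge of the
user, while stage 2 needs no statistics about the text: it is a plain
table lookup driven by configuration.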

It does *not* work in a context where multiple encodings from the same
category are in use (e.g., the email folder of a Polish Gastarbeiter in
Berlin).
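
A small illustration of why, assuming the two encodings in play are
ISO 8859-2 (Polish) and ISO 8859-1 (German); the variable names are
only for this example.  Every byte string is valid in both, so
nothing at the byte level says which one was meant:

```python
# A Polish word encoded as ISO 8859-2.
data = "łódź".encode("iso8859_2")

# Both decodes succeed; only one of them is what the sender wrote.
as_polish = data.decode("iso8859_2")
as_german = data.decode("iso8859_1")

print(as_polish == "łódź")
print(as_german == "łódź")
```

A single per-user preference for the ISO 8859 category can be right
for only one of the two kinds of mail in that folder.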

Nonetheless it is pretty useful for user agents like mail clients, web
browsers, and editors.

