[Python-Dev] Encoding detection in the standard library?

M.-A. Lemburg mal at egenix.com
Tue Apr 22 12:31:34 CEST 2008

On 2008-04-21 23:31, Martin v. Löwis wrote:
>> This is useful when you get a hunk of data which _should_ be some  
>> sort of intelligible text from the Big Scary Internet (say, a posted  
>> web form or email message), and you want to do something useful with  
>> it (say, search the content).
> I don't think that should be part of the standard library. People
> will mistake what it tells them for certain.


I also think that it's better to educate people to add (correct)
encoding information to their text data, rather than give them a
guess mechanism...


chardet is based on the Mozilla algorithm and at least in
my experience that algorithm doesn't work too well.

The Mozilla algorithm may work for Asian encodings due to the fact
that those encodings are usually also bound to a specific language
(and you can then use character and word frequency analysis), but
for encodings which can encode far more than just a single language
(e.g. UTF-8 or Latin-1), the correct detection rate is rather low.

The problem becomes even more difficult when leaving
the normal text domain or when mixing languages in the same
text, e.g. when trying to detect the encoding of source code
with comments written in a non-ASCII encoding.

The "trick" to just pass the text through a codec and see whether
it roundtrips also doesn't necessarily help: Latin-1, for example,
will always round-trip, since it maps every one of the 256 possible
byte values directly to a Unicode code point.
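
To illustrate the round-trip point, here is a minimal sketch. The
helper name roundtrips() is made up for this example and is not part
of any library; it only uses the built-in bytes.decode() and
str.encode() methods.

```python
def roundtrips(data: bytes, encoding: str) -> bool:
    """Return True if data decodes with `encoding` and re-encodes to the same bytes.

    Hypothetical helper for illustration only.
    """
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        return False


# Every possible byte value, so the check cannot pass by luck.
all_bytes = bytes(range(256))

# Text that was actually written as UTF-8.
utf8_data = "Löwis".encode("utf-8")

# Latin-1 accepts arbitrary bytes, so a successful round-trip proves nothing.
latin1_accepts_everything = roundtrips(all_bytes, "latin-1")
latin1_accepts_utf8 = roundtrips(utf8_data, "latin-1")

# The round-trip succeeded, but the decoded text is not what was written.
misread = utf8_data.decode("latin-1")

# A stricter codec such as UTF-8 does reject bytes that were written as Latin-1.
utf8_rejects_latin1 = roundtrips("Löwis".encode("latin-1"), "utf-8")
```

Here both Latin-1 checks come out True even though the second input
is UTF-8, and misread holds mojibake rather than the original name.
A failed round-trip can rule an encoding out, but a successful one
does not confirm it.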

IMHO, more research has to be done in this area before a
"standard" module can be added to Python's stdlib... and
who knows, perhaps we're lucky and by that time everyone is
using UTF-8 anyway :-)

Marc-Andre Lemburg

