[Python-Dev] Encoding detection in the standard library?

Bill Janssen janssen at parc.com
Tue Apr 22 18:33:18 CEST 2008


The 2002 paper "A language and character set determination method
based on N-gram statistics" by Izumi Suzuki, Yoshiki Mikami, Ario
Ohsato, and Yoshihide Chubachi seems to me a pretty good way to go
about this.  They work with "LSE"s, language-script-encoding
triples; a "script" here is a way of using a particular character
set to write a particular language.
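
To make the mechanics concrete, here's a toy sketch (my own names and
data layout, not the paper's actual algorithm or format): register a
set of byte trigrams, the "shift-codons", per LSE, and only answer
when the sample is consistent with exactly one registration.

    from collections import Counter

    def byte_trigrams(sample):
        """Count the overlapping 3-byte sequences ("shift-codons") in a sample."""
        return Counter(sample[i:i+3] for i in range(len(sample) - 2))

    def guess_lse(sample, registrations):
        """Return the single registered LSE whose codon set covers the
        sample, or None for "unable to detect".  `registrations` maps an
        LSE label (e.g. "Japanese / Kanji / EUC-JP") to its set of
        3-byte codons."""
        seen = set(byte_trigrams(sample))
        candidates = [lse for lse, codons in registrations.items()
                      if seen <= codons]
        if len(candidates) == 1:
            return candidates[0]
        return None   # zero or several matches: refuse to guess rather than guess wrong

The real method is statistical rather than an exact subset test, of
course, but the shape of the interface is the interesting part.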

Their system has these requirements:

R1. the response must be either "correct answer" or "unable to detect",
    where "unable to detect" includes "other than registered" [i.e.,
    not in the registered set of LSEs];

R2. applicable to multi-LSE texts;

R3. never accept a wrong answer, even when the program does not have
    enough data on an LSE; and

R4. applicable to any LSE text.

So, no wrong answers.
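
In stdlib terms, R1 and R3 amount to an API that never hands back a
low-confidence guess: you get a definite answer or an explicit
refusal.  A hypothetical wrapper around the toy matcher above:

    class UnableToDetect(Exception):
        """No registered LSE matches the sample unambiguously."""

    def detect_lse(sample, registrations):
        # Callers must handle "don't know" explicitly; there is no
        # confidence score to quietly trust.
        lse = guess_lse(sample, registrations)
        if lse is None:
            raise UnableToDetect("ambiguous or other than registered")
        return lse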

The biggest disadvantage would seem to be that the registration data
for a particular LSE is kind of bulky: on the order of 10,000
shift-codons of three bytes each, about 30 KB uncompressed.
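
For scale (my arithmetic, from the figures above): 10,000 codons at
three bytes each is 30,000 bytes per LSE, and in Python the natural
in-memory form is just a set of 3-byte keys.  A hypothetical loader:

    def load_registration(blob):
        """Unpack a blob of concatenated 3-byte shift-codons into a set.

        Hypothetical storage format: 10,000 codons * 3 bytes = 30,000
        bytes (~30 KB) per LSE before compression."""
        if len(blob) % 3:
            raise ValueError("registration blob is not a multiple of 3 bytes")
        return {blob[i:i+3] for i in range(0, len(blob), 3)}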

http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf

Bill

> > IMHO, more research has to be done in this area before a
> > "standard" module can be added to Python's stdlib... and
> > who knows, perhaps we're lucky and by that time everyone is
> > using UTF-8 anyway :-)
> 
> I walked over to our computational linguistics group and asked.  This
> is often combined with language guessing (which uses a similar
> approach, but with characters instead of bytes), and apparently can
> usually be done with high confidence.  Of course, they're usually
> looking at clean texts, not random "stuff".  I'll see if I can get
> some references and report back -- most of the research on this was
> done in the '90s.
> 
> Bill

