[Python-Dev] Encoding detection in the standard library?
Bill Janssen
janssen at parc.com
Tue Apr 22 18:33:18 CEST 2008
The 2002 paper "A language and character set determination method
based on N-gram statistics" by Izumi Suzuki and Yoshiki Mikami and
Ario Ohsato and Yoshihide Chubachi seems to me a pretty good way to go
about this. They're looking at "LSEs", language-script-encoding
triples, where a "script" is a way of using a particular character
set to write in a particular language.
Their system has these requirements:
R1. the response must be either "correct answer" or "unable to
detect", where "unable to detect" covers anything outside the
registered set of LSEs;
R2. it must be applicable to multi-LSE texts;
R3. it must never accept a wrong answer, even when the program does
not have enough data on an LSE; and
R4. it must be applicable to any LSE text.
So, no wrong answers.
The biggest disadvantage would seem to be that the registration data
for a particular LSE is fairly bulky: on the order of 10,000
shift-codons of three bytes each, or about 30K uncompressed.
http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf
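
Just to make the shape of that concrete, here's a rough Python sketch
of a detector with the "answer or abstain" behaviour. Everything in it
(the LSE_PROFILES table, the coverage score, the 0.95 threshold) is
invented for illustration; it's not the paper's actual algorithm, just
the general idea of matching byte trigrams against registered per-LSE
profiles.

from collections import Counter

# Hypothetical registration data: each LSE maps to the set of byte
# trigrams ("shift-codons") seen in training text for that triple.
LSE_PROFILES = {
    ("ja", "kanji-kana", "shift_jis"): frozenset(),  # ~10,000 trigrams each in practice
    ("ru", "cyrillic", "koi8-r"): frozenset(),
    # ... one entry per registered LSE
}

def byte_trigrams(data):
    """Yield overlapping 3-byte sequences from the raw input."""
    for i in range(len(data) - 2):
        yield data[i:i + 3]

def detect_lse(data, min_coverage=0.95):
    """Return the single matching LSE, or None for "unable to detect".

    Coverage is the fraction of the text's trigrams found in a profile;
    demanding near-total coverage and a unique winner is a crude
    stand-in for the paper's "never accept a wrong answer" rule.
    """
    trigrams = Counter(byte_trigrams(data))
    total = sum(trigrams.values())
    if total == 0:
        return None
    candidates = []
    for lse, profile in LSE_PROFILES.items():
        covered = sum(n for tg, n in trigrams.items() if tg in profile)
        if covered / total >= min_coverage:
            candidates.append(lse)
    # Answer only when exactly one registered LSE fits well enough (R1, R3).
    return candidates[0] if len(candidates) == 1 else None

The point is mostly R1 and R3: when the evidence is thin or ambiguous,
the right return value is None, not a guess.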
Bill
> > IMHO, more research has to be done in this area before a
> > "standard" module can be added to Python's stdlib... and
> > who knows, perhaps we'll be lucky and by that time everyone
> > will be using UTF-8 anyway :-)
>
> I walked over to our computational linguistics group and asked. This
> is often combined with language guessing (which uses a similar
> approach, but using characters instead of bytes), and apparently can
> usually be done with high confidence. Of course, they're usually
> looking at clean texts, not random "stuff". I'll see if I can get
> some references and report back -- most of the research on this was
> done in the '90s.
>
> Bill
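
For what it's worth, the character-level language guessing mentioned
above has much the same shape once the bytes have already been
decoded. The sketch below is purely illustrative; the profiles, the
overlap score, and the cutoff are all made up, not anything the
linguistics folks actually use.

from collections import Counter

def char_trigrams(text):
    """Count overlapping 3-character sequences in already-decoded text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def guess_language(text, profiles, min_score=0.5):
    """Return the best-matching language, or None if nothing is convincing.

    `profiles` maps a language name to a set of trigrams seen in
    training text for that language (hypothetical data).
    """
    grams = char_trigrams(text)
    total = sum(grams.values()) or 1
    best_lang, best_score = None, 0.0
    for lang, profile in profiles.items():
        overlap = sum(n for g, n in grams.items() if g in profile)
        if overlap / total > best_score:
            best_lang, best_score = lang, overlap / total
    return best_lang if best_score >= min_score else None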