[Python-Dev] Encoding detection in the standard library?

Bill Janssen janssen at parc.com
Tue Apr 22 17:14:43 CEST 2008


> IMHO, more research has to be done into this area before a
> "standard" module can be added to Python's stdlib... and
> who knows, perhaps we're lucky and by the time everyone is
> using UTF-8 anyway :-)

I walked over to our computational linguistics group and asked.
Encoding detection is often combined with language guessing (which
uses a similar approach, but works on characters instead of bytes),
and apparently it can usually be done with high confidence.  Of
course, they're usually looking at clean texts, not random "stuff".
I'll see if I can get some references and report back -- most of the
research on this was done in the 1990s.

Bill

