Re: [Python-Dev] Encoding detection in the standard library?

22 Apr 2008

      [CCing python-dev again]

On 2008-04-22 12:38, Greg Wilson wrote:
...
...
...
>> I don't think that should be part of the standard library. People
>> will mistake what it tells them for certain.
> [etc]
> These are all good arguments, but the fact remains that we can't control
> our inputs (e.g., we're archiving mail messages sent to lists managed by
> DrProject), and some of those inputs *don't* tell us how they're encoded.
> Under those circumstances, what would you recommend?

I haven't done much research into this, but in general, I think it's
better to:

  * first try to look at other characteristics of a text
    message, e.g. language, origin, topic, etc.,

  * then narrow down the number of encodings which could apply,

  * rank them to try to avoid ambiguities and

  * then try to see what percentage of the text you can decode using
    each of the encodings in reverse ranking order (i.e., more specialized
    encodings should be tested first, latin-1 last).
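
As a rough illustration of the last two steps, here is a minimal Python
sketch, not a tested or recommended implementation. The function names
(decodable_fraction, guess_encoding) are hypothetical, and the candidate
list is assumed to have been narrowed down and ordered already (most
specialized first, latin-1 last) using whatever is known about the
message. Note that latin-1 can decode any byte sequence, so it always
scores 100%; that is why it only makes sense as the last resort, and why
the result is a guess rather than a certainty.

```python
def decodable_fraction(data: bytes, encoding: str) -> float:
    """Return the fraction of bytes in `data` that decode under `encoding`."""
    if not data:
        return 1.0
    bad = 0   # number of bytes the codec rejected
    pos = 0   # offset of the part of `data` not yet examined
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the remainder decoded cleanly
        except UnicodeDecodeError as exc:
            # exc.start/exc.end are offsets into the slice we passed in.
            bad += exc.end - exc.start
            pos += max(exc.end, 1)  # always advance, skip the bad bytes
    return 1.0 - bad / len(data)


def guess_encoding(data: bytes, candidates: list[str]) -> tuple[str, float]:
    """Pick the candidate that decodes the largest share of `data`.

    `candidates` must be non-empty and ordered from most specialized to
    most permissive. max() returns the first maximal item, so on a tie
    the earlier (more specialized) encoding wins.
    """
    scores = [(decodable_fraction(data, enc), enc) for enc in candidates]
    best_score, best_encoding = max(scores, key=lambda item: item[0])
    return best_encoding, best_score


if __name__ == "__main__":
    ranked = ["ascii", "utf-8", "latin-1"]  # assumed ranking, for illustration
    for sample in ("café".encode("utf-8"), "café".encode("latin-1"), b"plain"):
        print(sample, guess_encoding(sample, ranked))
```

The score says only how much of the input a codec accepts, not whether
the decoded text is meaningful, so it should be combined with the other
characteristics of the message mentioned above.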

-- 
Marc-Andre Lemburg
eGenix.com

