[Python-Dev] Encoding detection in the standard library?

Jean-Paul Calderone exarkun at divmod.com
Mon Apr 21 18:59:49 CEST 2008


On Mon, 21 Apr 2008 17:50:43 +0100, Michael Foord <fuzzyman at voidspace.org.uk> wrote:
>skip at pobox.com wrote:
>>     David> Is there some sort of text encoding detection module in the
>>     David> standard library?  And, if not, is there any reason not to add
>>     David> one?
>>
>> No, there's not.  I suspect the fact that you can't correctly determine the
>> encoding of a chunk of text 100% of the time militates against it.
>>
>
>The only approach I know of is a heuristic-based approach, e.g.
>
>http://www.voidspace.org.uk/python/articles/guessing_encoding.shtml
>
>(Which was 'borrowed' from docutils in the first place.)

This isn't the only approach, although you're right that in general you
have to rely on heuristics.  See the charset detection features of ICU:

  http://www.icu-project.org/userguide/charsetDetection.html

I think OSAF's pyicu exposes these APIs:

  http://pyicu.osafoundation.org/
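
For comparison, the simple heuristic style of guessing discussed above
can be sketched in a few lines of plain Python. This is only a rough
illustration, not the code from either link: the name guess_encoding,
the BOM table, and the default candidate list are all hypothetical
choices made for the example. It checks for a byte order mark first,
then tries each candidate encoding in turn, and falls back to latin-1.

```python
import codecs

# Byte order marks, checked longest-first so that a UTF-32 LE BOM
# (ff fe 00 00) is not mistaken for a UTF-16 LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, candidates=("ascii", "utf-8")):
    """Return (text, encoding_name) for a bytes object.

    Hypothetical helper: a BOM check followed by trial decoding.
    The result is a guess, not a reliable detection.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            try:
                # These codecs consume the BOM while decoding.
                return data.decode(name), name
            except UnicodeDecodeError:
                continue
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails.
    # It is a last resort rather than evidence about the real encoding.
    return data.decode("latin-1"), "latin-1"
```

Calling guess_encoding(b"hello") gives ("hello", "ascii"). Note that the
order of the candidates matters, and that a successful decode only shows
the bytes are valid in that encoding, not that the guess is right, which
is why a statistical detector such as ICU's can do better.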

Jean-Paul

