Re: [Python-Dev] Encoding detection in the standard library?
[CCing python-dev again]

On 2008-04-22 12:38, Greg Wilson wrote:
>> I don't think that should be part of the standard library. People will mistake what it tells them for certain. [etc]
>
> These are all good arguments, but the fact remains that we can't control our inputs (e.g., we're archiving mail messages sent to lists managed by DrProject), and some of those inputs *don't* tell us how they're encoded. Under those circumstances, what would you recommend?
I haven't done much research into this, but in general, I think it's better to:

 * first look at other characteristics of the text message, e.g. language, origin, topic, etc.,
 * then narrow down the set of encodings which could apply,
 * rank them to try to avoid ambiguities, and
 * then see what percentage of the text you can decode using each of the encodings in reverse ranking order (i.e. more specialized encodings should be tested first, latin-1 last).

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Apr 22 2008)
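
The last step of the approach above (trying each candidate in order and measuring how much of the text decodes) could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested detector: the function names decodable_fraction, score_candidates and best_guess are made up for this example, the candidate list is assumed to have been narrowed and ordered by hand already, and "percentage decodable" is approximated by counting the U+FFFD characters that the standard "replace" error handler inserts.

```python
# Sketch only: helper names are hypothetical, not part of any library.
REPLACEMENT = "\ufffd"  # inserted by errors="replace" for undecodable input


def decodable_fraction(data: bytes, encoding: str) -> float:
    """Return the share of decoded characters that are not replacements.

    Assumption: text that legitimately contains U+FFFD is counted as
    partly undecodable, which is acceptable for a rough heuristic.
    """
    text = data.decode(encoding, errors="replace")
    if not text:
        return 1.0  # nothing to decode, so nothing failed
    bad = text.count(REPLACEMENT)
    return (len(text) - bad) / len(text)


def score_candidates(data: bytes, candidates: list[str]) -> list[tuple[str, float]]:
    """Score each candidate, keeping the caller's order.

    The caller is expected to list more specialized encodings first
    and latin-1 last, since latin-1 decodes every byte sequence.
    """
    return [(enc, decodable_fraction(data, enc)) for enc in candidates]


def best_guess(data: bytes, candidates: list[str]) -> str:
    """Pick the highest-scoring candidate; earlier candidates win ties."""
    scores = score_candidates(data, candidates)
    # max() returns the first maximal item, so list order breaks ties.
    return max(scores, key=lambda item: item[1])[0]


if __name__ == "__main__":
    sample = "Grüße aus Düsseldorf".encode("utf-8")
    for enc, score in score_candidates(sample, ["ascii", "utf-8", "latin-1"]):
        print(f"{enc}: {score:.2f}")
    print("guess:", best_guess(sample, ["ascii", "utf-8", "latin-1"]))
```

Because latin-1 maps every byte to a character, it always scores 1.0 here, which is why it only makes sense as the last candidate. The result is still just a guess and should not be mistaken for certain.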