Distinguishing cp850 and cp1252?

"Martin v. Löwis" martin at v.loewis.de
Sun Nov 2 21:37:45 EST 2003


David Eppstein wrote:

> Is there an easy way of guessing with reasonable accuracy which of these
> two encodings was used for a particular file?

You could start from the assumption that most characters are letters,
which is plausible if your documents are text documents of some sort.
The idea is that a byte which is a letter in one code is often some
non-letter graphical symbol in the other.

So you would create a predicate "isletter" for each character set, and
then count the number of bytes in a document which are not letters. You
should probably exclude the ASCII characters in counting, since they
have the same interpretation in either code. The code that gives
you fewer (or no) non-letter characters is likely the correct
interpretation.

To find out which bytes are letters, you could decode each byte and use
unicodedata.category; all letter categories start with "L" (followed by
"u" or "l" for cased letters, or by "t", "m", or "o" otherwise). You
should compute a bitmap for each character set up-front, and you should
check how much the set bits overlap: bytes that are letters in both
codes cannot help to tell them apart.
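
As a rough sketch of the counting idea (Python 3; the function names
letter_table, count_nonletters and guess_encoding are made up for this
example, and treating undefined cp1252 bytes as non-letters is my own
assumption), it could look like this:

```python
import unicodedata

def letter_table(encoding):
    """Return 256 booleans: True where the single byte decodes to a letter."""
    table = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            # Assumption: a byte the codec leaves undefined counts as a non-letter.
            table.append(False)
            continue
        table.append(unicodedata.category(ch).startswith("L"))
    return table

# Compute the "bitmaps" once, up-front.
TABLES = {enc: letter_table(enc) for enc in ("cp850", "cp1252")}

# Non-ASCII bytes that are letters in both codes; they cannot discriminate.
AMBIGUOUS = [b for b in range(0x80, 0x100)
             if TABLES["cp850"][b] and TABLES["cp1252"][b]]

def count_nonletters(data, encoding):
    """Count non-ASCII bytes in data that are not letters under encoding."""
    table = TABLES[encoding]
    return sum(1 for b in data if b >= 0x80 and not table[b])

def guess_encoding(data):
    """Return the encoding with fewer non-letter bytes, or None on a tie."""
    scores = {enc: count_nonletters(data, enc) for enc in TABLES}
    best = min(scores.values())
    winners = [enc for enc, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None
```

A tie (for instance, a pure ASCII file, or one using only bytes from
AMBIGUOUS) returns None rather than a guess; how to break ties is left
open here.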

To get higher accuracy, you need advance knowledge of the natural
language your documents are in, and then you need to use a dictionary
of that language.
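
A minimal sketch of the dictionary variant, assuming Python 3 and
German text; the WORDS set is a tiny made-up stand-in for a real
dictionary, and dictionary_hits and guess_by_dictionary are names
invented for this example:

```python
import re

# Hypothetical mini word list, for illustration only; a real application
# would load a full dictionary of the expected language.
WORDS = {"schön", "grüße", "für", "über", "straße"}

def dictionary_hits(data, encoding):
    """Decode data with encoding and count the words found in WORDS."""
    text = data.decode(encoding, errors="replace").lower()
    return sum(1 for word in re.findall(r"\w+", text) if word in WORDS)

def guess_by_dictionary(data, encodings=("cp850", "cp1252")):
    """Return the encoding with the most dictionary hits, or None if unclear."""
    hits = {enc: dictionary_hits(data, enc) for enc in encodings}
    best = max(hits.values())
    winners = [enc for enc, count in hits.items() if count == best]
    return winners[0] if best > 0 and len(winners) == 1 else None
```

Only words containing non-ASCII characters can separate the two codes,
so a dictionary restricted to such words would do for this purpose.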

HTH,
Martin




