Distinguishing cp850 and cp1252?
"Martin v. Löwis"
martin at v.loewis.de
Sun Nov 2 21:37:45 EST 2003
David Eppstein wrote:
> Is there an easy way of guessing with reasonable accuracy which of these
> two encodings was used for a particular file?
You could try the assumption that most characters should be letters,
assuming your documents are likely text documents of some sort. The idea
is that what is a letter in one code is some non-letter graphical symbol
in the other.
So you would create a predicate "isletter" for each character set, and
then count the number of bytes in a document which are not letters. You
should probably exclude the ASCII characters in counting, since they
would have the same interpretation in either code. The code that gives
you fewer (or no) non-letter characters is likely the correct
interpretation.
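A minimal sketch of that counting idea, under some assumptions: the names
count_non_letters and guess_encoding are made up for illustration, str.isalpha
stands in for the "isletter" predicate, and bytes that a code page leaves
undefined are simply counted as non-letters.

```python
def count_non_letters(data, encoding):
    """Count non-ASCII bytes in `data` that are not letters in `encoding`."""
    count = 0
    for byte in data:
        if byte < 0x80:
            continue  # ASCII reads the same in both code pages, so skip it
        # errors="replace" turns bytes undefined in the code page into
        # U+FFFD, which is not a letter and so counts against the encoding.
        char = bytes([byte]).decode(encoding, errors="replace")
        if not char.isalpha():
            count += 1
    return count


def guess_encoding(data, candidates=("cp850", "cp1252")):
    """Return the candidate with the fewest non-letter high bytes.

    On a tie (for example pure ASCII input) the first candidate wins,
    which is an arbitrary choice.
    """
    return min(candidates, key=lambda enc: count_non_letters(data, enc))
```

For example, guess_encoding("Löwis für".encode("cp850")) picks "cp850",
because the two non-ASCII bytes are not letters when read as cp1252.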
To find out which bytes are letters, you could use unicodedata.category;
letter categories start with "L" (usually followed by "l" or "u",
depending on case). You should compute a bitmap for each character set
up-front, and you should find out what the overlap in set bits is.
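The up-front tables could be built roughly like this. It is only a sketch:
letter_table, CP850, CP1252, overlap and decisive are names invented here, and
a plain list of booleans stands in for the bitmap.

```python
import unicodedata


def letter_table(encoding):
    """Return a 256-entry list; entry b is True if byte b is a letter."""
    table = []
    for byte in range(256):
        try:
            char = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            # The code page does not define this byte; treat as non-letter.
            table.append(False)
            continue
        # All Unicode letter categories (Lu, Ll, Lt, Lm, Lo) start with "L".
        table.append(unicodedata.category(char).startswith("L"))
    return table


CP850 = letter_table("cp850")
CP1252 = letter_table("cp1252")

# High bytes that are letters in both code pages cannot help to decide.
overlap = [b for b in range(0x80, 0x100) if CP850[b] and CP1252[b]]

# High bytes that are a letter in exactly one code page carry the signal.
decisive = [b for b in range(0x80, 0x100) if CP850[b] != CP1252[b]]
```

The larger the overlap, the less a simple count can tell you, so it is
worth looking at both lists before trusting the guess.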
To get a higher accuracy, you need advance knowledge of the natural
language your documents are in, and then you need to use a dictionary
of that language.
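A rough sketch of the dictionary variant, assuming you already have a set of
lowercase words for the language. The function names and the tiny word set
below are hypothetical placeholders; a real word list would have to come from
elsewhere.

```python
import re

# Hypothetical stand-in for a real dictionary of the document's language.
SAMPLE_WORDS = {"für", "löwis", "größe", "über"}


def dictionary_score(data, encoding, words):
    """Count decoded words that contain non-ASCII characters and are known."""
    text = data.decode(encoding, errors="replace").lower()
    # Runs of word characters, excluding digits and the underscore.
    tokens = re.findall(r"[^\W\d_]+", text)
    # Pure-ASCII words decode identically either way, so they are ignored.
    return sum(1 for t in tokens if not t.isascii() and t in words)


def guess_with_dictionary(data, words, candidates=("cp850", "cp1252")):
    """Return the candidate whose decoding yields the most known words.

    On a tie the first candidate wins, which is an arbitrary choice.
    """
    return max(candidates, key=lambda enc: dictionary_score(data, enc, words))
```

With the sample set, guess_with_dictionary("Löwis für".encode("cp1252"),
SAMPLE_WORDS) gives "cp1252", since only that decoding produces known words.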
HTH,
Martin