Distinguishing cp850 and cp1252?
eppstein at ics.uci.edu
Mon Nov 3 06:47:00 CET 2003
In article <vqbfqr373nfa0c at news.supernews.com>,
"John Roth" <newsgroups at jhrothjr.com> wrote:
> > Is there an easy way of guessing with reasonable accuracy which of these
> > two encodings was used for a particular file?
> The only way I know of is to do a statistical analysis on letter
> frequencies. To do that, you have to know your data fairly well.
> For example, CP850 has a number of characters devoted to box
> drawing characters. If your data doesn't involve drawing boxes,
> and you find those characters in the input, I'd say that's a strong
> clue that you're dealing with CP1252.
Thanks. After trying some other more hackish things which sort of
worked (e.g. does the encoding lead to unicodes with ord>255?) I settled
on a very simple statistical scheme: vote for how many times the
encoding produces unicodes that answer true to isalpha(). Seems to be
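
A minimal sketch of that voting scheme, written for Python 3 rather than
the Python of 2003. The function names alpha_votes and guess_encoding
are made up for illustration, and errors="replace" is an assumption
added so that byte values cp1252 leaves undefined count as non-letters
instead of raising an exception:

```python
def alpha_votes(data, encoding):
    """Count the characters that are letters when data is decoded as encoding."""
    # Undefined bytes become U+FFFD, which is not alphabetic, so they earn no vote.
    text = data.decode(encoding, errors="replace")
    return sum(ch.isalpha() for ch in text)


def guess_encoding(data, candidates=("cp850", "cp1252")):
    """Return the candidate encoding that yields the most alphabetic characters."""
    # max() keeps the first candidate on a tie, e.g. for pure-ASCII input.
    return max(candidates, key=lambda enc: alpha_votes(data, enc))


if __name__ == "__main__":
    sample = "Gr\u00fc\u00dfe"  # a short word with two non-ASCII letters
    print(guess_encoding(sample.encode("cp1252")))
    print(guess_encoding(sample.encode("cp850")))
```

Very short or pure-ASCII input gives a tie, so the vote only means
something when the file contains a fair number of non-ASCII bytes.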
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science