Distinguishing cp850 and cp1252?

David Eppstein eppstein at ics.uci.edu
Mon Nov 3 00:47:00 EST 2003


In article <vqbfqr373nfa0c at news.supernews.com>,
 "John Roth" <newsgroups at jhrothjr.com> wrote:

> > Is there an easy way of guessing with reasonable accuracy which of these
> > two encodings was used for a particular file?
> 
> The only way I know of is to do a statistical analysis on letter
> frequencies. To do that, you have to know your data fairly well.
> For example, CP850 has a number of characters devoted to box
> drawing characters. If your data doesn't involve drawing boxes,
> and you find those characters in the input, I'd say that's a strong
> clue that you're dealing with CP1252.

Thanks.  After trying some other, more hackish things which sort of
worked (e.g. does the decoding produce characters with ord > 255?) I
settled on a very simple statistical scheme: decode the data with each
encoding, count the resulting Unicode characters that answer true to
isalpha(), and pick the encoding with the higher count.  Seems to be
working...
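
A minimal sketch of that voting scheme in modern Python, for illustration
only and not the code actually used.  The names count_alpha and
guess_encoding are hypothetical, and errors="replace" is an assumption so
that the few bytes cp1252 leaves undefined don't raise:

```python
def count_alpha(data: bytes, encoding: str) -> int:
    """Count decoded characters for which str.isalpha() is true."""
    # errors="replace" (an assumption) turns undecodable bytes into U+FFFD,
    # which is not alphabetic and so earns no vote.
    text = data.decode(encoding, errors="replace")
    return sum(1 for ch in text if ch.isalpha())


def guess_encoding(data: bytes, candidates=("cp850", "cp1252")) -> str:
    """Return the candidate whose decoding yields the most alphabetic characters."""
    # On a tie (e.g. pure ASCII input), max() returns the first candidate.
    return max(candidates, key=lambda enc: count_alpha(data, enc))
```

Pure ASCII data decodes identically under both encodings, so the result
only means something when the file has bytes above 0x7F.  Some of those
bytes are letters in both tables, so short inputs can still tie.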

-- 
David Eppstein                      http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science



