An attempt at guessing the encoding of a (non-unicode) string
Christos TZOTZIOY Georgiou
tzot at sil-tec.gr
Wed Apr 7 05:20:38 EDT 2004
On Mon, 05 Apr 2004 13:37:34 -0700, rumours say that David Eppstein
<eppstein at ics.uci.edu> might have written:
>BTW, if you're going to implement the single-char version, at least for
>encodings that translate one byte -> one unicode position (e.g., not
>utf8), and your texts are large enough, it will be faster to precompute
>a table of byte frequencies in the text and then compute the score by
>summing the frequencies of alphabetic bytes.
Thanks for the pointer, David. However, as often happens, I came
second (or, probably, n-th :). Seo Sanghyeon sent a URL that includes a
two-char proposal, with an algorithm in section 4.7.1 that I find
appropriate for this problem:
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
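For the record, the single-char idea David describes above (count the bytes once, then score each candidate by summing the counts of its alphabetic bytes) could be sketched roughly as below. This is only an illustration under some assumptions: the function names are made up for this post, "alphabetic" is taken to mean whatever str.isalpha() says about the decoded character, and it only makes sense for encodings that map one byte to one unicode position.

```python
from collections import Counter

def alphabetic_bytes(encoding):
    """Return the byte values that decode to an alphabetic character.

    Assumption: 'alphabetic' means str.isalpha() is true for the decoded
    character. Only meaningful for one-byte -> one-code-point encodings.
    """
    alpha = set()
    for value in range(256):
        try:
            char = bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            continue  # this byte is undefined in the encoding
        if char.isalpha():
            alpha.add(value)
    return frozenset(alpha)

def score(freqs, encoding):
    """Sum the precomputed frequencies of the encoding's alphabetic bytes."""
    return sum(freqs[value] for value in alphabetic_bytes(encoding))

def guess_encoding(data, candidates):
    """Pick the candidate with the highest score (hypothetical helper).

    The text is scanned only once to build the frequency table; each
    candidate then costs at most 256 lookups, whatever the text length.
    """
    freqs = Counter(data)  # byte value -> number of occurrences
    return max(candidates, key=lambda enc: score(freqs, enc))
```

Calling guess_encoding(raw_bytes, ["iso8859_7", "latin_1"]) returns the name of the best-scoring candidate; on a tie, max() keeps the earliest candidate in the list, so the order of the candidates matters.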
--
TZOTZIOY, I speak England very best,
Ils sont fous ces Redmontains! --Harddix