algorithm to autodetect (japanese) encodings..

gabor gabor at
Wed Mar 12 23:28:00 CET 2003


i'm playing with mp3 tags,
and it's hell to display them correctly, because most of the files have
id3v1 tags, which carry no encoding info.

so you have to guess...

most of the files have standard english names, so no byte is above 127.
some of them use latin1, some utf-8, and some use one of the japanese
encodings (anime soundtracks :-)

i'm trying to write a toUnicode function:
it should do the following:

if all the bytes are below 128, simply convert to unicode as latin1
or utf-8 (the result should be the same)
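
a minimal sketch of that first step, assuming modern python 3 with its bundled codecs (the function name `decode_if_seven_bit` is made up for illustration). one caveat: iso-2022-jp is itself a 7-bit encoding that switches character sets with ESC sequences, so a field with no byte above 127 can still be japanese. the sketch therefore tries iso-2022-jp first when it sees an ESC byte.

```python
ESC = 0x1B  # ISO-2022-JP announces charset switches with ESC sequences


def decode_if_seven_bit(data: bytes):
    """Return the decoded text if every byte is below 128, else None.

    Hypothetical helper: the name and the ESC shortcut are assumptions,
    not something taken from an existing library.
    """
    if any(b > 127 for b in data):
        return None  # needs the high-byte heuristics instead
    if ESC in data:
        try:
            return data.decode("iso2022_jp")
        except UnicodeDecodeError:
            pass  # a stray ESC, fall through to plain ASCII
    # For bytes below 128, ASCII, latin1 and utf-8 all give the same result.
    return data.decode("ascii")
```

a return value of `None` means the data has high bytes and the guessing has to go further.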

if some bytes are above 127, apply some heuristics to tell apart utf-8
and the 3 japanese encodings (shift-jis, iso-2022-jp, euc-jp).
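
one possible heuristic is plain trial decoding: try the strictest encodings first and keep the first one that decodes without error. the sketch below assumes modern python 3, where the `utf_8`, `iso2022_jp`, `euc_jp`, `shift_jis` and `latin_1` codecs ship with the standard library. the candidate order and the latin1 fallback are assumptions chosen for illustration, not a proven recipe, and short strings can still be guessed wrong.

```python
# Strict, self-validating encodings first; latin1 accepts any byte
# sequence, so it can only serve as the last resort.
CANDIDATES = ("utf_8", "euc_jp", "shift_jis")


def to_unicode(data: bytes, fallback: str = "latin_1") -> str:
    """Guess the encoding of a tag field by trial decoding (a sketch)."""
    if b"\x1b" in data:
        # ESC sequences suggest ISO-2022-JP, which is 7-bit only.
        try:
            return data.decode("iso2022_jp")
        except UnicodeDecodeError:
            pass
    for encoding in CANDIDATES:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode(fallback)
```

utf-8 goes first because random high bytes are very unlikely to form valid utf-8 by accident, while euc-jp and shift-jis accept a wider range of byte pairs.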

my question:
does anyone have a working algorithm to pick the correct encoding among the
3 japanese encodings?
i know java does it, but does anyone have python source code?
or something easily translatable to python?


