Detecting Russian and Ukrainian character sets
tchur at optushome.com.au
Thu Sep 12 17:04:22 EDT 2002
Here are two questions for Russian and Ukrainian Python users:
1) I understand that a common problem when processing text data
collected from various sources in Russia and Ukraine is the
mixture of character sets in use: MS-DOS, Windows, Linux, Unix
and Mac machines may each use one (or more) of a number of
character sets to encode strings, and when such data are
supplied in text files, there is usually no indication of
which character set was used. Is this correct?
http://czyborra.com/charsets/cyrillic.html has a listing of
known Cyrillic character sets.
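
To illustrate the problem described in question 1, here is a minimal
sketch. It assumes Python 3 and its standard codec names; the sample
word, the list of codecs and the helper name encode_all are
illustrative choices, not taken from any real data set. The same word
produces a different byte string under each encoding, and nothing in
the bytes themselves says which one was used.

```python
# Sample word and codec list are illustrative assumptions.
WORD = "Привет"
ENCODINGS = ["cp866", "cp1251", "koi8_r", "iso8859_5", "mac_cyrillic"]


def encode_all(text):
    """Return a dict mapping each codec name to the encoded bytes."""
    return {name: text.encode(name) for name in ENCODINGS}


# Print the raw bytes (as hex) that each encoding produces.
for name, raw in encode_all(WORD).items():
    print(name, raw.hex())

# Reading KOI8-R bytes as if they were Windows-1251 gives readable
# Cyrillic letters, but not the original word.
print(WORD.encode("koi8_r").decode("cp1251"))
```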
2) Are there any Python routines available for automatically
deducing which character set was used to encode a particular
text file (or a particular string)? There is a module for
Perl called Lingua::RU::Charset which seems to address this
problem, at least for Russian encodings.
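
As a rough idea of what such a routine might look like, here is a
sketch of one simple heuristic. It is not the method used by
Lingua::RU::Charset or by any existing Python module. It assumes
Python 3; the candidate list, the letter sets, the scoring weights
and the names score and guess_cyrillic_encoding are all hypothetical
choices made for illustration. The idea is to decode the bytes under
each candidate encoding and prefer the result that looks most like
ordinary lowercase Russian or Ukrainian text.

```python
# Candidate codecs to try. The list and its order are assumptions;
# on a tie the earlier entry wins.
CANDIDATES = ["cp1251", "koi8_r", "koi8_u", "cp866",
              "iso8859_5", "mac_cyrillic"]

# Ten frequent Russian letters (an approximate, assumed ranking).
COMMON = set("оеаинтсрвл")
# All lowercase Russian letters plus the extra Ukrainian ones.
LOWER = set("абвгдежзийклмнопрстуфхцчшщъыьэюяёіїєґ")
UPPER = {ch.upper() for ch in LOWER}


def score(text):
    """Score how much decoded text resembles Cyrillic prose."""
    total = 0
    for ch in text:
        if ch in COMMON:
            total += 2      # frequent lowercase letter
        elif ch in LOWER:
            total += 1      # any other lowercase letter
        elif ch in UPPER:
            continue        # capitals are neutral
        elif ord(ch) > 127:
            total -= 1      # box drawing, symbols, other alphabets
    return total


def guess_cyrillic_encoding(raw):
    """Return the candidate codec whose decoding scores highest."""
    return max(
        CANDIDATES,
        key=lambda name: score(raw.decode(name, errors="replace")),
    )


sample = "Привет, мир! Это простой пример текста на русском языке."
print(guess_cyrillic_encoding(sample.encode("koi8_r")))
```

A heuristic like this can easily be fooled by very short strings or
text that is mostly capitals, and it cannot tell koi8_r from koi8_u
on purely Russian text, since the two encodings differ only in a few
byte positions used for Ukrainian letters.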