recycling internationalized garbage

garabik-news-2005-05 at kassiopeia.juls.savba.sk
Wed Mar 8 16:04:59 CET 2006


Fredrik Lundh <fredrik at pythonware.com> wrote:
> "aaronwmail-usenet at yahoo.com" wrote:
> 
>> Question: what is a good strategy for taking an 8bit
>> string of unknown encoding and recovering the largest
>> amount of reasonable information from it (translated to
>> utf8 if needed)?  The string might be in any of the
>> myriad encodings that predate unicode.  Has anyone
>> done this in Python already?  The output must be clean
>> utf8 suitable for arbitrary xml parsers.
> 
> some alternatives:
> 
> braindead bruteforce:
> 
>    try to do strict decoding as utf-8.  if you succeed, you have an utf-8
>    string.  if not, assume iso-8859-1.

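That was a mistake I made once.
Do not use the stock iso8859-1 Python codec as the fallback: it maps
every byte value to a character, so decoding never fails and garbage
passes through unnoticed. Instead, create your own codec, called e.g.
iso8859-1-ncc, that leaves the control-character ranges undefined,
like this (just a sketch):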

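import codecs

# identity map for bytes 32-127 and 160-255 only; 0-31 and 128-159 stay undefined
decoding_map = codecs.make_identity_dict(range(32, 128)+range(128+32, 256))
decoding_map.update({})  # placeholder for any extra byte -> code point mappings
encoding_map = codecs.make_encoding_map(decoding_map)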
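
The sketch above only builds the two maps; it does not show how they
become a codec that can be used by name. One possible way to wire them
up is shown here. This is a Python 3 adaptation under my own
assumptions, not the author's code: the helper names _decode, _encode
and _search are hypothetical, and only the codec name iso8859-1-ncc
comes from the post.

```python
import codecs

# Identity mapping for bytes 0x20-0x7F and 0xA0-0xFF only.
# Bytes 0x00-0x1F and 0x80-0x9F are left unmapped, so strict decoding rejects them.
decoding_map = codecs.make_identity_dict(list(range(32, 128)) + list(range(160, 256)))
encoding_map = codecs.make_encoding_map(decoding_map)

CODEC_NAME = "iso8859-1-ncc"  # name taken from the post


def _decode(data, errors="strict"):
    # charmap_decode returns a (text, bytes_consumed) tuple, as codecs expect
    return codecs.charmap_decode(data, errors, decoding_map)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, encoding_map)


def _search(name):
    # Depending on the Python version the lookup name may arrive with
    # hyphens or with underscores, so compare a normalized form.
    if name.replace("_", "-") == CODEC_NAME:
        return codecs.CodecInfo(name=CODEC_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(_search)
```

After registration, b"caf\xe9".decode("iso8859-1-ncc") succeeds, while
a string containing bytes such as 0x93 raises UnicodeDecodeError, which
is what lets the next candidate encoding get a chance.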

and then use :

def try_encoding(s, encodings):
    "try to guess the encoding of string s, testing encodings given in second parameter"

    for enc in encodings:
        try:
            unicode(s, enc)  # strict decode; raises if s is not valid in enc
            return enc
        except UnicodeDecodeError:
            pass  # not valid in this encoding, try the next one

    return None


guessed_encoding = try_encoding(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])
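
For readers on Python 3, the same first-match idea might be sketched as
follows, returning the decoded text together with the encoding name.
This is an illustration under assumptions, not the author's code:
guess_decode and CANDIDATES are hypothetical names, and the list uses
only built-in codecs, so the custom iso8859-1-ncc step is left out.

```python
def guess_decode(data, encodings):
    """Return (encoding, text) for the first encoding that decodes
    data strictly, or (None, None) if none of them does."""
    for enc in encodings:
        try:
            return enc, data.decode(enc)  # strict decoding is the default
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    return None, None


# Order matters: the strictest codec goes first, and a codec that
# accepts every byte (mac_roman here) must come last.
CANDIDATES = ["utf-8", "cp1252", "mac_roman"]

encoding, text = guess_decode(b"caf\xe9", CANDIDATES)
```

Here utf-8 rejects the lone 0xE9 byte, so the result comes from cp1252.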


It seems to work surprisingly well if you know approximately which
language(s) the text is expected to be in (e.g. for Central European
languages, replace cp1252 with cp1250 and iso8859-1-ncc with
iso8859-2-ncc).
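
A small illustration of why the expected language matters: many bytes
are valid in both cp1250 and cp1252, so strict decoding alone cannot
choose between them and the candidate order decides the result. The
sample bytes below are my own assumption, picked for illustration
(Python 3).

```python
# Assumed sample: the Slovak/Czech word for "tea", encoded as cp1250.
raw = b"\xe8aj"

as_cp1250 = raw.decode("cp1250")  # byte 0xE8 is U+010D (c with caron)
as_cp1252 = raw.decode("cp1252")  # byte 0xE8 is U+00E8 (e with grave)

# Both decodes succeed without error, yet they give different text.
same_text = (as_cp1250 == as_cp1252)
```

Neither call raises, so a first-match loop returns whichever codec is
listed first; knowing the likely language is what makes that choice
right.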

-- 
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk     |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


