recycling internationalized garbage

garabik-news-2005-05 at garabik-news-2005-05 at
Wed Mar 8 10:04:59 EST 2006

Fredrik Lundh <fredrik at> wrote:
> "aaronwmail-usenet at" wrote:
>> Question: what is a good strategy for taking an 8bit
>> string of unknown encoding and recovering the largest
>> amount of reasonable information from it (translated to
>> utf8 if needed)?  The string might be in any of the
>> myriad encodings that predate unicode.  Has anyone
>> done this in Python already?  The output must be clean
>> utf8 suitable for arbitrary xml parsers.
> some alternatives:
> braindead bruteforce:
>    try to do strict decoding as utf-8.  if you succeed, you have an utf-8
>    string.  if not, assume iso-8859-1.

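that was a mistake I made once.
Do not use iso8859-1 as a python codec: it accepts every byte value, so
decoding never fails and control characters pass straight through.
Instead create your own codec, called e.g. iso8859-1-ncc, which leaves
the control character ranges unmapped, like this (just a sketch):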

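import codecs
# printable ASCII plus the upper half of latin-1; control characters are left out
decoding_map = codecs.make_identity_dict(list(range(32, 128)) + list(range(128+32, 256)))
encoding_map = codecs.make_encoding_map(decoding_map)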
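
The two maps alone do not make the name iso8859-1-ncc usable yet; they still have to be wrapped in a codec and registered. A minimal sketch of one way to do that in Python 3 follows. The helper names (ncc_decode, ncc_encode, ncc_search) are made up for this example, and adding tab, LF and CR back into the map is an assumption of the sketch, not part of the original two lines. The maps are repeated so the block runs on its own.

```python
import codecs

# Maps from the sketch above. Tab, LF and CR (9, 10, 13) are added back as an
# assumption of this example, since ordinary text contains them.
decoding_map = codecs.make_identity_dict(
    [9, 10, 13] + list(range(32, 128)) + list(range(128 + 32, 256))
)
encoding_map = codecs.make_encoding_map(decoding_map)


def ncc_decode(data, errors="strict"):
    # Unmapped bytes raise UnicodeDecodeError under errors="strict".
    return codecs.charmap_decode(data, errors, decoding_map)


def ncc_encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, encoding_map)


def ncc_search(name):
    # The codec registry lowercases the name; newer Python versions also
    # turn hyphens into underscores, so compare a normalized form.
    if name.replace("-", "_") == "iso8859_1_ncc":
        return codecs.CodecInfo(
            name="iso8859-1-ncc", encode=ncc_encode, decode=ncc_decode
        )
    return None


codecs.register(ncc_search)
```

After registration, b"caf\xe9".decode("iso8859-1-ncc") should decode normally, while input containing bytes in the 128-159 range (for example the cp1252 curly quotes \x93 and \x94) should raise UnicodeDecodeError, which is what lets the next codec in the list get its turn.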

and then use:

def try_encodings(s, encodings):
    "try to guess the encoding of string s, testing encodings given in second parameter"

    for enc in encodings:
        try:
            s.decode(enc)
            return enc
        except UnicodeDecodeError:
            # this encoding does not fit, try the next one
            pass

    return None

guessed_encoding = try_encodings(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])

it seems to work surprisingly well, if you know approximately the
language(s) the text is expected to be in (e.g. replace cp1252 with
cp1250, iso8859-1-ncc with iso8859-2-ncc for Central European languages).
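
A usage sketch of the language-dependent candidate lists, written for Python 3 and using only built-in codecs so that it runs as-is. The names WESTERN, CENTRAL_EUROPEAN and recode_to_utf8 are made up for this example, and the -ncc codecs are left out only because they are not built in; in practice they would sit right after utf-8 in each list.

```python
# Candidate lists, strictest codec first. The -ncc codecs from the message
# are omitted here (assumption of this sketch) because they are not built in.
WESTERN = ["utf-8", "cp1252", "mac_roman"]
CENTRAL_EUROPEAN = ["utf-8", "cp1250", "mac_roman"]


def try_encodings(data, encodings):
    """Return the first encoding that decodes data without error, or None."""
    for enc in encodings:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            pass
    return None


def recode_to_utf8(data, encodings):
    """Hypothetical helper: decode with the guessed encoding, re-encode as utf-8."""
    enc = try_encodings(data, encodings)
    if enc is None:
        return None
    return data.decode(enc).encode("utf-8")


# Czech sample text as it would arrive from a cp1250 source.
raw = "žluťoučký kůň".encode("cp1250")
print(try_encodings(raw, CENTRAL_EUROPEAN))
print(recode_to_utf8(raw, CENTRAL_EUROPEAN).decode("utf-8"))
```

The order of the list matters: a codec that maps every byte value (mac_roman here) never fails, so it only makes sense as the last resort, and a list that does not match the language will typically still return some encoding, just the wrong one.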

| Radovan Garabík |
| __..--^^^--..__    garabik @     |
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
