recycling internationalized garbage

Fredrik Lundh fredrik at
Wed Mar 8 15:33:55 CET 2006

"aaronwmail-usenet at" wrote:

> Question: what is a good strategy for taking an 8bit
> string of unknown encoding and recovering the largest
> amount of reasonable information from it (translated to
> utf8 if needed)?  The string might be in any of the
> myriad encodings that predate unicode.  Has anyone
> done this in Python already?  The output must be clean
> utf8 suitable for arbitrary xml parsers.

some alternatives:

braindead bruteforce:

    try to do strict decoding as utf-8.  if you succeed, you have a utf-8
    string.  if not, assume iso-8859-1.
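
    a minimal sketch of that fallback, assuming Python 3 and a bytes
    input.  the function name to_utf8 is made up for illustration; only
    the standard bytes.decode and str.encode calls are used.

```python
def to_utf8(data):
    """Return the given bytes re-encoded as UTF-8 (hypothetical helper)."""
    try:
        # Strict decoding is the default: any invalid UTF-8 sequence raises.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this
        # decode cannot fail, though the guess may still be wrong.
        text = data.decode("iso-8859-1")
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"caf\xc3\xa9"))  # already valid UTF-8
    print(to_utf8(b"caf\xe9"))      # treated as ISO-8859-1
```

    the result is always valid utf-8, but it can still contain control
    characters that XML 1.0 does not allow, so a separate filtering
    step may be needed before handing it to an xml parser.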

slightly smarter bruteforce:

more advanced (but possibly not good enough for very short texts):

