recycling internationalized garbage
fredrik at pythonware.com
Wed Mar 8 15:33:55 CET 2006
"aaronwmail-usenet at yahoo.com" wrote:
> Question: what is a good strategy for taking an 8bit
> string of unknown encoding and recovering the largest
> amount of reasonable information from it (translated to
> utf8 if needed)? The string might be in any of the
> myriad encodings that predate unicode. Has anyone
> done this in Python already? The output must be clean
> utf8 suitable for arbitrary xml parsers.
try to do strict decoding as utf-8. if you succeed, you have a utf-8
string. if not, assume iso-8859-1 and re-encode the result as utf-8.
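
a minimal sketch of that two-step fallback, assuming Python 3 and input that arrives as a bytes object. the function name to_clean_utf8 is made up for illustration:

```python
def to_clean_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing between UTF-8 and ISO-8859-1."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid byte sequence.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this cannot fail.
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")
```

note that the iso-8859-1 branch can let control characters through (for example bytes 0x00-0x08), which xml 1.0 does not permit, so an xml consumer may still need those filtered out separately.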
slightly smarter bruteforce:
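
one way to read "bruteforce" here is to try a list of candidate encodings in order and keep the first one that decodes without errors. the sketch below assumes Python 3. the candidate list and the name decode_bruteforce are assumptions chosen for illustration, not something taken from this post:

```python
# Hypothetical candidate list: strictest encodings first, catch-all last.
CANDIDATES = ("utf-8", "cp1252", "iso-8859-1")

def decode_bruteforce(raw, encodings=CANDIDATES):
    """Return (text, encoding_name) for the first encoding that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes, try the next one
    # Only reached if the caller's list has no catch-all such as iso-8859-1:
    # keep what decodes as UTF-8 and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

the order matters: a single-byte encoding such as iso-8859-1 accepts any input, so it has to come last, and a successful decode only means the bytes were valid in that encoding, not that the guess was right.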
more advanced (but possibly not good enough for very short texts):