recycling internationalized garbage
Fredrik Lundh
fredrik at pythonware.com
Wed Mar 8 09:33:55 EST 2006
"aaronwmail-usenet at yahoo.com" wrote:
> Question: what is a good strategy for taking an 8bit
> string of unknown encoding and recovering the largest
> amount of reasonable information from it (translated to
> utf8 if needed)? The string might be in any of the
> myriad encodings that predate unicode. Has anyone
> done this in Python already? The output must be clean
> utf8 suitable for arbitrary xml parsers.
some alternatives:
braindead bruteforce:
try to do strict decoding as utf-8. if you succeed, you have a utf-8
string. if not, assume iso-8859-1.
slightly smarter bruteforce:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/163743
more advanced (but possibly not good enough for very short texts):
http://chardet.feedparser.org/
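a minimal sketch of the first ("braindead") alternative, assuming python 3 and a bytes input. the function name to_clean_utf8 and the final step that drops characters not allowed in xml 1.0 are illustrative additions, not part of the alternatives listed above; the step is there only because the question asks for output that xml parsers will accept.

```python
import re

# assumption: control characters that XML 1.0 forbids are simply dropped
# (tab, newline and carriage return are kept)
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def to_clean_utf8(raw):
    """Return utf-8 bytes recovered from a byte string of unknown encoding."""
    try:
        # bytes.decode is strict by default, so invalid utf-8 raises
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a code point, so this cannot fail;
        # the result may still be wrong if the real encoding was something else
        text = raw.decode("iso-8859-1")
    return _XML_ILLEGAL.sub("", text).encode("utf-8")
```

note that the fallback never raises, so it cannot tell iso-8859-1 apart from any other 8-bit encoding; for that, one of the detection-based alternatives is the better fit.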
</F>