recycling internationalized garbage

aaronwmail-usenet at yahoo.com
Wed Mar 8 15:22:19 CET 2006


Hi folks,

Please help me with international string issues:
I put together an AJAX discography search engine

http://www.xfeedme.com/discs/discography.html

using data from the FreeDB music database

http://www.freedb.org/

Unfortunately FreeDB has a lot of junk in it, including
randomly mixed character encodings for international
strings.  As an expedient I decided to just delete all
characters that weren't ascii, so I could get the thing
running.  Now I look through the log files and notice that
a certain category of user immediately homes in on this
and finds it amusing to see how badly I've mangled
the strings :(.  I presume they chuckle and make
disparaging remarks about "united states of ascii"
and then leave never to return.

Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)?  The string might be in any of the
myriad encodings that predate unicode.  Has anyone
done this in Python already?  The output must be clean
utf8 suitable for arbitrary xml parsers.
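
One possible shape for such a routine is a "try a few decoders
in order" fallback chain.  This is a minimal sketch, not a tested
solution: the function name, the candidate list and its order are
all assumptions made for illustration, and it is written in
Python 3 syntax.  It tries strict UTF-8 first, then cp1252, then
latin-1 (which never fails), strips characters that XML 1.0
forbids, and re-encodes as UTF-8.

```python
import re

# Hypothetical candidate list: strict decoders first, tried in order.
# The right choice and order depend on what the data actually contains.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")

# Matches any character outside the set XML 1.0 allows in a document.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def to_clean_utf8(raw: bytes) -> bytes:
    """Decode bytes of unknown encoding and return XML-safe UTF-8."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # latin-1 maps every byte to a code point, so this cannot fail.
        # The result may be wrong, but no bytes are silently dropped.
        text = raw.decode("latin-1")
    # Remove control characters and other code points XML 1.0 rejects.
    text = _XML_ILLEGAL.sub("", text)
    return text.encode("utf-8")
```

A chain like this only guesses well when the real encodings are
among the candidates; multi-byte legacy encodings (Japanese,
Chinese, Cyrillic and so on) would decode as cp1252 gibberish
rather than fail, so a statistical encoding detector may be a
better first step for those.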

Thanks,  -- Aaron Watters

===

As someone once remarked to Schubert
"take me to your lieder" (sorry about that).
   -- Tom Lehrer



