recycling internationalized garbage
aaronwmail-usenet at yahoo.com
Tue Mar 14 10:18:06 EST 2006
Regarding cleaning of mixed string encodings in
the discography search engine
http://www.xfeedme.com/discs/discography.html
Following </F>'s suggestion I came up with this:
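import codecs

utf8enc = codecs.getencoder("utf8")
utf8dec = codecs.getdecoder("utf8")
iso88591dec = codecs.getdecoder("iso-8859-1")

def checkEncoding(s):
    # Try UTF-8 first; fall back to ISO-8859-1 if the bytes are not valid UTF-8.
    try:
        (uni, dummy) = utf8dec(s)
    except UnicodeDecodeError:
        (uni, dummy) = iso88591dec(s, 'ignore')
    # Re-encode so the result is always UTF-8.
    (out, dummy) = utf8enc(uni)
    return out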
This works nicely for Nordic stuff like
"björgvin halldórsson - gunnar Þórðarson",
but Russian seems to turn into garbage
and I have no idea about Chinese.
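One possible reason for the Russian garbage, sketched here as a guess rather than a diagnosis: if those strings arrive in a Cyrillic single-byte encoding, the ISO-8859-1 fallback decodes them without error but maps every byte to the wrong character. The snippet below uses Python 3 syntax; the name check_encoding, the fallback parameter, the choice of KOI8-R, and the sample strings are all assumptions made for illustration, not anything taken from the actual data.

```python
def check_encoding(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw re-encoded as UTF-8, guessing the source encoding.

    Hypothetical variant of checkEncoding: tries UTF-8 first, then the
    given fallback encoding (an assumed parameter, not in the original).
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback, "ignore")
    return text.encode("utf-8")


# Sample inputs (assumed for illustration only).
nordic = "björgvin halldórsson - gunnar Þórðarson"
russian = "Чайковский"

# Nordic text stored as ISO-8859-1 survives the default fallback.
nordic_ok = check_encoding(nordic.encode("iso-8859-1")) == nordic.encode("utf-8")

# Russian text stored as KOI8-R is silently mis-decoded by the default fallback.
russian_garbled = check_encoding(russian.encode("koi8-r")) != russian.encode("utf-8")

# The same bytes come out right once the fallback matches the real encoding.
russian_ok = (
    check_encoding(russian.encode("koi8-r"), fallback="koi8-r")
    == russian.encode("utf-8")
)

print(nordic_ok, russian_garbled, russian_ok)
```

ISO-8859-1 accepts any byte sequence, so the fallback never fails and cannot signal that it picked the wrong encoding. Knowing or guessing the real source encoding has to come from somewhere other than the decode step itself.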
Unless someone has any other ideas I'm
giving up now.
-- Aaron Watters
===
In theory, theory is the same as practice.
In practice it's more complicated than that.
-- folklore