recycling internationalized garbage

aaronwmail-usenet at aaronwmail-usenet at
Tue Mar 14 16:18:06 CET 2006

Regarding cleaning of mixed string encodings in
the discography search engine

Following </F>'s suggestion I came up with this:

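import codecs

utf8enc = codecs.getencoder("utf8")
utf8dec = codecs.getdecoder("utf8")
iso88591dec = codecs.getdecoder("iso-8859-1")

def checkEncoding(s):
    # Prefer UTF-8; if the bytes are not valid UTF-8, fall back to ISO-8859-1.
    try:
        (uni, dummy) = utf8dec(s)
    except UnicodeDecodeError:
        (uni, dummy) = iso88591dec(s, 'ignore')
    # Always hand back UTF-8 encoded bytes.
    (out, dummy) = utf8enc(uni)
    return out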

This works nicely for Nordic stuff like
"björgvin halldórsson - gunnar Þórðarson",
but Russian seems to turn into garbage,
and I have no idea about Chinese.
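
A possible explanation for the Russian garbage, assuming those
records arrive in a legacy Cyrillic encoding such as KOI8-R (nothing
above confirms this): ISO-8859-1 assigns a character to every byte
value, so the fallback never fails, it just produces the wrong
letters. The sketch below is only an illustration; to_utf8 and its
fallback parameter are hypothetical names, not part of the code above.

```python
def to_utf8(raw, fallback="iso-8859-1"):
    """Return raw (bytes) re-encoded as UTF-8 bytes.

    Valid UTF-8 passes through unchanged. Anything else is decoded
    with the caller-supplied fallback encoding (a hypothetical
    parameter; the caller has to know or guess the right value).
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback, "replace")
    return text.encode("utf-8")


# Sample data, assumed for illustration: the Russian word for "music".
russian = u"\u043c\u0443\u0437\u044b\u043a\u0430"
koi8_bytes = russian.encode("koi8-r")

# Wrong guess: Latin-1 decodes without error but yields mojibake.
garbled = to_utf8(koi8_bytes)

# Right guess: the text survives the round trip.
recovered = to_utf8(koi8_bytes, fallback="koi8-r")
```

Note that this cannot pick the fallback automatically: KOI8-R also
maps every byte to some character, so a wrong guess there fails just
as silently. Some per-record hint about the likely language or source
would still be needed.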

Unless someone has any other ideas I'm
giving up now.
   -- Aaron Watters


In theory, theory is the same as practice.
In practice it's more complicated than that.
  -- folklore

