[Tutor] i18n Encoding/Decoding issues

Mon Aug 14 13:52:44 CEST 2006

Jorge De Castro wrote:
> Hi all,
>
> It seems I can't get rid of my continuous i18n issues with Python :(
>   
You're not alone :-)
> I've been through:
> http://docs.python.org/lib/module-email.Header.html
> and
> http://www.reportlab.com/i18n/python_unicode_tutorial.html
> to no avail.
>   
Try these:
http://www.joelonsoftware.com/articles/Unicode.html
http://jorendorff.com/articles/unicode/index.html
> Basically, I'm receiving and processing mail that comes with content (from 
> a UTF-8-accepting form) from many locales (France, Germany, etc.)
>
> def splitMessage() does what the name indicates, and send message is the 
> code below.
>
> def sendMessage(text):
>     to, From, subject, body = splitMessage(text)
>     msg = MIMEText(decodeChars(body), 'plain', 'UTF-8')
>     msg['From'] = From
>     msg['To'] = to
>     msg['Subject'] = Header(decodeChars(subject), 'UTF-8')
>
> def decodeChars(str=""):
>     if not str: return None
>     for characterCode in _characterCodes.keys():
>         str = str.replace(characterCode, _characterCodes[characterCode])
>     return str
>
> Now as you have noticed, this only works, ie, I get an email sent with the 
> i18n characters displayed correctly, after I pretty much wrote my own 
> 'urldecode' map
>
> _characterCodes ={  "%80" : "&#65533;", "%82" : "&#65533;", "%83" : 
> "&#65533;", "%84" : "&#65533;", \
>                     "%85" : "&#65533;", "%86" : "&#65533;",	"%87" : 
> "&#65533;", "%88" : "&#65533;", \
>                     "%89" : "&#65533;", "%8A" : "&#65533;", "%8B" : 
> "&#65533;", "%8C" : "&#65533;", \
>                     "%8E" : "&#65533;", "%91" : "&#65533;", "%92" : 
> "&#65533;", "%93" : "&#65533;", \
>                     "%94" : "&#65533;", "%95" : "&#65533;", "%96" : 
> "&#65533;", "%97" : "&#65533;", \
> ...
>
> Which feels like a horrible kludge.
>   
This _characterCodes map replaces chars in the range 80-9F with a 
Unicode "undefined" marker, so I don't understand how using it gives you 
a correct result.
> Note that using urllib.unquote doesn't do it - I get an error saying that it 
> is unable to . Replacing my decodeChars
>
> msg = MIMEText(urllib.unquote(body), 'plain', 'UTF-8')
>
> Returns content with i18n characters mangled.
>   
From the selection of characters you have chosen to replace, my guess 
is that your source data is urlencoded Cp1252, not urlencoded UTF-8. So 
when you unquote it and then call it UTF-8, which is what the above code 
does, you get incorrect display. What happens if you change UTF-8 to 
Cp1252 in the call to MIMEText?
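
A minimal sketch of that mismatch, assuming Python 3 (where the unquoting function lives in urllib.parse rather than in urllib, as it does in the code quoted above). The sample string "caf%E9" is a made-up illustration, not data from this thread. It shows how the same unquoted byte reads correctly as Cp1252 but turns into a replacement character when it is labelled UTF-8:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sample: "café" as a Cp1252-based form would urlencode it.
sample = "caf%E9"

# Unquoting only turns %XX escapes back into raw bytes; it does not
# know or care which character encoding those bytes are in.
raw = unquote_to_bytes(sample)

# Interpreted as Cp1252, byte 0xE9 is "é".
as_cp1252 = raw.decode("cp1252")

# Interpreted as UTF-8, a lone 0xE9 is an incomplete sequence, so it
# becomes U+FFFD (the replacement character) instead of "é".
as_utf8 = raw.decode("utf-8", errors="replace")

# The same word urlencoded from UTF-8 would use two bytes for "é".
from_utf8_form = unquote_to_bytes("caf%C3%A9").decode("utf-8")

print(repr(as_cp1252), repr(as_utf8), repr(from_utf8_form))
```

If real data contains escapes like %E9 or %FC standing alone, that points at Cp1252 or Latin-1; pairs such as %C3%A9 point at UTF-8.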
> Using unicode(body, 'latin-1').encode('utf-8') doesn't work either. Besides, 
> am I the only one to feel that if I want to encode something in UTF-8 it 
> doesn't feel intuitive to have to convert to latin-1 first and then encode?
>   
It doesn't work because the urlencoded text is ascii, not latin-1. I 
suspect that
  unicode(urllib.unquote(body), 'Cp1252').encode('UTF-8')
would give you what you want.
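
The same idea as a rough Python 3 sketch, assuming the form data really is urlencoded Cp1252. The function name build_message, its source_encoding parameter, and the sample strings are hypothetical and only for illustration. The point is the order of operations: unquote to bytes, decode from the source encoding, then let the email package encode to UTF-8.

```python
from email.header import Header
from email.mime.text import MIMEText
from urllib.parse import unquote_to_bytes


def build_message(body, subject, source_encoding="cp1252"):
    """Unquote urlencoded fields and build a UTF-8 mail message.

    source_encoding is an assumption about what the form submitted;
    change it if the data turns out to be something else.
    """
    # Step 1: %XX escapes -> raw bytes. Step 2: bytes -> text.
    body_text = unquote_to_bytes(body).decode(source_encoding)
    subject_text = unquote_to_bytes(subject).decode(source_encoding)

    # Step 3: MIMEText and Header encode the text as UTF-8 for us.
    msg = MIMEText(body_text, "plain", "utf-8")
    msg["Subject"] = Header(subject_text, "utf-8")
    return msg


# Hypothetical sample input, urlencoded from Cp1252.
msg = build_message("Gr%FC%DFe aus K%F6ln", "Caf%E9")
print(msg.get_payload(decode=True).decode("utf-8"))
```

Note that this does not translate "+" to a space; if the form data uses that convention, urllib.parse.unquote_plus with an encoding argument may be a better fit.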
> Any ideas? I am out of other options and really don't want to keep my kludge 
> (unless I absolutely have to)
>   
Post some of your actual data; it will be obvious whether it is encoded 
from Cp1252 or UTF-8.
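
If you would rather check the data yourself, here is one possible heuristic, sketched for Python 3. The function name guess_encoding and the sample strings are hypothetical. It relies on the fact that text with non-ASCII Cp1252 characters is very rarely valid UTF-8, so a strict UTF-8 decode that fails is a strong hint. It is a guess, not proof: pure ASCII input passes as UTF-8, and an unlucky Cp1252 byte sequence could pass too.

```python
from urllib.parse import unquote_to_bytes


def guess_encoding(quoted):
    """Guess whether urlencoded text came from UTF-8 or Cp1252."""
    raw = unquote_to_bytes(quoted)
    try:
        # Strict decoding raises on any invalid UTF-8 sequence.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; assume the Windows Western European code page.
        return "cp1252"
    return "utf-8"


print(guess_encoding("caf%E9"))      # lone high byte
print(guess_encoding("caf%C3%A9"))   # two-byte UTF-8 sequence
```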

Keep trying, it's worth it to actually understand what is going on. 
Trying to solve encoding problems when you don't understand the basic 
issues is unlikely to give a good solution.

Kent