[Tutor] i18n Encoding/Decoding issues
Kent Johnson
kent37 at tds.net
Mon Aug 14 13:52:44 CEST 2006
Jorge De Castro wrote:
> Hi all,
>
> It seems I can't get rid of my continuous issues i18n with Python :(
>
You're not alone :-)
> I've been through:
> http://docs.python.org/lib/module-email.Header.html
> and
> http://www.reportlab.com/i18n/python_unicode_tutorial.html
> to no avail.
>
Try these:
http://www.joelonsoftware.com/articles/Unicode.html
http://jorendorff.com/articles/unicode/index.html
> Basically, I'm receiving and processing mail that comes with content (from
> an utf-8 accepting form) from many locales (France, Germany, etc)
>
> def splitMessage() does what the name indicates, and send message is the
> code below.
>
> def sendMessage(text):
>     to, From, subject, body = splitMessage(text)
>     msg = MIMEText(decodeChars(body), 'plain', 'UTF-8')
>     msg['From'] = From
>     msg['To'] = to
>     msg['Subject'] = Header(decodeChars(subject), 'UTF-8')
>
> def decodeChars(str=""):
>     if not str: return None
>     for characterCode in _characterCodes.keys():
>         str = str.replace(characterCode, _characterCodes[characterCode])
>     return str
>
> Now as you have noticed, this only works, ie, I get an email sent with the
> i18n characters displayed correctly, after I pretty much wrote my own
> 'urldecode' map
>
> _characterCodes ={ "%80" : "�", "%82" : "�", "%83" :
> "�", "%84" : "�", \
> "%85" : "�", "%86" : "�", "%87" :
> "�", "%88" : "�", \
> "%89" : "�", "%8A" : "�", "%8B" :
> "�", "%8C" : "�", \
> "%8E" : "�", "%91" : "�", "%92" :
> "�", "%93" : "�", \
> "%94" : "�", "%95" : "�", "%96" :
> "�", "%97" : "�", \
> ...
>
> Which feels like an horrible kludge.
>
This _characterCodes map replaces chars in the range 80-9F with a
Unicode "undefined" marker, so I don't understand how using it gives you
a correct result.
> Note that using urllib.unquote doesn't do it - I get an error saying that it
> is unable to . Replacing my decodeChars
>
> msg = MIMEText(urllib.unquote(body), 'plain', 'UTF-8')
>
> Returns content with i18n characters mangled.
>
From the selection of characters you have chosen to replace, my guess
is that your source data is urlencoded Cp1252, not urlencoded UTF-8. So
when you unquote it and then call it UTF-8, which is what the above code
does, you get incorrect display. What happens if you change UTF-8 to
Cp1252 in the call to MIMEText?
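
To make the mismatch concrete, here is a minimal sketch. It assumes a current Python 3 (the code in this thread is Python 2), and the sample string is made up for illustration, not taken from the real mail data. It unquotes percent-escapes that were produced from Cp1252 bytes and then reads the same bytes under both encodings:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical sample: "café €5" percent-encoded from its Cp1252 bytes.
quoted = "caf%E9%20%805"

# Unquoting only turns %XX escapes back into raw bytes; it says nothing
# about which encoding those bytes are in.
raw = unquote_to_bytes(quoted)

# Read as Cp1252, the bytes give the intended text.
as_cp1252 = raw.decode("cp1252")

# Read as UTF-8, 0xE9 and 0x80 are invalid here, so each one is
# replaced by U+FFFD, the replacement character.
as_utf8 = raw.decode("utf-8", errors="replace")

print(repr(as_cp1252))
print(repr(as_utf8))
```

Labelling Cp1252 bytes as UTF-8 is the same mistake seen from the mail reader's side: the bytes are fine, the label is wrong.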
> Using unicode(body, 'latin-1').encode('utf-8') doesn't work either. Besides,
> am I the only one to feel that if I want to encode something in UTF-8 it
> doesn't feel intuitive to have to convert to latin-1 first and then encode?
>
It doesn't work because the urlencoded text is ASCII, not latin-1. I
suspect that
unicode(urllib.unquote(body), 'Cp1252').encode('UTF-8')
would give you what you want.
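
For readers on Python 3, the same idea could be sketched roughly as follows. The helper names unquote_cp1252 and build_message are hypothetical, and the sketch assumes the guess above is right, i.e. that the escapes really do encode Cp1252 bytes. In Python 3, MIMEText and Header accept text and handle the UTF-8 encoding themselves, so no explicit encode step is needed:

```python
from email.header import Header
from email.mime.text import MIMEText
from urllib.parse import unquote_to_bytes


def unquote_cp1252(quoted):
    # Percent-escapes -> raw bytes -> text, ASSUMING the bytes are Cp1252.
    return unquote_to_bytes(quoted).decode("cp1252")


def build_message(quoted_subject, quoted_body):
    # Hypothetical helper: decode both parts, then let the email package
    # encode them as UTF-8 for transport.
    msg = MIMEText(unquote_cp1252(quoted_body), "plain", "utf-8")
    msg["Subject"] = Header(unquote_cp1252(quoted_subject), "utf-8")
    return msg


msg = build_message("Caf%E9", "Prix%3A%20%805")
print(msg.as_string())
```

Note that a few byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unassigned in Python's cp1252 codec, so decode() raises UnicodeDecodeError if the data contains them.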
> Any ideas? I am dry on other option and really don't want to keep my kludge
> (unless I absolutely have to)
>
Post some of your actual data; it will be obvious whether it is encoded
from Cp1252 or UTF-8.
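
If you would rather check the data yourself, one rough heuristic (a sketch in Python 3, not a reliable detector) is to unquote to bytes and attempt a strict UTF-8 decode. Text containing non-ASCII characters in Cp1252 almost never happens to be valid UTF-8, so a failure is a strong hint. The function name guess_encoding is hypothetical, and falling back to Cp1252 is an assumption based on the data described in this thread:

```python
from urllib.parse import unquote_to_bytes


def guess_encoding(quoted):
    raw = unquote_to_bytes(quoted)
    try:
        raw.decode("utf-8")  # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        # ASSUMPTION: anything that is not valid UTF-8 is treated as Cp1252.
        return "cp1252"
    # Pure ASCII also lands here; it decodes the same way under both.
    return "utf-8"


print(guess_encoding("caf%C3%A9"))  # "é" sent as two UTF-8 bytes
print(guess_encoding("caf%E9"))     # "é" sent as one Cp1252 byte
```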
Keep trying, it's worth it to actually understand what is going on.
Trying to solve encoding problems when you don't understand the basic
issues is unlikely to give a good solution.
Kent