How do I decode unicode characters in the subject using email.message_from_string()?

Steve Holden steve at holdenweb.com
Wed Feb 25 09:20:21 EST 2009


Roy H. Han wrote:
> On Wed, Feb 25, 2009 at 8:39 AM,  <rdmurray at bitdance.com> wrote:
[Top-posting corrected]
>> John Machin <sjmachin at lexicon.net> wrote:
>>> On Feb 25, 11:07 am, "Roy H. Han" <starsareblueandfara... at gmail.com>
>>> wrote:
>>>> Dear python-list,
>>>>
>>>> I'm having some trouble decoding an email header using the standard
>>>> imaplib.IMAP4 class and email.message_from_string method.
>>>>
>>>> In particular, email.message_from_string() does not seem to properly
>>>> decode unicode characters in the subject.
>>>>
>>>> How do I decode unicode characters in the subject?
>>> You don't. You can't. You decode str objects into unicode objects. You
>>> encode unicode objects into str objects. If your input is not a str
>>> object, you have a problem.
>> I can't speak for the OP, but I had a similar (and possibly
>> identical-in-intent) question.  Suppose you have a Subject line that
>> looks like this:
>>
>>    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>>
>> How do you get the email module to decode that into unicode?  The same
>> question applies to the other header lines, and the answer is it isn't
>> easy, and I had to read and reread the docs and experiment for a while
>> to figure it out.  I understand there's going to be a sprint on the
>> email module at PyCon; maybe some of this will get improved then.
>>
>> Here's the final version of my test program.  The third to last line is
>> one I thought ought to work given that Header has a __unicode__ method.
>> The final line is the one that did work (note the kludge to turn None
>> into 'ascii'...IMO 'ascii' is what decode_header _should_ be returning,
>> and this code shows why!)
>>
>> -------------------------------------------------------------------
>> from email import message_from_string
>> from email.header import Header, decode_header
>>
>> x = message_from_string("""\
>> To: test
>> Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>>
>> this is a test.
>> """)
>>
>> print x
>> print "--------------------"
>> for key, header in x.items():
>>    print key, 'type', type(header)
>>    print key+":", unicode(Header(header)).decode('utf-8')
>>    print key+":", decode_header(header)
>>    print key+":", ''.join([s.decode(t or 'ascii') for (s, t) in decode_header(header)]).encode('utf-8')
>> -------------------------------------------------------------------
>>
>>
>>    From nobody Wed Feb 25 08:35:29 2009
>>    To: test
>>    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=
>>            =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>>
>>    this is a test.
>>
>>    --------------------
>>    To type <type 'str'>
>>    To: test
>>    To: [('test', None)]
>>    To: test
>>    Subject type <type 'str'>
>>    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>>    Subject: [("'u' Obselete type", None), ("-- it is identical to 'd'. (7)", 'iso-8859-1')]
>>    Subject: 'u' Obselete type-- it is identical to 'd'. (7)
>>
>>
> Thanks for writing back, RDM and John Machin.  Tomorrow I'll try the
> code you suggested, RDM.  It looks quite helpful and I'll report the
> results.
> 
> In the meantime, John asked for more data.  The sender's email client
> is Microsoft Outlook 11.  The recipient email client is Lotus Notes.
> 
> 
> 
> Actual Subject
> =?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=
> 
> Expected Subject
> Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records
> 
> X-Mailer
> Microsoft Office Outlook 11
> 
> X-MimeOLE
> Produced By Microsoft MimeOLE V6.00.2900.5579
> 
>>> from email.header import decode_header
>>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
[('Inteum C/SR User Tip:  Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
>>>
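
For anyone trying this on Python 3, here is a minimal sketch of the same
idea, starting from message_from_string() as in the original question.  It
assumes Python 3, where decode_header() returns bytes chunks for any header
containing encoded words and a plain str otherwise.  The helper name
decode_header_text is made up for illustration, and the message text is a
cut-down stand-in for the real one.

```python
from email import message_from_string
from email.header import decode_header


def decode_header_text(raw):
    """Return a header value as text (hypothetical helper, Python 3)."""
    pieces = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Same fallback as the kludge quoted above: None means ASCII.
            chunk = chunk.decode(charset or "ascii")
        pieces.append(chunk)
    return "".join(pieces)


# A folded Subject header, as Outlook produced it in the report above.
msg = message_from_string(
    "To: test\n"
    "Subject: =?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\n"
    "\t=?us-ascii?Q?m_C/SR_Records?=\n"
    "\n"
    "this is a test.\n"
)

subject = decode_header_text(msg["Subject"])
print(subject)
```

This should print the expected subject as one line, with the two spaces
after "Tip:" that the two underscores in the encoded form stand for.  The
'ascii' fallback will raise on unencoded 8-bit bytes, so real mail may need
a more forgiving error handler.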

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/



