How do I decode unicode characters in the subject using email.message_from_string()?

Roy H. Han starsareblueandfaraway at gmail.com
Wed Feb 25 09:09:49 EST 2009


Thanks for writing back, RDM and John Machin.  Tomorrow I'll try the
code you suggested, RDM.  It looks quite helpful and I'll report the
results.

In the meantime, John asked for more data.  The sender's email client
is Microsoft Outlook 11.  The recipient email client is Lotus Notes.



Actual Subject
=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=

Expected Subject
Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records

X-Mailer
Microsoft Office Outlook 11

X-MimeOLE
Produced By Microsoft MimeOLE V6.00.2900.5579
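
For reference, here is a minimal sketch of decoding the subject above with
the standard email.header helpers. It assumes Python 3 (the code elsewhere in
this thread is Python 2), and the names raw_subject and decode_subject are
only illustrative. The "__" in the encoded word stands for two spaces, so the
sketch collapses whitespace runs before printing; after that, the result
should match the expected subject.

```python
from email.header import decode_header, make_header

# The raw Subject value as reported above, including the folded line break.
raw_subject = (
    "=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?="
    "\r\n\t"
    "=?us-ascii?Q?m_C/SR_Records?="
)


def decode_subject(raw):
    """Decode RFC 2047 encoded words in a header value into one str."""
    # decode_header() yields (chunk, charset) pairs; make_header() reassembles
    # them into a Header object whose str() is the decoded text.
    return str(make_header(decode_header(raw)))


subject = decode_subject(raw_subject)

# Collapse runs of whitespace so the doubled space after "Tip:" does not
# get in the way when comparing with the subject as displayed.
normalized = " ".join(subject.split())
print(normalized)
```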



RHH



On Wed, Feb 25, 2009 at 8:39 AM,  <rdmurray at bitdance.com> wrote:
> John Machin <sjmachin at lexicon.net> wrote:
>> On Feb 25, 11:07 am, "Roy H. Han" <starsareblueandfara... at gmail.com>
>> wrote:
>> > Dear python-list,
>> >
>> > I'm having some trouble decoding an email header using the standard
>> > imaplib.IMAP4 class and email.message_from_string method.
>> >
>> > In particular, email.message_from_string() does not seem to properly
>> > decode unicode characters in the subject.
>> >
>> > How do I decode unicode characters in the subject?
>>
>> You don't. You can't. You decode str objects into unicode objects. You
>> encode unicode objects into str objects. If your input is not a str
>> object, you have a problem.
>
> I can't speak for the OP, but I had a similar (and possibly
> identical-in-intent) question.  Suppose you have a Subject line that
> looks like this:
>
>    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>
> How do you get the email module to decode that into unicode?  The same
> question applies to the other header lines, and the answer is it isn't
> easy, and I had to read and reread the docs and experiment for a while
> to figure it out.  I understand there's going to be a sprint on the
> email module at PyCon; maybe some of this will get improved then.
>
> Here's the final version of my test program.  The third to last line is
> one I thought ought to work given that Header has a __unicode__ method.
> The final line is the one that did work (note the kludge to turn None
> into 'ascii'...IMO 'ascii' is what decode_header _should_ be returning,
> and this code shows why!)
>
> -------------------------------------------------------------------
> from email import message_from_string
> from email.header import Header, decode_header
>
> x = message_from_string("""\
> To: test
> Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>
> this is a test.
> """)
>
> print x
> print "--------------------"
> for key, header in x.items():
>    print key, 'type', type(header)
>    print key+":", unicode(Header(header)).decode('utf-8')
>    print key+":", decode_header(header)
>    print key+":", ''.join([s.decode(t or 'ascii') for (s, t) in decode_header(header)]).encode('utf-8')
> -------------------------------------------------------------------
>
>
>    From nobody Wed Feb 25 08:35:29 2009
>    To: test
>    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=
>            =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>
>    this is a test.
>
>    --------------------
>    To type <type 'str'>
>    To: test
>    To: [('test', None)]
>    To: test
>    Subject type <type 'str'>
>    Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
>    Subject: [("'u' Obselete type", None), ("-- it is identical to 'd'. (7)", 'iso-8859-1')]
>    Subject: 'u' Obselete type-- it is identical to 'd'. (7)
>
>
> --RDM
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
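
The join over decode_header() in the quoted program can also be sketched for
Python 3, where the details differ slightly: as far as I can tell,
decode_header() there returns bytes chunks for headers that contain encoded
words, but returns the original str untouched when there are none, so the
sketch checks the type before decoding. The helper name header_to_text is
hypothetical, and the fallback from None to 'ascii' is the same kludge as in
the quoted code.

```python
from email import message_from_string
from email.header import decode_header


def header_to_text(value):
    """Join the chunks from decode_header() into a single str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # A charset of None means the chunk was not an encoded word;
            # treat it as ASCII, as the quoted program does.
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


msg = message_from_string("""\
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=   =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.
""")

# Print every header with its encoded words decoded.
for key, value in msg.items():
    print(key + ":", header_to_text(value))
```

Whether the space between the plain text and the first encoded word survives
depends on the Python version, so it is worth checking the output on your own
interpreter before relying on exact spacing.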


