[New-bugs-announce] [issue35547] email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers

Thu Dec 20 12:54:07 EST 2018

New submission from Martijn Pieters <mj at python.org>:

The From header in the following email headers is not decoded correctly. Both the Subject and From headers contain UTF-8 data encoded as RFC2047 encoded-words, and in both cases the bytes of a single multi-byte UTF-8 character have been split between two encoded-word tokens:

>>> msgdata = '''\
From: =?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=
 =?utf-8?b?seGbiw==?= <martijn at example.com>
Subject: =?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=
 =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?=
'''
>>> from io import StringIO
>>> from email.parser import Parser
>>> from email import policy
>>> msg = Parser(policy=policy.default).parse(StringIO(msgdata))
>>> print(msg['Subject'])  # correct
sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ
>>> print(msg['From'])  # incorrect
ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖ� �ᛋ <martijn at example.com>

Note the two FFFD placeholders in the From line.
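
To make the split concrete, here is a small sketch that decodes the two base64 payloads of the From header by hand, outside the email package. The variable names are illustrative only. It shows that each payload on its own ends or starts in the middle of a UTF-8 sequence, while the concatenated bytes decode cleanly:

```python
import base64

# The base64 payloads of the two encoded-words in the From header above.
first_word = "4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo="
second_word = "seGbiw=="

first_bytes = base64.b64decode(first_word)
second_bytes = base64.b64decode(second_word)

# The first payload ends with two bytes of a three-byte sequence,
# and the second payload starts with the missing third byte.
print(first_bytes[-2:], second_bytes[:1])

# Decoded separately, each half yields a U+FFFD replacement character.
separate = (
    first_bytes.decode("utf-8", "replace"),
    second_bytes.decode("utf-8", "replace"),
)
print(separate)

# Decoded as one byte string, strict decoding succeeds.
joined = (first_bytes + second_bytes).decode("utf-8")
print(joined)
```

This is why the encoded-words have to be joined with nothing in between before the final text is produced.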

The issue is that the raw values of the From and Subject headers both contain the folding space at the start of the continuation line:

>>> for name, value in msg.raw_items():
...     if name in {'Subject', 'From'}:
...         print(name, repr(value))
...
From '=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n =?utf-8?b?seGbiw==?= <martijn at example.com>'
Subject '=?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=\n =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?='

For the Subject header, _header_value_parser.get_unstructured is used, which *expects* there to be spaces between encoded words; it inserts EWWhiteSpaceTerminal tokens in between, which are turned into empty strings. The AddressHeader parser used for the From header does not do this: the space at the start of the line is retained, so the surrogate escapes at the end of one encoded-word and at the start of the next encoded-word never adjoin, and the later step that turns surrogates back into proper data fails.
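
The effect of the retained space can be sketched with plain str and bytes operations, without the email package. This is a simplified illustration, not the library's code: it assumes that undecodable bytes are carried as surrogateescape code points and later re-encoded and decoded with the 'replace' error handler, and the helper name roundtrip is hypothetical.

```python
def roundtrip(text):
    """Turn surrogate-escaped bytes back into text (assumed final step)."""
    return text.encode("utf-8", "surrogateescape").decode("utf-8", "replace")

# U+16B1 is b'\xe1\x9a\xb1' in UTF-8; split it as in the From header.
head = b"\xe1\x9a".decode("utf-8", "surrogateescape")  # end of first word
tail = b"\xb1".decode("utf-8", "surrogateescape")      # start of second word

# Surrogates adjoin: the original three bytes are restored and decode.
adjacent = roundtrip(head + tail)

# A space sits between them: neither fragment is valid on its own.
separated = roundtrip(head + " " + tail)

print(repr(adjacent))
print(repr(separated))
```

With the space in place, each fragment becomes a replacement character on its own, which matches the shape of the broken From output above.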

Since unstructured header parsing doesn't mind if a space is missing between encoded-word atoms, the work-around is to explicitly remove the space at the start of every line; this can be done in a custom policy:

import re
from email.policy import EmailPolicy

class UnfoldingHeaderEmailPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove the folding whitespace character at the start of each
        # continuation line before further processing
        value = re.sub(r'(?<=[\n\r])([\t ])', '', value)
        return super().header_fetch_parse(name, value)

custom_policy = UnfoldingHeaderEmailPolicy()

after which the From header comes out without placeholders:

>>> msg = Parser(policy=custom_policy).parse(StringIO(msgdata))
>>> msg['from']
'ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖᚱᛋ <martijn at example.com>'
>>> msg['subject']
'sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ'
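
For reference, a small sketch of what the substitution does to the raw From value shown earlier, run on its own with only the re module. The second step is an assumption about what happens afterwards: it mimics the removal of line separators that EmailPolicy.header_fetch_parse is understood to perform before parsing, using plain str.replace. The variable names are illustrative.

```python
import re

# Raw From value as printed by msg.raw_items() above.
raw_from = (
    "=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n"
    " =?utf-8?b?seGbiw==?= <martijn at example.com>"
)

# Step 1: the substitution from the custom policy. It drops the single
# whitespace character that follows a line break; the break itself stays.
no_fold_space = re.sub(r"(?<=[\n\r])([\t ])", "", raw_from)

# Step 2 (assumption): line separators are removed before parsing.
unfolded = no_fold_space.replace("\r", "").replace("\n", "")

print(repr(no_fold_space))
print(repr(unfolded))
```

After both steps the two encoded-words are directly adjacent, so their partial byte sequences can be rejoined.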

This issue was found by way of https://stackoverflow.com/q/53868584/100297

----------
messages: 332243
nosy: mjpieters
priority: normal
severity: normal
status: open
title: email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue35547>
_______________________________________