[Email-SIG] Some parsing/generation issues of email in Python 3
Hans-Peter Jansen
hpj at urpla.net
Wed Jun 8 05:56:40 EDT 2016
Dear audience,
when coming back to this list, I couldn't believe my eyes at how low the
volume is, but after rechecking the archives I have to accept that it really
is that quiet here, a bit too quiet from my POV. Hmm.
Well, I'm currently replacing a special-purpose postfix email filter that
dates back to 2004 with a redeveloped Python 3 version.
Basically, all the old filter does (in pseudo code) is:
msg = email.message_from_file(fp)
processing(msg)
write(msg.as_string(True))
and it has done so for a few hundred million mails during that time.
I replaced it with:
import email, email.policy
from email.generator import BytesGenerator
msg = email.message_from_binary_file(fp, policy=email.policy.SMTP)
processing(msg)
BytesGenerator(pipe).flatten(msg)
Here, processing() mostly saves bodies and attachments, depending on pattern
matches, and adds some headers.
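For readers who want to try this at home, here is a minimal, self-contained
sketch of the new pipeline. The processing() body, the X-Filter-Processed
header name and the sample message are hypothetical stand-ins, not the real
filter; only the parse and flatten calls are the ones quoted above.

```python
import email
import email.policy
import io
from email.generator import BytesGenerator


def processing(msg):
    # Hypothetical stand-in: the real filter saves bodies and attachments
    # on pattern matches; here we only add an illustrative header.
    msg["X-Filter-Processed"] = "yes"


def run_filter(src, dst):
    # Parse from a binary stream, process, and write back as bytes.
    msg = email.message_from_binary_file(src, policy=email.policy.SMTP)
    processing(msg)
    BytesGenerator(dst).flatten(msg)


# Made-up sample message, standing in for what postfix would pipe in.
raw = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: test\r\n"
    b"\r\n"
    b"Hello\r\n"
)
out = io.BytesIO()
run_filter(io.BytesIO(raw), out)
result = out.getvalue()
```

In a real deployment src and dst would be the binary stdin and the pipe to
the next delivery stage instead of BytesIO objects.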
I was quite astonished to find that this procedure doesn't work that well
anymore: the email module appears to be far more fragile in its current state.
This is a bit disappointing, since the docs convey that some effort was put
into reliability and robustness. Given the much improved Unicode handling of
Python 3 itself and the ever-improving experience in handling emails, this
is contrary to my expectations, I have to confess.
Minutes after switching to the new code, I stumbled across a traceback in
msg.get_all('to') from a header like this:
To: unlisted-recipients: ;,
""@pop.kundenserver.de (no To-header on input)
Hmm, not nice. http://bugs.python.org/issue27257
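One possible way to keep a filter alive in the face of such headers is to
guard the header access and fall back to the unparsed source value. This is
only a sketch: safe_get_all() is a hypothetical helper, the sample message is
made up, and catching Exception is deliberately broad because the exact
exception type depends on the header and the Python version.

```python
import email
import email.policy


def safe_get_all(msg, name):
    """Return all values of a header, falling back to raw source strings."""
    try:
        # With the newer policies, values are parsed on access, so this
        # is where a malformed header can raise.
        return msg.get_all(name, [])
    except Exception:
        # raw_items() yields the stored (name, value) pairs without
        # running them through the policy's header parser.
        wanted = name.lower()
        return [v for k, v in msg.raw_items() if k.lower() == wanted]


raw = b"From: a@example.com\r\nTo: b@example.com\r\n\r\nHello\r\n"
msg = email.message_from_bytes(raw, policy=email.policy.SMTP)
recipients = [str(v) for v in safe_get_all(msg, "to")]
```

The fallback values are plain strings rather than structured header objects,
so any code that inspects addresses has to cope with both.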
Next, I was surprised to see arbitrary header data appearing in the body of
some mails in my MUA. I tracked it down to a mangled header that had lost its
proper indentation:
X-Microsoft-Exchange-Diagnostics:
=?utf-8?B?MTtCTDJQUjAyTUI1MTQ7MjM6bEtRRlNaUHQvVTk5WCttdktlOUVrUGQvVFBH?=
=?utf-8?B?cDFJemVUeXFzOGNzYnZOYWlwMDZpR0YzbXZyY09WaTBKM2pkeUl4S1VDMkxw?=
=?utf-8?B?eVRkNWthRW9waUhJTzczTWd5WDZOQ3hMNU1haGFvQTVzVTdRZmxJUnZlblpW?=
...
versus:
X-Microsoft-Exchange-Diagnostics:
1;BL2PR02MB514;23:lKQFSZPt/U99X+mvKe9EkPd/TPG
p1IzeTyqs8csbvNaip06iGF3mvrcOVi0J3jdyIxKUC2Lp
yTd5kaEopiHIO73MgyX6NCxL5MahaoA5sU7QflIRvenZV
Oh, well. http://bugs.python.org/issue27256
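Why lost indentation ends up as body text: a folded header is only continued
by lines that start with a space or tab, so an unindented line that does not
look like a header terminates the header block. The following sketch shows
that effect with a made-up X-Demo header and email.policy.default; it
illustrates the folding rule, not the refolding bug itself.

```python
import email
import email.policy

# Properly folded: continuation lines start with a space.
folded = (
    "Subject: demo\r\n"
    "X-Demo:\r\n"
    " part-one\r\n"
    " part-two\r\n"
    "\r\n"
    "Body\r\n"
)
# Same message with the leading space of each continuation line removed.
unindented = folded.replace("\r\n part-", "\r\npart-")

ok = email.message_from_string(folded, policy=email.policy.default)
bad = email.message_from_string(unindented, policy=email.policy.default)

header_value = str(ok["X-Demo"])  # continuation lines belong to the header
ok_body = ok.get_payload()        # only the real body
bad_body = bad.get_payload()      # header fragments have leaked into the body
```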
Before I added some code to circumvent those occurrences, I stumbled across a
traceback in flatten: http://bugs.python.org/issue27258
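Since a mail filter should never lose mail, one defensive option is to
serialize into a buffer first and pass the original bytes through untouched
if anything raises. A minimal sketch under that assumption; filter_bytes(),
tag() and boom() are hypothetical names, and the broad except is intentional.

```python
import email
import email.policy
import io
from email.generator import BytesGenerator


def filter_bytes(raw, processing):
    """Return the processed message, or the original bytes on any failure."""
    try:
        msg = email.message_from_bytes(raw, policy=email.policy.SMTP)
        processing(msg)
        buf = io.BytesIO()
        # Flatten into a buffer so a failure cannot leave a half-written
        # message on the output pipe.
        BytesGenerator(buf).flatten(msg)
        return buf.getvalue()
    except Exception:
        # Deliver the mail unmodified rather than drop it.
        return raw


def tag(msg):
    # Hypothetical processing step that succeeds.
    msg["X-Filter-Processed"] = "yes"


def boom(msg):
    # Hypothetical processing step that fails, to exercise the fallback.
    raise ValueError("simulated failure")


raw = b"From: a@example.com\r\nSubject: test\r\n\r\nHello\r\n"
processed = filter_bytes(raw, tag)
passed_through = filter_bytes(raw, boom)
```

The price is that the whole message is held in memory twice, which may matter
for large attachments.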
All these issues were harvested in less than half an hour. What really
troubles me is the quietness around here in the light of this experience.
Don't people use Python (3) yet/anymore for this kind of task? Does
anybody care? Am I missing something?
I will do my best to dive into these issues in the next days/weeks, but would
appreciate a dialog with somebody who is already involved in the email module
code.
Thanks,
Pete