[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Thu Oct 8 09:29:41 CEST 2009
On approximately 10/7/2009 7:40 PM, came the following characters from
the keyboard of Barry Warsaw:
> On Oct 7, 2009, at 6:33 AM, Stephen J. Turnbull wrote:
>> Haven't looked in your spam bucket recently, I guess. Spammers
>> regularly put 8 bit characters into headers (and into bodies in
>> messages without a Content-Type header), for one thing.
> Interesting story: Launchpad (which is open source now so there are no
> secrets) uses XMLRPC when Mailman holds a message for moderation,
> storing it in Launchpad's database for display to the list (team)
> owner. Well, I was lazy, stupid, or both and didn't wrap the objects
> in a Binary over the wire, so we were getting tons of failures here.
> But none of them seemed to have any practical effect on user
> experience (read: we got zero bug reports for missing held messages).
>
> I finally found the time to debug the problem, because the failures in
> themselves were cryptic and common enough to cause our operations
> people headaches. So I cowboyed in some additional capture code and
> ran it for 24 hours. Guess what I found?
>
> We were essentially crapping out on /tons/ of messages with 8-bit in
> headers, and these messages were basically getting dropped on the
> floor. Why no bug reports? Because /every/ single captured message
> was spam. How's that for a bug having unintended positive consequences?
Great anecdote! Spammers shooting themselves in the foot with their
ignorance. But still, much too much spam gets through.
Seems to me that when there is an error in an encoded base64 MIME part,
such that it can't be base64 decoded, the options for the library are:
  * return an error, the data is likely meaningless
  * allow the bytes to be retrieved, undecoded
I suppose it might be possible to skip only those 4-character sequences
that don't decode properly, and try to decode the rest of the data, if
it is text. But some way to flag that data were undecodable would be
needed.
And if it is text, then it must then undergo charset decoding (below).
The application options are to drop the attachment, or pass through the
corrupted bytes, and let the next application try to make sense of it.
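The two library options for base64 can be sketched roughly as follows. This is only a minimal illustration using the standard base64 module; the helper name decode_base64_part and its (decoded, raw, error) return shape are assumptions of mine, not anything the email package provides.

```python
import base64
import binascii


def decode_base64_part(payload: bytes):
    """Hypothetical helper: return (decoded, raw, error).

    decoded is None when the payload cannot be base64 decoded;
    raw is always the untouched payload, so the application can
    still pass the original bytes along if it chooses to.
    """
    # Line breaks are legal inside a base64 body, so drop all whitespace
    # before validating.
    compact = b"".join(payload.split())
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        decoded = base64.b64decode(compact, validate=True)
    except binascii.Error as exc:
        return None, payload, exc
    return decoded, payload, None
```

The application then decides: if error is set, drop the part or forward raw as-is.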
A quopri MIME part that can't be correctly decoded may still be mostly
readable... so here it makes sense to return an error but also the data,
decoded as best as possible. Application choices are basically the
same. Once quopri decoded, then text parts must also face charset
decoding (below).
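For quopri, "return an error but also the data" might look something like the sketch below. It assumes the standard quopri module, which decodes leniently rather than raising, so the defect detection here is my own regular expression and the helper name decode_quopri_part is hypothetical.

```python
import quopri
import re

# In quoted-printable, '=' must be followed by two hex digits or by a
# line break (a soft line break). Anything else is flagged as a defect.
_BAD_ESCAPE = re.compile(rb"=(?![0-9A-Fa-f]{2}|\r?\n)")


def decode_quopri_part(payload: bytes):
    """Hypothetical helper: return (decoded, defect_offsets).

    decoded is the best-effort result from quopri.decodestring();
    defect_offsets lists byte offsets in the payload of '=' signs
    that do not start a valid escape.
    """
    decoded = quopri.decodestring(payload)
    defects = [m.start() for m in _BAD_ESCAPE.finditer(payload)]
    return decoded, defects
```

An empty defect list means the part decoded cleanly; a non-empty one is the "error" to report alongside the mostly readable data.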
Charset decoding: a charset should be specified, or is assumed to be
ASCII by default. If a text MIME part that isn't in the right character
set gets decode errors, there are several possibilities:
  * return an error, and the decoded data, with error substitutions
  * allow the bytes to be retrieved
  * decode as Latin-1 (no errors possible, but probably results in mojibake)
The application options are to drop the attachment, or choose to pass
through one of the three data values.
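A rough sketch of offering all three data values at once, using only the built-in bytes.decode(). The DecodedText tuple and decode_text_part name are hypothetical, and falling back to ASCII with substitutions for an unknown charset is my assumption, not something settled in this thread.

```python
from typing import NamedTuple, Optional


class DecodedText(NamedTuple):
    substituted: str             # decoded, with U+FFFD where bytes were bad
    raw: bytes                   # the undecoded bytes
    latin1: str                  # Latin-1 decode; never fails, may be mojibake
    error: Optional[Exception]   # None if the declared charset decoded cleanly


def decode_text_part(raw: bytes, charset: str = "ascii") -> DecodedText:
    """Hypothetical helper returning every option discussed above."""
    error = None
    try:
        substituted = raw.decode(charset)
    except UnicodeDecodeError as exc:
        # Declared charset is known but the bytes do not fit it.
        error = exc
        substituted = raw.decode(charset, errors="replace")
    except LookupError as exc:
        # Declared charset is unknown to Python; assume ASCII instead.
        error = exc
        substituted = raw.decode("ascii", errors="replace")
    return DecodedText(substituted, raw, raw.decode("latin-1"), error)
```

The application can then check error and pick whichever of the three values it wants to pass through, or drop the part.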
For headers, the choices are basically the same as for text MIME parts,
but some headers that contain metadata (rather than just text like Subject:)
may be critical to proper decoding of other data, and so errors in some
headers can cause incorrect behaviour of other headers or of an
associated MIME part.
And I agree that an API to retrieve any MIME part as undecoded bytes is
appropriate, and one to retrieve it as a decoded string is appropriate for
text MIME parts. Not sure that non-text MIME parts need to support
being returned as strings.
Headers could possibly be a quadruple instead of a triple, with the 4th
item being the wire format if received? (If constructed, no wire format
would be expected until it is generated.) That would help with
idempotency, as if a header contains non-ASCII characters, there are
many choices of heuristic to encode that are all proper, so it is
unlikely two different algorithms would preserve idempotency.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking