[Email-SIG] fixing the current email module

Barry Warsaw barry at python.org
Thu Oct 8 15:00:31 CEST 2009


On Oct 8, 2009, at 3:29 AM, Glenn Linderman wrote:

> Great anecdote!  Spammers shooting themselves in the foot with their  
> ignorance.

Indeed.  It constantly surprises me that spam would be so malformed,  
but I guess it could make perverse sense if, say, you were trying to  
DoS a spam filter.

> Seems to me that when there is an error in an encoded base64 MIME  
> part, such that it can't be base64 decoded, the options for the  
> library are:
> * return an error, the data is likely meaningless
> * allow the bytes to be retrieved, undecoded
> * I suppose it might be possible to skip only those 4-character  
>   sequences that don't decode properly, and try to decode the rest of  
>   the data, if it is text.    But some way to flag that data were  
>   undecodable would be needed.
> And if it is text, then it must then undergo charset decoding (below).

Note that while I'm adamant that the parser and generator not raise  
exceptions, what the model does is a different matter.  Ideally,  
accessing data from the model would never raise an exception either,  
but mutating the model could.  This is just basic Postel's Law.

> The application options are to drop the attachment, or pass through  
> the corrupted bytes, and let the next application try to make sense  
> of it.

Exactly, and it's not for the email package to say which is right.
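
As a rough sketch of that division of labor (this is not the email
package's API; `decode_base64_payload` is a made-up name for
illustration), the library could return either the decoded bytes or the
untouched input, together with a flag, and never raise:

```python
import base64
import binascii


def decode_base64_payload(raw):
    """Hypothetical helper: return (data, ok) and never raise.

    If `raw` is well-formed base64, `data` is the decoded bytes and
    `ok` is True.  Otherwise `data` is the original input, untouched,
    and `ok` is False so the application can decide what to do.
    """
    # MIME bodies are line-wrapped, so drop whitespace before validating.
    compact = b"".join(raw.split())
    try:
        return base64.b64decode(compact, validate=True), True
    except (binascii.Error, ValueError):
        return raw, False


data, ok = decode_base64_payload(b"aGVs\nbG8=")
broken, broken_ok = decode_base64_payload(b"!!!not base64!!!")
```

Whether the caller then drops the part or passes `broken` along
unchanged is entirely the application's decision.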

Here's a use case: I've got a Message that was parsed from wire input  
and I want to mangle the Subject heading to add the list prefix.  I  
know exactly what charset the prefix is in because that's data I  
control.  When I ask for the original Subject value, I'm handed an  
instance that I can use to try to figure out how to add the prefix.

First thing I'll ask it is "are you a single chunk in my prefix  
charset (or compatible)?"  If so, I can probably just prepend my  
prefix onto the value.  If not, "are you composed of multiple valid  
chunks in different charsets?"  If so, I know that I need to encode my  
prefix, but I can still prepend it to the header value (hopefully  
using the same API, and I don't care if the implementation can't  
use plain string concatenation).

If not, then what?  Maybe I don't care if some of the chunk charsets  
aren't known because I can still use the right encode+prepend  
strategy.  But if the header is a gobbledegook of 8-bit bytes?  I'm  
pretty sure I want to be able to ask the API if that's the case rather  
than get an exception.  The thing I'm not so sure about is what  
happens if my application is naive enough to just ask for the  
header as unicode and that conversion can't be made.  I /think/ it  
should raise an exception in that case.  But then when I ask for the  
header value as a mass of bytes, that should succeed and return me the  
raw input.
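
To make those questions concrete, here is a minimal sketch.  It assumes
the header value is exposed as a list of (bytes, charset) chunks; the
names `Chunk` and `prefix_strategy` are invented for illustration and
are not part of any existing API.  It also assumes us-ascii counts as
"compatible" with the prefix charset, which holds for utf-8 and the
iso-8859 family but not for every charset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical model of one piece of a header value."""
    raw: bytes
    charset: Optional[str]  # None means no charset was declared


def prefix_strategy(chunks: List[Chunk], prefix_charset: str) -> str:
    """Answer the questions a list manager would ask, without raising."""
    # Undeclared chunks containing 8-bit bytes are the gobbledegook case.
    for chunk in chunks:
        if chunk.charset is None and any(b > 0x7F for b in chunk.raw):
            return "raw-bytes"
    # Assumption: us-ascii is compatible with the prefix charset.
    compatible = {prefix_charset.lower(), "us-ascii"}
    charsets = {(c.charset or "us-ascii").lower() for c in chunks}
    if charsets <= compatible:
        return "prepend"
    # Several charsets, possibly unknown ones: encode the prefix first.
    return "encode-and-prepend"


plain = prefix_strategy([Chunk(b"Hello", None)], "utf-8")
mixed = prefix_strategy(
    [Chunk(b"Hello ", None), Chunk(b"\xc0 bient\xf4t", "iso-8859-1")],
    "utf-8",
)
junk = prefix_strategy([Chunk(b"\xff\xfe junk", None)], "utf-8")
```

The point is only that every branch returns an answer the application
can act on; none of them needs an exception.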

> And I agree that APIs to retrieve any MIME part as undecoded bytes  
> is appropriate; and to retrieve it as decoded strings is appropriate  
> for text MIME parts.  Not sure that non-text MIME parts need to  
> support being returned as strings.

I hate to open another can of worms, but I've been thinking about this  
a lot too :).  It's been discussed on list before, so nothing new  
here.  I think the parser and MIME classes need to be hookable for  
decoding their contents.  For example, if you have a text/* it might  
well make sense to support bytes() and str()/unicode() on the part  
instance.  But if it's image/* str() makes no sense.  part.decode() or  
something similar makes sense, but this needs to be extensible because  
the email package will not know how to convert every content-type.  At  
best it will only know how to decode content-types that Python's  
stdlib knows about.

The problem is that if the bytes came off the wire, the parser  
currently can only attach the most basic MIME base class.  It doesn't  
know that an image/png should create a MIMEImagePNG instance there.   
This is different from hacking the model directly, because there the  
application can instantiate the right class itself.  So the parser either has  
to have a hookable way for an application to go from content-type to  
class, or the generic MIME base class needs to be hookable in  
its .decode() method.
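
A sketch of the second alternative, a generic part whose .decode() is
hookable, might look like the following.  Everything here (`Part`,
`register_decoder`, `find_decoder`, the "text/*" wildcard key) is
hypothetical and only illustrates the shape of such a hook, not an
existing interface.

```python
# Hypothetical registry mapping a content-type to a decoder callable.
_decoders = {}


def register_decoder(content_type, func):
    """Register `func(part)` for an exact type or a 'maintype/*' key."""
    _decoders[content_type.lower()] = func


def find_decoder(content_type):
    """Prefer an exact match, then fall back to the maintype wildcard."""
    ctype = content_type.lower()
    maintype = ctype.split("/", 1)[0]
    for key in (ctype, maintype + "/*"):
        if key in _decoders:
            return _decoders[key]
    return None


class Part:
    """Stand-in for the generic MIME base class the parser attaches."""

    def __init__(self, content_type, payload, charset="us-ascii"):
        self.content_type = content_type
        self.payload = payload  # bytes, already transfer-decoded
        self.charset = charset

    def decode(self):
        decoder = find_decoder(self.content_type)
        if decoder is None:
            # Nobody knows this type: hand back the bytes unchanged.
            return self.payload
        return decoder(self)


# The package itself could ship a decoder for text/*.
register_decoder("text/*", lambda part: part.payload.decode(part.charset))

text = Part("text/plain", b"hello").decode()
image = Part("image/png", b"\x89PNG").decode()
```

An application that knows about PNGs would call
register_decoder("image/png", ...) with its own callable, and parts
parsed off the wire would pick it up without the parser needing a
dedicated class.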

> Headers could possibly be a quadruple instead of a triple, with the  
> 4th item being the wire format if received? (If constructed, no wire  
> format would be expected until it is generated.)  That would help  
> with idempotency, as if a header contains non-ASCII characters,  
> there are many choices of heuristic to encode that are all proper,  
> so it is unlikely two different algorithms would preserve idempotency.

I think not a quad.  I think other APIs should be used to extract the  
raw data, e.g.

 >>> # return a unicode or throw an exception
 >>> text = str(header)
 >>> # should always be okay even if gibberish
 >>> raw = bytes(header)

or /something/ like that.
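
Spelled out a little further, and assuming Python 3 semantics for str()
and bytes(), a header object along these lines would do it.  The class
name `HeaderValue` and its constructor are assumptions made for the
sake of the example.

```python
class HeaderValue:
    """Hypothetical header object that keeps the raw wire bytes."""

    def __init__(self, raw, charset="us-ascii"):
        self._raw = raw
        self._charset = charset

    def __bytes__(self):
        # Always succeeds, even if the content is gibberish.
        return self._raw

    def __str__(self):
        # May raise UnicodeDecodeError if the bytes don't fit the charset.
        return self._raw.decode(self._charset)


good = HeaderValue(b"[list] hello")
bad = HeaderValue(b"\xff\xfe gibberish")

text = str(good)
raw = bytes(bad)
try:
    str(bad)
    conversion_failed = False
except UnicodeDecodeError:
    conversion_failed = True
```

Keeping the wire bytes inside the object is also what would let a
generator reproduce the original header exactly, without a fourth
tuple element.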

-Barry
