[Email-SIG] fixing the current email module

Glenn Linderman v+python at g.nevcal.com
Wed Oct 7 04:52:39 CEST 2009


On approximately 10/6/2009 5:30 PM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > Yes, I interpreted, possibly misinterpreted, Barry's comment about 
>  > storing things as bytes, as that he was figuring to store them in wire 
>  > format.
>
> What that means is unclear, though.  Does a "header in wire format"
> mean before or after MIME encoding?  Probably after, but that's pretty
> useless for the purpose of editing the header.  Does it include the
> tag (the part before the colon) or not?  Etc.
>
>  > I would tend to agree with that, except that if something is 
>  > received/provided in a particular format, it might want to stay in that 
>  > format until such time it is needed in a different format... and then 
>  > the appropriate set of conversions (current format => internal format => 
>  > needed format) applied as needed, avoiding all conversions when it is 
>  > already in the needed format.
>
> If you mean that the email module will keep track of what form the
> object is currently represented by, that will eventually result in
> "UnicodeError: octet out of range: 161, ascii".
>   

The above sentence does not communicate your meaning to me... or any 
meaning, actually.  Can you explain?
If conversions are avoided, then octets are unlikely to be out of 
range?  And the email module must be aware of the form of the data in 
order to manipulate it in any format other than wire format, but 
fortunately, wire format declares the format of the data (not to say 
there is not buggy wire format data -- but that is an issue best avoided 
by avoiding as many conversions as possible).

>  > two conversions are slower than none, and use 2-4 times the space in 
>  > string format.
>
> Let's get this correct, *then* optimize, please.
>   

That's a nice platitude... I could have used it on you when you said:
> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.
but I didn't.  You can't design things totally ignoring the reality of 
time and space performance, and expect to get an efficient result.  I 
agree one can spend too much time on premature optimization issues, and 
I have that tendency, but if you totally ignore time and space issues, 
you wind up with Vista.


>  > One has to write the conversion code anyway; it is just a matter of 
>  > where it is called.  Once converted, meta data could be retained in its 
>  > natural format.
>
> Meta data for what?  Why would you convert meta data?
>   

Meta data for the email message... how many MIME parts, their 
Content-Types, etc.  These are small amounts of data, but reasonably 
likely to be referenced multiple times during the message parsing or 
creation and generation process.  So once it is converted from wire 
format, it should be kept in a useful format, as well as wire format.


>  > > 2.  MUA #1: Composition.  Input will be strings and multimedia file
>  > >     names, output will be bytes.  Will attributes of message objects
>  > >     be manipulated?  Not in a conventional MUA, but an email-based MUA
>  > >     might find uses for that.
>  > 
>  > I'm not sure what an email-based MUA is.... seems to me even a 
>  > conventional MUA is "email-based"???
>
> Only if it's written using the Python email module.
>   

Um.  Aren't we talking about use cases for the Python email module?  I 
was trying to interpret what you were saying in that light.  Sure, what 
a conventional (not written using the Python email module) MUA does is 
mostly irrelevant, except insofar as it shows use cases that might be 
applied to email-based (written using the Python email module) MUAs.


>  > > 4.  Mailing list processor.  Message input will be bytes.
>  > >     Configuration input, including heading and footer texts that may
>  > >     be added are likely to be strings.  Header manipulation (adding
>  > >     topics, sequence numbers, RFC 2369 headers) most conveniently done
>  > >     with strings.  Output will be bytes.
>  > >   
>  > 
>  > But the bulk of the message parts, received in wire format, may not need 
>  > to be altered to be sent along in the same wire format.
>
> That depends.  For example, multimedia parts may simply be discarded,
> in which case it makes sense to not convert them.  However, most
> Mailman lists do add a footer, and because of crappy Windows MUAs that
> don't implement MIME correctly, it's preferred to add that by
> concatenating as text.  That simply cannot be done correctly in wire
> format for any character set except ISO 8859/1.
>   

Huh?

First off, which "crappy Windows MUAs" don't implement MIME correctly, 
and what do they do wrong?  When I look at wire format emails, I'm 
mostly appalled by the stuff generated by Apple Mail.  I have seen a few 
doozies from Outlook 2000, but they seem to be fixed in newer versions.

Adding a header or trailer does require knowledge of the character set 
and encoding of the message.  Given that, you can decode to str, add the 
header or trailer and encode back to MIME.  So that's the inefficient 
proof of concept.
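
A minimal sketch of that decode, concatenate, re-encode round trip. It
assumes a single-part text message and uses the present-day standard
library API (policy.default and EmailMessage), which post-dates this
thread; add_footer and the sample message are hypothetical names made up
for illustration.

```python
from email import message_from_bytes, policy


def add_footer(raw: bytes, footer: str) -> bytes:
    """Decode a single-part text message, append a footer, re-encode it."""
    # policy.default yields an EmailMessage with get_content()/set_content().
    msg = message_from_bytes(raw, policy=policy.default)
    charset = msg.get_content_charset() or "us-ascii"
    # get_content() undoes both the transfer encoding and the charset.
    text = msg.get_content()
    # set_content() drops the old Content-* headers and chooses a transfer
    # encoding that suits the new text.
    msg.set_content(text + footer, charset=charset)
    return msg.as_bytes()


# Hypothetical sample: Latin-1 text sent as quoted-printable.
sample = (
    b"From: someone@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"caf=E9 au lait\n"
)
result = add_footer(sample, "_______\nExample list footer\n")
roundtrip = message_from_bytes(result, policy=policy.default).get_content()
```

Note that the re-encoded message may come back with a different
Content-Transfer-Encoding than it arrived with; only the decoded text is
guaranteed to match.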

In the identity or quopri encodings, it is possible to add similarly 
encoded headers and trailers correctly to text/plain parts through 
normal concatenation.  Adding headers to base64 encoding requires that 
the encoded header be an exact number of base64 lines, or at least a 
multiple of 3 octets before encoding, and that you shuffle the line 
layout through the whole base64 body... it is not clear that this is 
worth the work.  Adding trailers to base64 encoding requires decoding 
the final partial encoding, noticing how much room is left on that last 
line, and then encoding from there on... so it is not possible to cache 
a single encoded base64 footer, although it would be possible to cache 
3 of them (one per possible leftover length), and only have to tweak 
the merge, choose the right one of the three, and then reshuffle.

So since text/plain is seldom encoded in base64, and base64 is so 
complex to concatenate to in wire format, I'd think it would be a 
better choice to decode and reencode to concatenate headers or footers 
to base64 encoded MIME parts.... unless immense base64 encoded MIME 
parts are expected to be common enough to develop the optimized logic.
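
To make the base64 arithmetic concrete, here is a rough sketch. It
assumes unwrapped base64 (no 76-column line breaks), so the "reshuffle
the line layout" step is deliberately left out; prepend_b64 and
append_b64 are hypothetical helper names.

```python
import base64


def prepend_b64(header: bytes, encoded_body: bytes) -> bytes:
    """Prepend a header to an unwrapped base64 body without decoding it."""
    # Only safe when the header fills whole 3-octet groups; otherwise its
    # "=" padding would land in the middle of the stream.
    if len(header) % 3 != 0:
        raise ValueError("header must be a multiple of 3 octets")
    return base64.b64encode(header) + encoded_body


def append_b64(encoded_body: bytes, trailer: bytes) -> bytes:
    """Append a trailer, decoding only the final 4-character group."""
    head, last_group = encoded_body[:-4], encoded_body[-4:]
    leftover = base64.b64decode(last_group)  # at most 3 octets
    return head + base64.b64encode(leftover + trailer)


body = b"Hello, list members!\n"
encoded = base64.b64encode(body)
with_header = prepend_b64(b"[foo] ", encoded)           # 6 octets: fine
with_trailer = append_b64(encoded, b"-- \nlist footer\n")
```

The result of append_b64 depends on the trailer and on how many octets
were left over in the last group, which is why a single pre-encoded
footer cannot simply be glued on.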

text/html is trickier, whether encoded or not.  You have to parse past 
any stuff that precedes <body>, and place the header after that, and 
then you have to find the </body> and place the trailer before that.  
And unless you run the HTML through a validity checker, you can't be 
sure that the trailer will even show up, much less actually at the 
bottom, due to the possibility of unclosed tags within the body.  To 
parse even quopri encoded HTML gets tricky, and basically impossible for 
base64 encoded HTML.  So the first text/html part likely will need to be 
decoded for adding headers and trailers, if it is an alternative to the 
text/plain part, or there is no text/plain part.
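
A naive sketch of the text/html case, working on already-decoded HTML.
It only looks for the body tags with regular expressions and does no
validity checking, so the caveat about unclosed tags applies in full;
wrap_html is a hypothetical name.

```python
import re

_BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def wrap_html(html: str, header_html: str, trailer_html: str) -> str:
    """Insert a header after <body> and a trailer before the last </body>."""
    # Insert the trailer first: it sits later in the text, so doing so
    # does not shift the position of the opening tag.
    closes = list(_BODY_CLOSE.finditer(html))
    cut = closes[-1].start() if closes else len(html)
    html = html[:cut] + trailer_html + html[cut:]
    # Fall back to the very start when there is no <body> tag at all.
    opened = _BODY_OPEN.search(html)
    cut = opened.end() if opened else 0
    return html[:cut] + header_html + html[cut:]


page = '<html><head></head><BODY class="x"><p>hi</p></BODY></html>'
wrapped = wrap_html(page, "<p>H</p>", "<p>T</p>")
```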

I've seen some systems add an additional MIME part to place a trailer 
in, and that can be pretty effective for MUAs that will show multiple 
parts in-line, but there are so many MUAs out there, that it is 
extremely difficult to make any certain declarations regarding what the 
user sees as a result.
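
For illustration, a minimal sketch of the separate-part approach using
the standard email.mime classes. It builds a fresh multipart/mixed
message rather than rewrapping a received one, and with_footer_part is a
hypothetical name.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def with_footer_part(body_text: str, footer_text: str) -> MIMEMultipart:
    """Put the footer in its own text/plain part after the body part."""
    outer = MIMEMultipart("mixed")
    outer.attach(MIMEText(body_text, "plain", "utf-8"))
    # The footer has its own charset and transfer encoding, so the body
    # part never has to be decoded or re-encoded.
    outer.attach(MIMEText(footer_text, "plain", "utf-8"))
    return outer


message = with_footer_part("Hello, list.\n", "Example list footer\n")
wire = message.as_bytes()
```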

And, ISO 8859/1 is an 8-bit character set, so would require encoding on 
a 7bit transfer.  But it is not unique; if you know how to do ISO 8859/1 
concatenation in wire format, then you can do the whole class of 
ASCII+128 more character sets in the same manner.  Not to mention that 
ASCII itself works fine in wire format.  And so does UTF-8.  It is just 
a matter of matching the character set and the encoding.
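
A small sketch of what "matching the character set and the encoding"
means in practice, using the standard quopri module. The sample strings
are made up; the body is assumed to end at a line boundary.

```python
import quopri

body = "Grüße aus München\n"
footer = "--\nListe für Python-Anwender\n"

# Same charset, same transfer encoding: plain concatenation of the
# encoded pieces decodes to the concatenation of the texts.
for charset in ("iso-8859-1", "utf-8"):
    wire = (quopri.encodestring(body.encode(charset))
            + quopri.encodestring(footer.encode(charset)))
    assert quopri.decodestring(wire).decode(charset) == body + footer

# Mismatched charsets: a UTF-8 footer glued onto a Latin-1 body is
# garbled when the whole part is read as Latin-1.
mixed = body.encode("iso-8859-1") + footer.encode("utf-8")
garbled = mixed.decode("iso-8859-1")
```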


>  > Heading and footing texts are configured boilerplate, and could be 
>  > cached in a variety of formats to avoid the need to convert them for 
>  > each message,
>
> Premature optimization is the root of all error.
>   

Yeah, yeah.  I said "could", not "must".  I was pushing back from your 
declaration that:

>     Configuration input, including heading and footer texts that may
>     be added are likely to be strings.  

Such configuration texts are likely to be provided as strings, but there 
is nothing to prevent them from being converted to other formats.  
Premature optimization may or may not be the root of all error, but 
discarding perfectly valid design possibilities based on how the input 
might be supplied seems a similar error.  I'm not declaring which design 
is best, just that there are alternatives.
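
One such alternative, sketched under assumptions: the footer is
configured once as a string and its wire-format variants are cached per
(charset, transfer encoding). encoded_footer is a hypothetical helper,
and base64 is left out because an appended base64 footer depends on the
body it follows.

```python
import quopri
from functools import lru_cache


@lru_cache(maxsize=None)
def encoded_footer(footer: str, charset: str, cte: str) -> bytes:
    """Return the footer in wire format, converting at most once per variant."""
    raw = footer.encode(charset)
    if cte in ("7bit", "8bit"):
        return raw                      # identity encodings
    if cte == "quoted-printable":
        return quopri.encodestring(raw)
    raise ValueError("no cached wire form for " + cte)


first = encoded_footer("Liste für alle\n", "iso-8859-1", "quoted-printable")
second = encoded_footer("Liste für alle\n", "iso-8859-1", "quoted-printable")
```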

>  > An archiver could archive wire format,
>
> Are you suggesting that the email module should mandate that?  We have
> a severe tail-dog inversion problem here.

Absolutely not.  I said "could", not "must".  The archiver can do what 
it wants.  The email library should provide access to the message data 
in all useful formats, so that the archiver can do what it wants.  The 
archiver needs to choose its design and optimizations appropriate for 
its expected use cases.  I was pushing back from your declaration that 
an archiver would always want string output.... you said:

> 5.  Mailing list archiver.  Input will be bytes or message objects,
>     output will be strings (typically HTML documents or XML
>     fragments).

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


