[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Wed Oct 7 04:52:39 CEST 2009
On approximately 10/6/2009 5:30 PM, came the following characters from
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
> > Yes, I interpreted, possibly misinterpreted, Barry's comment about
> > storing things as bytes, as that he was figuring to store them in wire
> > format.
>
> What that means is unclear, though. Does a "header in wire format"
> mean before or after MIME encoding? Probably after, but that's pretty
> useless for the purpose of editing the header. Does it include the
> tag (the part before the colon) or not? Etc.
>
> > I would tend to agree with that, except that if something is
> > received/provided in a particular format, it might want to stay in that
> > format until such time it is needed in a different format... and then
> > the appropriate set of conversions (current format => internal format =>
> > needed format) applied as needed, avoiding all conversions when it is
> > already in the needed format.
>
> If you mean that the email module will keep track of what form the
> object is currently represented by, that will eventually result in
> "UnicodeError: octet out of range: 161, ascii".
>
The above sentence does not communicate your meaning to me... or any
meaning, actually. Can you explain?

If conversions are avoided, then octets are unlikely to be out of
range. And the email module must be aware of the form of the data in
order to manipulate it in any format other than wire format, but
fortunately, wire format declares the format of the data (not to say
there is no buggy wire format data -- but that is an issue best avoided
by avoiding as many conversions as possible).
> > two conversions are slower than none, and use 2-4 times the space in
> > string format.
>
> Let's get this correct, *then* optimize, please.
>
That's a nice platitude... I could have used it on you when you said:
> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.
but I didn't. You can't design things totally ignoring the reality of
time and space performance, and expect to get an efficient result. I
agree one can spend too much time on premature optimization issues, and
I have that tendency, but if you totally ignore time and space issues,
you wind up with Vista.
> > One has to write the conversion code anyway; it is just a matter of
> > where it is called. Once converted, meta data could be retained in its
> > natural format.
>
> Meta data for what? Why would you convert meta data?
>
Meta data for the email message... how many MIME parts, their
Content-Types, etc. These are small amounts of data, but reasonably
likely to be referenced multiple times during the message parsing or
creation and generation process. So once it is converted from wire
format, it should be kept in a useful format, as well as wire format.
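
A minimal sketch of keeping a piece of meta data in both forms, converting
only on first use. The class name LazyHeader and its attributes are
hypothetical, not part of the email module; it assumes the wire bytes are
pure ASCII with RFC 2047 encoded words.

```python
from email.header import decode_header, make_header
from functools import cached_property


class LazyHeader:
    """Hypothetical holder: wire bytes always kept, text decoded on demand."""

    def __init__(self, wire: bytes):
        self.wire = wire  # untouched wire format, ready to be re-sent as-is

    @cached_property
    def text(self) -> str:
        # Decoded once on first access, then cached for later references.
        # Assumption: the header value is ASCII on the wire.
        return str(make_header(decode_header(self.wire.decode("ascii"))))
```

If nobody ever asks for .text, no conversion happens at all; if it is asked
for ten times, the conversion still happens once.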
> > > 2. MUA #1: Composition. Input will be strings and multimedia file
> > > names, output will be bytes. Will attributes of message objects
> > > be manipulated? Not in a conventional MUA, but an email-based MUA
> > > might find uses for that.
> >
> > I'm not sure what an email-based MUA is.... seems to me even a
> > conventional MUA is "email-based"???
>
> Only if it's written using the Python email module.
>
Um. Aren't we talking about use cases for the Python email module? I
was trying to interpret what you were saying in that light. Sure, what
a conventional (not written using the Python email module) MUA does, is
mostly irrelevant, except insofar as it shows use cases that might be
applied to email-based (written using the Python email module) MUAs.
> > > 4. Mailing list processor. Message input will be bytes.
> > > Configuration input, including heading and footer texts that may
> > > be added are likely to be strings. Header manipulation (adding
> > > topics, sequence numbers, RFC 2369 headers) most conveniently done
> > > with strings. Output will be bytes.
> > >
> >
> > But the bulk of the message parts, received in wire format, may not need
> > to be altered to be sent along in the same wire format.
>
> That depends. For example, multimedia parts may simply be discarded,
> in which case it makes sense to not convert them. However, most
> Mailman lists do add a footer, and because of crappy Windows MUAs that
> don't implement MIME correctly, it's preferred to add that by
> concatenating as text. That simply cannot be done correctly in wire
> format for any character set except ISO 8859/1.
>
Huh?

First off, which "crappy Windows MUAs" don't implement MIME correctly,
and what do they do wrong? When I look at wire format emails, I'm
mostly appalled by the stuff generated by Apple Mail. I have seen a few
doozies from Outlook 2000, but they seem to be fixed in newer versions.

Adding a header or trailer does require knowledge of the character set
and encoding of the message. Given that, you can decode to str, add the
header or trailer and encode back to MIME. So that's the inefficient
proof of concept.
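
A rough sketch of that decode / append / re-encode path, using the Python 3
email package. The function name add_footer is made up for illustration, and
it assumes a single-part text message whose declared charset is correct.

```python
from email import message_from_bytes


def add_footer(raw: bytes, footer: str) -> bytes:
    """Decode a single-part text message, append a footer, re-encode."""
    msg = message_from_bytes(raw)
    # Assumption: the declared charset is right; fall back to ASCII if absent.
    charset = msg.get_content_charset() or "us-ascii"
    text = msg.get_payload(decode=True).decode(charset)
    # Drop the old transfer encoding so set_payload() picks one itself.
    del msg["Content-Transfer-Encoding"]
    msg.set_payload(text + footer, charset)
    return msg.as_bytes()
```

This converts the whole body twice, which is exactly the cost being argued
about, but it is correct for any charset/encoding pair the library knows.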

In the identity or quopri encodings, it is possible to add similarly
encoded headers and trailers correctly to text/plain parts through
normal concatenation.

Adding headers to base64 encoding requires that the encoded header be
an exact number of base64 lines, or at least that the unencoded header
be a multiple of 3 octets (so that it encodes to whole 4-character
groups with no padding), and that you shuffle the line layout through
the whole base64 body... it is not clear that this is worth the work.
Adding trailers to base64 encoding requires decoding the final partial
encoding, noticing how much room is left on that last line, and then
encoding from there on... so it is not possible to cache a single
encoded base64 footer, although it would be possible to cache 3 of
them (one per possible alignment), and only have to tweak the merge,
choose the right one of the three, and then reshuffle.

So since text/plain is seldom encoded in base64, and base64 is so
complex to concatenate to in wire format, I'd think it would be a better
choice to decode and reencode to concatenate headers or footers to
base64 encoded MIME parts.... unless immense base64 encoded MIME parts
are expected to be common enough to develop the optimized logic.
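
To make the base64 trailer case concrete, here is a sketch that decodes only
the last line of the body and re-encodes from there. The name
append_to_base64 is hypothetical, and it assumes every line except the last
holds whole 4-character groups with no padding (true for the usual
76-character layout).

```python
import base64

LINE_BYTES = 57  # 57 raw octets encode to one 76-character base64 line


def append_to_base64(body_b64: bytes, footer: bytes) -> bytes:
    """Append raw footer octets to a line-wrapped base64 body."""
    lines = body_b64.split()
    head, tail = lines[:-1], lines[-1:]
    # Only the final line can be short or padded, so decode just that one.
    tail_raw = base64.b64decode(b"".join(tail)) + footer
    # Re-encode the tail plus footer, keeping the 76-character line layout.
    new_tail = [base64.b64encode(tail_raw[i:i + LINE_BYTES])
                for i in range(0, len(tail_raw), LINE_BYTES)]
    return b"\n".join(head + new_tail) + b"\n"
```

The untouched lines are never decoded, so the cost depends on the footer
size rather than the body size. The footer octets must already be in the
part's charset.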

text/html is trickier, whether encoded or not. You have to parse past
any stuff that precedes <body>, and place the header after that, and
then you have to find the </body> and place the trailer before that.
And unless you run the HTML through a validity checker, you can't be
sure that the trailer will even show up, much less actually at the
bottom, due to the possibility of unclosed tags within the body.
Parsing even quopri encoded HTML in wire format gets tricky, and is
basically impossible for base64 encoded HTML. So the first text/html
part likely will need to be decoded for adding headers and trailers, if
it is an alternative to the text/plain part, or there is no text/plain
part.
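
A naive sketch of the trailer half of that, working on already-decoded HTML.
The function name insert_html_footer is made up; it does no validity
checking, so the caveat about unclosed tags applies in full.

```python
import re

# Matches a closing body tag in any letter case, e.g. </BODY> or </body >.
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def insert_html_footer(html: str, footer_html: str) -> str:
    """Place footer_html just before the last </body>, or append it."""
    matches = list(_BODY_CLOSE.finditer(html))
    if not matches:
        # No closing tag found: the best we can do is append at the end.
        return html + footer_html
    pos = matches[-1].start()
    return html[:pos] + footer_html + html[pos:]
```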

I've seen some systems add an additional MIME part to place a trailer
in, and that can be pretty effective for MUAs that will show multiple
parts in-line, but there are so many MUAs out there that it is
extremely difficult to make any certain declarations regarding what the
user sees as a result.
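
One way the extra-part approach might look with the Python 3 email package.
The function name wrap_with_footer_part is hypothetical; the sketch wraps the
original content in multipart/mixed and leaves its encoded body alone.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def wrap_with_footer_part(raw: bytes, footer: str) -> bytes:
    """Wrap a message in multipart/mixed and attach the footer as a part."""
    original = message_from_bytes(raw)
    outer = MIMEMultipart("mixed")
    # Copy the envelope-style headers (Subject, Received, ...) in order.
    for name, value in original.items():
        lowered = name.lower()
        if not lowered.startswith("content-") and lowered != "mime-version":
            outer[name] = value
    # What remains of the original becomes the first part: only its
    # Content-* headers and its still-encoded payload.
    for name in list(original.keys()):
        if not name.lower().startswith("content-"):
            del original[name]
    outer.attach(original)
    outer.attach(MIMEText(footer, "plain", "utf-8"))
    return outer.as_bytes()
```

The original body is never decoded here, whatever its charset or transfer
encoding; whether the footer is displayed in-line is up to the MUA.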

And ISO 8859/1 is an 8-bit character set, so it would require a
transfer encoding on a 7bit transport. But it is not unique; if you know
how to do ISO 8859/1 concatenation in wire format, then you can do the
whole class of ASCII+128 more character sets in the same manner. Not to
mention that ASCII itself works fine in wire format. And so does UTF-8.
It is just a matter of matching the character set and the encoding.
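
A sketch of wire-format concatenation for a quoted-printable part, using the
standard quopri module. The function name append_qp_footer is made up; the
caller is assumed to pass the charset the part actually declares.

```python
import quopri


def append_qp_footer(body_qp: bytes, footer: str, charset: str) -> bytes:
    """Append a footer to a quoted-printable body without decoding it."""
    # Match the part's charset first, then its transfer encoding.
    footer_qp = quopri.encodestring(footer.encode(charset))
    # Make sure the footer starts on its own line.
    if not body_qp.endswith(b"\n"):
        body_qp += b"\n"
    return body_qp + footer_qp
```

The same function serves ISO 8859/1, the other ASCII+128 sets, and UTF-8;
only the charset argument changes, and the encoded footer for a given
charset could be cached.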
> > Heading and footing texts are configured boilerplate, and could be
> > cached in a variety of formats to avoid the need to convert them for
> > each message,
>
> Premature optimization is the root of all error.
>
Yeah, yeah. I said "could", not "must". I was pushing back on your
declaration that:
> Configuration input, including heading and footer texts that may
> be added are likely to be strings.
Such configuration texts are likely to be provided as strings, but there
is nothing to prevent them from being converted to other formats.
Premature optimization may or may not be the root of all error, but
discarding perfectly valid design possibilities based on how the input
might be supplied seems a similar error. I'm not declaring which design
is best, just that there are alternatives.
> > An archiver could archive wire format,
>
> Are you suggesting that the email module should mandate that? We have
> a severe tail-dog inversion problem here.
Absolutely not. I said "could", not "must". The archiver can do what
it wants. The email library should provide access to the message data
in all useful formats, so that the archiver can do what it wants. The
archiver needs to choose its design and optimizations appropriate for
its expected use cases. I was pushing back on your declaration that
an archiver would always want string output.... you said:
> 5. Mailing list archiver. Input will be bytes or message objects,
> output will be strings (typically HTML documents or XML
> fragments).
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking