[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Tue Oct 6 21:14:37 CEST 2009
On approximately 10/6/2009 7:18 AM, came the following characters from
the keyboard of Stephen J. Turnbull:
> In the following I use Python 3 terminology: strings are Python
> Unicode objects, and bytes are Python bytes objects.
>
> Glenn Linderman writes:
>
> > Email messages are bytes. Usually restricted to bytes in the range
> > 32-127, but sometimes permitted to be 0-255 (8bit encoding).
>
> This is irrelevant to our internal representation. It is both trivial
> and efficient to convert the wire format (bytes) to a string
> internally (at least for email messages up to say 5MB).
>
> Which internal representation makes the most sense depends on what we
> are going to do with that internal representation. At this point I'm
> not sure that strings are better than bytes, but I'm quite sure that
> I've seen no convincing argument that bytes are TOOWTDI.
>
> Nor is it at all obvious to me that messages should be stored in wire
> format.
>
Yes. I interpreted (possibly misinterpreted) Barry's comment about
storing things as bytes to mean that he was planning to store them in
wire format.
> > Using any other format than email format, means knowing how to
> > translate that format to/from email format, and to/from API
> > format... this means coding two translation routines instead of
> > one.
>
> That sounds reasonable, but it's a false economy.
And this was actually the point I was trying to make.
> The formats you're
> talking about here are the transfer encodings, and we need to be able
> to decode all of them, and produce all of them. Internally, they can
> be represented by a single format, so you need internal-to-transfer
> and transfer-to-internal for about six of them (7bit, 8bit, binary ==
> Python bytes, BASE64, quoted-printable, Python string)
>
Not all formats apply to all MIME types, but I think you've enumerated
the list.
> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.
>
I would tend to agree with that, except that something received or
provided in a particular format could stay in that format until it is
needed in a different one. Only then would the appropriate conversions
(current format => internal format => needed format) be applied, and no
conversion at all would happen when the data is already in the needed
format.
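A minimal sketch of that rule, assuming the internal format is plain
decoded bytes. The CODECS table and the convert() helper are
hypothetical names for illustration, not part of the email module; only
the base64 and quopri calls are real standard-library functions.

```python
import base64
import quopri


def _identity(data: bytes) -> bytes:
    # 7bit, 8bit and binary carry the payload unchanged.
    return data


# Hypothetical table: format name -> (to_internal, from_internal).
CODECS = {
    "7bit": (_identity, _identity),
    "8bit": (_identity, _identity),
    "binary": (_identity, _identity),
    "base64": (base64.b64decode, base64.encodebytes),
    "quoted-printable": (quopri.decodestring, quopri.encodestring),
}


def convert(data: bytes, current: str, needed: str) -> bytes:
    """Return data in the needed format, converting only if it differs."""
    if current == needed:
        return data  # already in the needed format: no conversion at all
    to_internal, _ = CODECS[current]
    _, from_internal = CODECS[needed]
    # current format => internal format => needed format
    return from_internal(to_internal(data))
```

A part that arrives as base64 and leaves as base64 is handed back
untouched; the two-step path only runs when the formats differ.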
> > The choice of email RFC byte formats
>
> By "byte format", do you mean "wire format"?
>
Sure, RFC byte formats == wire format.
> > for the internal form makes it quick and easy to produce a complete
> > message when called for,
>
> Only for certain kinds of messages, such as automated forwards and
> signed MIME parts, and cron's messages. For those, there are great
> advantages to spewing things verbatim as you got them off the wire or
> the disk. But even there, as long as we use the natural embedding of
> bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
> particularly inefficient to use strings.
>
Two conversions are slower than none, and the data takes 2-4 times the
space in string format.
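For anyone who wants to measure this, here is a small sketch of the two
conversions in question, assuming the ISO 8859-1 embedding described
above. It checks that the round trip is lossless and reports the size of
each form; the str figure depends on the interpreter version and build,
so no particular ratio is claimed here.

```python
import sys

# Every byte value, as 8bit or binary data might arrive off the wire.
wire = bytes(range(256)) * 4

# Conversion 1: bytes -> str via the "natural embedding" (ISO 8859-1).
text = wire.decode("latin-1")

# Conversion 2: str -> bytes when the message is generated again.
regenerated = text.encode("latin-1")

# The embedding maps byte value N to code point N, so nothing is lost.
lossless = regenerated == wire

# Storage cost of each form as reported by this interpreter.
bytes_size = sys.getsizeof(wire)
str_size = sys.getsizeof(text)
print(lossless, bytes_size, str_size)
```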
> For anything else, storing in wire format is going to require checking
> format (of the stored data if the format is variable, and always of
> the requesting API) on all attribute accesses, and conversions on
> many, even most attribute accesses.
>
One has to write the conversion code anyway; it is just a matter of
where it is called. Once converted, meta data could be retained in its
natural format.
> > One problem with storing messages in bytes format: it seems to me that
> > the choice of which of several legal email bytes formats
>
> None of them are very happy. The email module needs to be able to
> both read and produce all of 7bit, 8bit, and binary, and they are in
> fact pretty well trivial to do.
>
> So the question to me is "what are the primary use cases for the email
> module, and how do they affect the choice of internal representation?"
> I can't claim special expertise on "how", I'll leave that up to
> Barry. Here are some use cases I can think of.
>
Yes, this is a good question.
> 1. Debugging programs using the email module. Maybe that's a +1 for
> internally storing textual data in string form.
>
> 2. MUA #1: Composition. Input will be strings and multimedia file
> names, output will be bytes. Will attributes of message objects
> be manipulated? Not in a conventional MUA, but an email-based MUA
> might find uses for that.
>
I'm not sure what an email-based MUA is... it seems to me that even a
conventional MUA is "email-based"?
> 3. MUA #2: Reading. Input will often be bytes (spool files, IMAP
> data). Could be strings, though, depending on the internal format
> of folders. Output will be strings and multimedia objects. Lots
> of string processing, especially generating folder directory
> displays from message headers.
>
> 4. Mailing list processor. Message input will be bytes.
> Configuration input, including heading and footer texts that may
> be added are likely to be strings. Header manipulation (adding
> topics, sequence numbers, RFC 2369 headers) most conveniently done
> with strings. Output will be bytes.
>
But the bulk of the message parts, received in wire format, may not need
to be altered to be sent along in the same wire format. Headers must be
manipulated somehow, and I'd think strings would be convenient for that
too. Heading and footing texts are configured boilerplate. They could be
cached in a variety of formats to avoid converting them for each
message, then fetched from the cache in the format appropriate for a
particular message and prepended or appended as needed.
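The boilerplate cache could be sketched roughly as follows.
encoded_footer() and FOOTER are hypothetical names, and the footer text
is a placeholder; the assumption is that a footer is keyed by its
charset and transfer encoding, so each combination is converted once
and reused for later messages.

```python
import base64
import quopri
from functools import lru_cache


@lru_cache(maxsize=None)
def encoded_footer(text: str, charset: str, cte: str) -> bytes:
    """Hypothetical helper: footer bytes for one (charset, encoding) pair."""
    raw = text.encode(charset)
    if cte == "base64":
        return base64.encodebytes(raw)
    if cte == "quoted-printable":
        return quopri.encodestring(raw)
    return raw  # 7bit / 8bit / binary: the bytes are sent as they are


# Placeholder boilerplate, configured once as a string.
FOOTER = "Example list footer\n"
```

The first request for a given combination pays the conversion cost;
later requests with the same arguments return the cached bytes object.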
> 5. Mailing list archiver. Input will be bytes or message objects,
> output will be strings (typically HTML documents or XML
> fragments).
>
An archiver could archive wire format, and do the conversions to *ML on
the fly for those messages that might be accessed that way. Depends on
the expectation of the usage of the archiver... to retrieve the archived
messages via email, wire format could be extremely efficient; to
retrieve via HTTP, one should note that there is very little difference
between .eml format (another name for wire format) and .mhtml format
(which is a format IE and Opera will display natively, support in other
browsers varies, mostly via addons and conversion utilities). So I'm
not at all sure that this use case requires string output, although some
implementations might prefer it.
> 6. Spam/virus detection. Input may be bytes or message objects.
> Lots of internal string processing; in most cases the text/* parts
> need to be converted to strings before grepping; in some cases
> even images or executables may be reconstituted to look for
> malware signatures. Output may be a flag or signal, or the
> message itself may be edited (typically to provide headers
> recording degree of spamminess, trace headers, maybe a body
> heading; in some cases, a new message may be generated with the
> suspected spam as a message/rfc822 MIME body part).
>
So it seems to me that the way to minimize conversion (time) costs is to
store the data in the format provided, convert it to native format only
when requested (caching that result), and, when generating wire format,
convert only if the needed format was neither provided nor cached. This
technique would also maximally preserve the original format for use
cases 3 and 5; judging from past discussion, that seems to be important
to this list at least for use case 3.
To minimize memory (space) costs, the caching could be skipped (at the
price of reconversion), or, at the expense of not preserving the
original format, only the native format of an item could be retained
once it has been converted. The native format is generally the smallest
for binary objects; for text objects it is the most easily manipulated,
but not necessarily the smallest.
So I'd design the internal format with meta data like:

    MIMEpart
        formatFlag
        metaData
        7bitData
        8bitData
        binaryData
        nativeText
        nativeBLOB

where the metaData would consist of a variety of pertinent items,
obtained by decoding provided wireData or supplied along with provided
nativeData.
Generate could use 7bitData, 8bitData, or binaryData directly if it
exists, or cache it there if it didn't already exist.
binaryData would differ from nativeBLOB only by containing the
appropriate MIMEheaders... perhaps as a space optimization, it would
contain only the appropriate MIMEheaders, with the binaryData being
placed in nativeBLOB directly (since this is not a costly conversion,
just a choice of where to store the bytes).
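As a rough illustration of that layout, here is a sketch for a binary
part only. Everything in it is an assumption made for the example: the
class name MIMEPart, the wire dictionary keyed by transfer-encoding name
(standing in for the separate 7bitData/8bitData/binaryData slots, since
Python attribute names cannot start with a digit), and the omission of
MIME headers and nativeText handling.

```python
import base64
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MIMEPart:
    """Hypothetical part that keeps whatever form it was given."""
    meta_data: Dict[str, str] = field(default_factory=dict)
    wire: Dict[str, bytes] = field(default_factory=dict)  # encoding -> bytes
    native_blob: Optional[bytes] = None

    def get_native(self) -> bytes:
        # Decode lazily from a stored wire form, then cache the result.
        if self.native_blob is None:
            if not self.wire:
                raise ValueError("part holds no data in any format")
            if "base64" in self.wire:
                self.native_blob = base64.b64decode(self.wire["base64"])
            else:
                # 7bit, 8bit and binary carry the payload unchanged.
                self.native_blob = next(iter(self.wire.values()))
        return self.native_blob

    def generate(self, cte: str) -> bytes:
        # Use the wire form directly if it exists, else convert and cache.
        if cte not in self.wire:
            native = self.get_native()
            self.wire[cte] = (base64.encodebytes(native)
                              if cte == "base64" else native)
        return self.wire[cte]
```

A part built from base64 wire data returns those same bytes from
generate("base64") without any conversion; asking for another encoding
decodes once and stores the result alongside the original.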
It could also be possible that a complete, provided, wire format message
would be retained as a single BLOB, and the appropriate format data
items simply be offsets and lengths within that BLOB, although with
cached metaData.
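The single-BLOB variant might look something like this sketch, which
handles only a simple, non-multipart message. Span, find_spans() and
view() are hypothetical names; the only assumption about the message is
that headers and body are separated by a blank line (CRLF CRLF).

```python
from typing import NamedTuple, Tuple


class Span(NamedTuple):
    """Hypothetical (offset, length) pair pointing into the stored message."""
    offset: int
    length: int


def find_spans(message: bytes) -> Tuple[Span, Span]:
    """Locate the header block and the body of a non-multipart message."""
    sep = message.find(b"\r\n\r\n")
    if sep < 0:
        # No blank line found: treat everything as headers, empty body.
        return Span(0, len(message)), Span(len(message), 0)
    body_start = sep + 4
    return Span(0, sep), Span(body_start, len(message) - body_start)


def view(message: bytes, span: Span) -> memoryview:
    """Window onto one item of the retained BLOB, without copying it."""
    return memoryview(message)[span.offset:span.offset + span.length]
```

The message is stored once; each item is just two integers until some
caller actually needs its bytes.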
Of course, there is already a design within the existing code, and the
cost of wholesale redesign may be more than can be afforded.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking