[Email-SIG] fixing the current email module

Glenn Linderman v+python at g.nevcal.com
Tue Oct 6 21:14:37 CEST 2009


On approximately 10/6/2009 7:18 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> In the following I use Python 3 terminology: strings are Python
> Unicode objects, and bytes are Python bytes objects.
>
> Glenn Linderman writes:
>
>  > Email messages are bytes.  Usually restricted to bytes in the range 
>  > 32-127, but sometimes permitted to be 0-255 (8bit encoding).
>
> This is irrelevant to our internal representation.  It is both trivial
> and efficient to convert the wire format (bytes) to a string
> internally (at least for email messages up to say 5MB).
>
> Which internal representation makes the most sense depends on what we
> are going to do with that internal representation.  At this point I'm
> not sure that strings are better than bytes, but I'm quite sure that
> I've seen no convincing argument that bytes are TOOWTDI.
>
> Nor is it at all obvious to me that messages should be stored in wire
> format.
>   

Yes, I interpreted (possibly misinterpreted) Barry's comment about 
storing things as bytes to mean that he intended to store them in wire 
format.


>  > Using any other format than email format, means knowing how to
>  > translate that format to/from email format, and to/from API
>  > format... this means coding two translation routines instead of
>  > one.
>
> That sounds reasonable, but it's a false economy.  

And this was actually the point I was trying to make.


> The formats you're
> talking about here are the transfer encodings, and we need to be able
> to decode all of them, and produce all of them.  Internally, they can
> be represented by a single format, so you need internal-to-transfer
> and transfer-to-internal for about six of them (7bit, 8bit, binary ==
> Python bytes, BASE64, quoted-printable, Python string)
>   

Not all formats apply to all MIME types, but I think you've enumerated 
the list.
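
As a rough illustration of the "single internal format" idea, here is a 
minimal sketch that treats raw bytes as the internal form and converts 
to and from the transfer encodings.  The function names are 
hypothetical, not part of the email module, and the sketch does not 
check that 7bit data really is 7-bit clean.

```python
import base64
import quopri

IDENTITY = ("7bit", "8bit", "binary")  # encodings that leave the bytes unchanged


def transfer_to_internal(payload: bytes, cte: str) -> bytes:
    """Decode a body from its Content-Transfer-Encoding into raw bytes."""
    cte = cte.strip().lower()
    if cte in IDENTITY:
        return payload
    if cte == "base64":
        return base64.b64decode(payload)
    if cte == "quoted-printable":
        return quopri.decodestring(payload)
    raise ValueError(f"unknown transfer encoding: {cte!r}")


def internal_to_transfer(data: bytes, cte: str) -> bytes:
    """Encode raw bytes into the requested Content-Transfer-Encoding."""
    cte = cte.strip().lower()
    if cte in IDENTITY:
        return data
    if cte == "base64":
        return base64.encodebytes(data)
    if cte == "quoted-printable":
        return quopri.encodestring(data)
    raise ValueError(f"unknown transfer encoding: {cte!r}")
```

With one internal form, each encoding needs only one routine in each 
direction, which is the point Stephen makes above.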

> As for runtime economy, if conversion is done once at parse time and
> once at generate time it is not a big burden, not as compared to the
> overhead of the Python language itself.
>   

I would tend to agree with that, except that something 
received/provided in a particular format could stay in that format until 
it is needed in a different one.  Only then would the appropriate 
conversions (current format => internal format => needed format) be 
applied, and none at all when the data is already in the needed format.

>  > The choice of email RFC byte formats
>
> By "byte format", do you mean "wire format"?
>   

Sure, RFC byte formats == wire format.

>  > for the internal form makes it quick and easy to produce a complete
>  > message when called for,
>
> Only for certain kinds of messages, such as automated forwards and
> signed MIME parts, and cron's messages.  For those, there are great
> advantages to spewing things verbatim as you got them off the wire or
> the disk.  But even there, as long as we use the natural embedding of
> bytes in Unicode (ie, interpret bytes as ISO 8859/1) it's easy and not
> particularly inefficient to use strings.
>   

Two conversions are slower than none, and the string form uses 2-4 times 
the space of the bytes form.
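
For reference, the "natural embedding" Stephen mentions can be sketched 
as below: every byte value maps to the code point with the same number, 
so the round trip is lossless.  This only demonstrates the round trip; 
how much memory the string form takes depends on the Python 
implementation and is not measured here.

```python
# Every possible byte value, as it might appear in 8bit or binary wire data.
wire = bytes(range(256))

# ISO 8859-1 (latin-1) maps byte N to code point N, so nothing is lost.
as_text = wire.decode("latin-1")

# Converting back recovers the original bytes exactly.
round_tripped = as_text.encode("latin-1")
```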

> For anything else, storing in wire format is going to require checking
> format (of the stored data if the format is variable, and always of
> the requesting API) on all attribute accesses, and conversions on
> many, even most attribute accesses.
>   

One has to write the conversion code anyway; it is just a matter of 
where it is called.  Once converted, meta data could be retained in its 
natural format.

>  > One problem with storing messages in bytes format: it seems to me that 
>  > the choice of which of several legal email bytes formats
>
> None of them are very happy.  The email module needs to be able to
> both read and produce all of 7bit, 8bit, and binary, and they are in
> fact pretty well trivial to do.
>
> So the question to me is "what are the primary use cases for the email
> module, and how do they affect the choice of internal representation?"
> I can't claim special expertise on "how", I'll leave that up to
> Barry.  Here are some use cases I can think of.
>   

Yes, this is a good question.


> 1.  Debugging programs using the email module.  Maybe that's a +1 for
>     internally storing textual data in string form.
>
> 2.  MUA #1: Composition.  Input will be strings and multimedia file
>     names, output will be bytes.  Will attributes of message objects
>     be manipulated?  Not in a conventional MUA, but an email-based MUA
>     might find uses for that.
>   

I'm not sure what an email-based MUA is... it seems to me that even a 
conventional MUA is "email-based"?


> 3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
>     data).  Could be strings, though, depending on the internal format
>     of folders.  Output will be strings and multimedia objects.  Lots
>     of string processing, especially generating folder directory
>     displays from message headers.
>
> 4.  Mailing list processor.  Message input will be bytes.
>     Configuration input, including heading and footer texts that may
>     be added are likely to be strings.  Header manipulation (adding
>     topics, sequence numbers, RFC 2369 headers) most conveniently done
>     with strings.  Output will be bytes.
>   

But the bulk of the message parts, received in wire format, may not need 
to be altered to be sent along in the same wire format.  Headers must be 
manipulated somehow, and I'd think strings would be convenient for that 
too.  Heading and footing texts are configured boilerplate, so they 
could be cached in a variety of formats to avoid converting them for 
each message, then obtained from the cache in the appropriate format for 
the particular message and prepended or appended as appropriate.
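
A minimal sketch of such a boilerplate cache, assuming the footer is 
configured as a string and that each message body already ends with a 
line break.  FOOTER_TEXT, encoded_footer and append_footer are 
hypothetical names.  Base64 bodies are left out on purpose, since 
appending to them would mean re-encoding the body rather than simple 
concatenation.

```python
import quopri
from functools import lru_cache

# Hypothetical configured footer; a real list would read this from its config.
FOOTER_TEXT = "-- \nExample list footer\n"


@lru_cache(maxsize=None)
def encoded_footer(charset: str, cte: str) -> bytes:
    """Convert the footer once per (charset, transfer encoding) pair."""
    raw = FOOTER_TEXT.encode(charset)
    if cte in ("7bit", "8bit", "binary"):
        return raw
    if cte == "quoted-printable":
        return quopri.encodestring(raw)
    raise ValueError(f"footer cannot simply be appended for {cte!r}")


def append_footer(body: bytes, charset: str, cte: str) -> bytes:
    """Append the cached footer in the format the message body already uses."""
    return body + encoded_footer(charset, cte)
```

The message body itself is never converted here; only the footer is, and 
only the first time a given combination is seen.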

> 5.  Mailing list archiver.  Input will be bytes or message objects,
>     output will be strings (typically HTML documents or XML
>     fragments).
>   

An archiver could archive wire format, and do the conversions to *ML on 
the fly for those messages that are accessed that way.  It depends on 
how the archive is expected to be used.  To retrieve the archived 
messages via email, wire format could be extremely efficient.  To 
retrieve via HTTP, note that there is very little difference between 
.eml format (another name for wire format) and .mhtml format (which IE 
and Opera display natively; support in other browsers varies, mostly 
via addons and conversion utilities).  So I'm not at all sure that this 
use case requires string output, although some implementations might 
prefer it.

> 6.  Spam/virus detection.  Input may be bytes or message objects.
>     Lots of internal string processing; in most cases the text/* parts
>     need to be converted to strings before grepping; in some cases
>     even images or executables may be reconstituted to look for
>     malware signatures.  Output may be a flag or signal, or the
>     message itself may be edited (typically to provide headers
>     recording degree of spamminess, trace headers, maybe a body
>     heading; in some cases, a new message may be generated with the
>     suspected spam as a message/rfc822 MIME body part).
>   


So it seems to me that the way to minimize conversion (time) costs is to 
store the data in the format provided, convert it to native format when 
requested and cache that result, and, when generating wire format, 
convert only if the needed format was neither provided nor cached.  This 
technique would also maximally preserve the original format for use 
cases 3 and 5, which, for use case 3 at least, seems to be important to 
this list from past discussion.  To minimize memory (space) costs, the 
caching could be avoided (at the price of reconversion), or, at the 
expense of not preserving the original format, only the native format of 
an item could be retained once converted (it is generally the smallest 
for binary objects, and the most easily manipulated, but not necessarily 
the smallest, for text objects).
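
The convert-on-demand idea for a text part can be sketched roughly as 
follows.  LazyTextPart is a hypothetical class for illustration only; 
the conversions counter exists just to make the saved work visible.

```python
from typing import Optional


class LazyTextPart:
    """Keep the bytes as received; decode to a string only when asked."""

    def __init__(self, wire: bytes, charset: str = "utf-8") -> None:
        self.wire = wire              # original format, preserved untouched
        self.charset = charset
        self._text: Optional[str] = None
        self.conversions = 0          # how many decodes were actually done

    @property
    def text(self) -> str:
        # Convert on first request, then serve the cached result.
        if self._text is None:
            self._text = self.wire.decode(self.charset)
            self.conversions += 1
        return self._text

    def drop_cache(self) -> None:
        # Trade time for space: the next .text access decodes again.
        self._text = None

    def generate(self) -> bytes:
        # The provided format is already the needed one: no conversion.
        return self.wire
```

A part that is only forwarded never pays for a decode, while a part 
that is read repeatedly pays once, unless the cache is dropped to save 
memory.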

So I'd design the internal format with meta data like

MIMEpart
    formatFlag
    metaData
    7bitData
    8bitData
    binaryData
    nativeText
    nativeBLOB

where the metaData would consist of a variety of pertinent items, 
obtained by decoding provided wireData or supplied along with provided 
nativeData.

Generate could use 7bitData, 8bitData, or binaryData directly if the 
needed one exists, or create it and cache it there if it does not.

binaryData would differ from nativeBLOB only by containing the 
appropriate MIMEheaders... perhaps as a space optimization, it would 
contain only the appropriate MIMEheaders, with the binaryData being 
placed in nativeBLOB directly (since this is not a costly conversion, 
just a choice of where to store the bytes).
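
In Python terms, the outline above might look something like this 
sketch.  The field names are adapted because an identifier cannot start 
with a digit (7bitData becomes data_7bit), and generate_7bit is a 
hypothetical method that handles only one case: building a base64 wire 
form from nativeBLOB.  It is not a proposal for the actual API.

```python
import base64
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MIMEPart:
    format_flag: str                          # which form was originally provided
    meta_data: dict = field(default_factory=dict)
    data_7bit: Optional[bytes] = None         # wire forms, headers included
    data_8bit: Optional[bytes] = None
    binary_data: Optional[bytes] = None
    native_text: Optional[str] = None         # decoded forms, no MIME headers
    native_blob: Optional[bytes] = None

    def generate_7bit(self) -> bytes:
        """Return the 7bit wire form, building and caching it if absent."""
        if self.data_7bit is None:
            if self.native_blob is None:
                raise ValueError("no native_blob to convert from")
            headers = b"Content-Transfer-Encoding: base64\r\n\r\n"
            # encodebytes() ends lines with LF; wire format wants CRLF.
            body = base64.encodebytes(self.native_blob).replace(b"\n", b"\r\n")
            self.data_7bit = headers + body
        return self.data_7bit
```

Calling generate_7bit() a second time returns the cached bytes without 
converting again.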

It could also be possible that a complete, provided, wire format message 
would be retained as a single BLOB, and the appropriate format data 
items simply be offsets and lengths within that BLOB, although with 
cached metaData.
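
A small sketch of the single-BLOB variant, assuming a trivial message 
with one header and one body.  Span and view are hypothetical helpers. 
The memoryview slices refer to the original bytes without copying them.

```python
from typing import NamedTuple


class Span(NamedTuple):
    offset: int
    length: int


# The complete message exactly as received, kept as one BLOB.
message = (b"Subject: hi\r\n"
           b"\r\n"
           b"body text\r\n")

# The blank line (CRLF CRLF) separates the headers from the body.
sep = message.index(b"\r\n\r\n")
header_span = Span(0, sep)
body_span = Span(sep + 4, len(message) - (sep + 4))


def view(blob: bytes, span: Span) -> memoryview:
    """Return a zero-copy window onto one item inside the BLOB."""
    return memoryview(blob)[span.offset:span.offset + span.length]


headers = bytes(view(message, header_span))
body = bytes(view(message, body_span))
```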

Of course, there is already a design within the existing code, and the 
cost of wholesale redesign may be more than can be afforded.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


