[Email-SIG] fixing the current email module

Stephen J. Turnbull stephen at xemacs.org
Tue Oct 6 16:18:03 CEST 2009


In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.

Glenn Linderman writes:

 > Email messages are bytes.  Usually restricted to bytes in the range 
 > 32-127, but sometimes permitted to be 0-255 (8bit encoding).

This is irrelevant to our internal representation.  It is both trivial
and efficient to convert the wire format (bytes) to a string
internally (at least for email messages up to say 5MB).

Which internal representation makes the most sense depends on what we
are going to do with that internal representation.  At this point I'm
not sure that strings are better than bytes, but I'm quite sure that
I've seen no convincing argument that bytes are TOOWTDI.

Nor is it at all obvious to me that messages should be stored in wire
format.

 > Using any other format than email format, means knowing how to
 > translate that format to/from email format, and to/from API
 > format... this means coding two translation routines instead of
 > one.

That sounds reasonable, but it's a false economy.  The formats you're
talking about here are the transfer encodings, and we need to be able
to decode all of them, and produce all of them.  Internally, they can
be represented by a single format, so you need internal-to-transfer
and transfer-to-internal for about six of them (7bit, 8bit, binary ==
Python bytes, BASE64, quoted-printable, Python string).
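
A minimal sketch of the "one internal format, a pair of converters per
transfer encoding" idea, using only the standard base64 and quopri
modules.  The names CODECS, transfer_to_internal and
internal_to_transfer are hypothetical, the internal form is taken to be
plain bytes purely for illustration, and the "Python string" case is
left out because it also needs a charset.

```python
import base64
import quopri


def _identity(data: bytes) -> bytes:
    # 7bit, 8bit and binary need no transformation of the payload itself.
    return data


# Hypothetical table: Content-Transfer-Encoding -> (decode, encode).
CODECS = {
    "7bit": (_identity, _identity),
    "8bit": (_identity, _identity),
    "binary": (_identity, _identity),
    "base64": (base64.b64decode, base64.encodebytes),
    "quoted-printable": (quopri.decodestring, quopri.encodestring),
}


def transfer_to_internal(payload: bytes, cte: str) -> bytes:
    """Decode a body from its transfer encoding (done once, at parse time)."""
    decode, _ = CODECS[cte.lower()]
    return decode(payload)


def internal_to_transfer(payload: bytes, cte: str) -> bytes:
    """Encode a body for the wire (done once, at generate time)."""
    _, encode = CODECS[cte.lower()]
    return encode(payload)


if __name__ == "__main__":
    raw = b"caf\xe9 = 100%"
    for name in CODECS:
        encoded = internal_to_transfer(raw, name)
        print(name, encoded, transfer_to_internal(encoded, name) == raw)
```

Each encoding costs one small table entry, and nothing outside the
table needs to know which encoding a message arrived in.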

As for runtime economy, if conversion is done once at parse time and
once at generate time it is not a big burden, not as compared to the
overhead of the Python language itself.

 > The choice of email RFC byte formats

By "byte format", do you mean "wire format"?

 > for the internal form makes it quick and easy to produce a complete
 > message when called for,

Only for certain kinds of messages, such as automated forwards and
signed MIME parts, and cron's messages.  For those, there are great
advantages to spewing things verbatim as you got them off the wire or
the disk.  But even there, as long as we use the natural embedding of
bytes in Unicode (i.e., interpret bytes as ISO 8859-1) it's easy and not
particularly inefficient to use strings.
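
To illustrate the natural embedding: every byte value 0-255 maps to the
Unicode code point with the same number under ISO 8859-1, so the round
trip between bytes and strings is lossless.  The helper names
wire_to_str and str_to_wire below are hypothetical, just a sketch.

```python
def wire_to_str(data: bytes) -> str:
    """Embed raw bytes in a string, one code point per byte."""
    return data.decode("latin-1")


def str_to_wire(text: str) -> bytes:
    """Recover the original bytes; raises if text has code points > 255."""
    return text.encode("latin-1")


if __name__ == "__main__":
    wire = bytes(range(256))       # every possible byte value
    text = wire_to_str(wire)
    print(len(text), str_to_wire(text) == wire)
```

Note that this is a storage trick only: the resulting string is not the
decoded text of the message unless the message happens to be Latin-1.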

For anything else, storing in wire format is going to require checking
format (of the stored data if the format is variable, and always of
the requesting API) on all attribute accesses, and conversions on
many, even most attribute accesses.

 > One problem with storing messages in bytes format: it seems to me that 
 > the choice of which of several legal email bytes formats

None of them are very happy.  The email module needs to be able to
both read and produce all of 7bit, 8bit, and binary, and they are in
fact pretty well trivial to do.

So the question to me is "what are the primary use cases for the email
module, and how do they affect the choice of internal representation?"
I can't claim special expertise on "how"; I'll leave that up to
Barry.  Here are some use cases I can think of.

1.  Debugging programs using the email module.  Maybe that's a +1 for
    internally storing textual data in string form.

2.  MUA #1: Composition.  Input will be strings and multimedia file
    names, output will be bytes.  Will attributes of message objects
    be manipulated?  Not in a conventional MUA, but an email-based MUA
    might find uses for that.

3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
    data).  Could be strings, though, depending on the internal format
    of folders.  Output will be strings and multimedia objects.  Lots
    of string processing, especially generating folder directory
    displays from message headers.

4.  Mailing list processor.  Message input will be bytes.
    Configuration input, including heading and footer texts that may
    be added, is likely to be strings.  Header manipulation (adding
    topics, sequence numbers, RFC 2369 headers) most conveniently done
    with strings.  Output will be bytes.

5.  Mailing list archiver.  Input will be bytes or message objects,
    output will be strings (typically HTML documents or XML
    fragments).

6.  Spam/virus detection.  Input may be bytes or message objects.
    Lots of internal string processing; in most cases the text/* parts
    need to be converted to strings before grepping; in some cases
    even images or executables may be reconstituted to look for
    malware signatures.  Output may be a flag or signal, or the
    message itself may be edited (typically to provide headers
    recording degree of spamminess, trace headers, maybe a body
    heading; in some cases, a new message may be generated with the
    suspected spam as a message/rfc822 MIME body part).
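
Use case 4 (bytes in, string manipulation of headers, bytes out) can be
sketched as follows.  This is only an illustration: it relies on
email.message_from_bytes and Message.as_bytes, which were added in
Python 3 releases later than this post, and the function name, list
tag and list address are hypothetical.

```python
import email


def process_for_list(wire: bytes, seq: int) -> bytes:
    """Parse a message from bytes, edit headers as strings, emit bytes."""
    msg = email.message_from_bytes(wire)

    # Header values are handled as strings here.
    subject = str(msg["Subject"] or "")
    del msg["Subject"]
    msg["Subject"] = f"[demo {seq}] {subject}"   # hypothetical list tag

    # An RFC 2369 header; the address is a placeholder.
    msg["List-Post"] = "<mailto:list@example.org>"

    return msg.as_bytes()


if __name__ == "__main__":
    incoming = b"From: a@example.org\nSubject: hello\n\nbody\n"
    print(process_for_list(incoming, 7).decode("ascii"))
```

The caller sees bytes at both ends; whether the message object keeps
bytes or strings in between is exactly the open design question.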


