[Email-SIG] fixing the current email module
Stephen J. Turnbull
stephen at xemacs.org
Tue Oct 6 16:18:03 CEST 2009
In the following I use Python 3 terminology: strings are Python
Unicode objects, and bytes are Python bytes objects.
Glenn Linderman writes:
> Email messages are bytes. Usually restricted to bytes in the range
> 32-127, but sometimes permitted to be 0-255 (8bit encoding).
This is irrelevant to our internal representation. It is both trivial
and efficient to convert the wire format (bytes) to a string
internally (at least for email messages up to say 5MB).
Which internal representation makes the most sense depends on what we
are going to do with that internal representation. At this point I'm
not sure that strings are better than bytes, but I'm quite sure that
I've seen no convincing argument that bytes are TOOWTDI.
Nor is it at all obvious to me that messages should be stored in wire format.
> Using any other format than email format, means knowing how to
> translate that format to/from email format, and to/from API
> format... this means coding two translation routines instead of
> one.
That sounds reasonable, but it's a false economy. The formats you're
talking about here are the transfer encodings, and we need to be able
to decode all of them, and produce all of them. Internally, they can
be represented by a single format, so you need internal-to-transfer
and transfer-to-internal for about six of them (7bit, 8bit, binary ==
Python bytes, BASE64, quoted-printable, Python string).
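To make the "single internal format" point concrete, here is a minimal sketch, assuming the internal form is raw Python bytes. The table CODECS and the helpers to_internal and to_wire are hypothetical names for illustration, not part of the email module; only the base64 and quopri calls are real stdlib functions.

```python
import base64
import quopri

# Hypothetical table: transfer encoding name -> (decode, encode).
# Every pair converts between wire bytes and one internal form (raw bytes),
# so each encoding needs exactly two routines, however many APIs sit on top.
CODECS = {
    "7bit": (bytes, bytes),        # identity: wire bytes are the payload
    "8bit": (bytes, bytes),
    "binary": (bytes, bytes),
    "base64": (base64.b64decode, base64.encodebytes),
    "quoted-printable": (quopri.decodestring, quopri.encodestring),
}


def to_internal(wire: bytes, cte: str) -> bytes:
    """Transfer-to-internal: undo the content transfer encoding."""
    decode, _ = CODECS[cte.lower()]
    return decode(wire)


def to_wire(payload: bytes, cte: str) -> bytes:
    """Internal-to-transfer: apply the content transfer encoding."""
    _, encode = CODECS[cte.lower()]
    return encode(payload)
```

With something like this, the parser would call to_internal once per body part and the generator would call to_wire once, which is the "once at parse time and once at generate time" cost.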
As for runtime economy, if conversion is done once at parse time and
once at generate time it is not a big burden, not as compared to the
overhead of the Python language itself.
> The choice of email RFC byte formats
By "byte format", do you mean "wire format"?
> for the internal form makes it quick and easy to produce a complete
> message when called for,
Only for certain kinds of messages, such as automated forwards and
signed MIME parts, and cron's messages. For those, there are great
advantages to spewing things verbatim as you got them off the wire or
the disk. But even there, as long as we use the natural embedding of
bytes in Unicode (i.e., interpret bytes as ISO 8859-1) it's easy and not
particularly inefficient to use strings.
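A short sketch of that natural embedding, assuming Python 3's "latin-1" codec (its name for ISO 8859-1). The function names wire_to_str and str_to_wire are made up for illustration.

```python
def wire_to_str(wire: bytes) -> str:
    """Embed bytes in Unicode: byte value N becomes code point U+00NN."""
    return wire.decode("latin-1")


def str_to_wire(text: str) -> bytes:
    """Inverse mapping; raises UnicodeEncodeError for code points above U+00FF."""
    return text.encode("latin-1")


# Every possible byte value survives the round trip unchanged, so a
# message stored this way can still be spewed verbatim.
all_bytes = bytes(range(256))
as_text = wire_to_str(all_bytes)
assert [ord(ch) for ch in as_text] == list(range(256))
assert str_to_wire(as_text) == all_bytes
```

The resulting string is not meaningful text for non-ASCII content until the real charset is applied; it is only a lossless carrier for the original octets.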
For anything else, storing in wire format is going to require checking
format (of the stored data if the format is variable, and always of
the requesting API) on all attribute accesses, and conversions on
many, even most attribute accesses.
> One problem with storing messages in bytes format: it seems to me that
> the choice of which of several legal email bytes formats
None of them are very happy. The email module needs to be able to
both read and produce all of 7bit, 8bit, and binary, and they are in
fact pretty well trivial to do.
So the question to me is "what are the primary use cases for the email
module, and how do they affect the choice of internal representation?"
I can't claim special expertise on "how"; I'll leave that up to
Barry. Here are some use cases I can think of.
1. Debugging programs using the email module. Maybe that's a +1 for
internally storing textual data in string form.
2. MUA #1: Composition. Input will be strings and multimedia file
names, output will be bytes. Will attributes of message objects
be manipulated? Not in a conventional MUA, but an email-based MUA
might find uses for that.
3. MUA #2: Reading. Input will often be bytes (spool files, IMAP
data). Could be strings, though, depending on the internal format
of folders. Output will be strings and multimedia objects. Lots
of string processing, especially generating folder directory
displays from message headers.
4. Mailing list processor. Message input will be bytes.
Configuration input, including heading and footer texts that may
be added, is likely to be strings. Header manipulation (adding
topics, sequence numbers, RFC 2369 headers) most conveniently done
with strings. Output will be bytes.
5. Mailing list archiver. Input will be bytes or message objects,
output will be strings (typically HTML documents or XML
fragments).
6. Spam/virus detection. Input may be bytes or message objects.
Lots of internal string processing; in most cases the text/* parts
need to be converted to strings before grepping; in some cases
even images or executables may be reconstituted to look for
malware signatures. Output may be a flag or signal, or the
message itself may be edited (typically to provide headers
recording degree of spamminess, trace headers, maybe a body
heading; in some cases, a new message may be generated with the
suspected spam as a message/rfc822 MIME body part).
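As an illustration of use case 4 (bytes in, string manipulation of headers, bytes out), here is a rough sketch. It assumes the policy-based API of the email package in current Python 3 rather than the module under discussion. The function add_list_headers, the list name "example", and the example.org addresses are all hypothetical.

```python
from email import message_from_bytes, policy


def add_list_headers(wire: bytes, seq: int) -> bytes:
    """Parse wire bytes, edit headers as strings, and emit wire bytes again."""
    msg = message_from_bytes(wire, policy=policy.SMTP)

    # RFC 2369 headers, set from ordinary Python strings.
    msg["List-Post"] = "<mailto:example@lists.example.org>"
    msg["List-Unsubscribe"] = "<mailto:example-leave@lists.example.org>"

    # Prefix the subject with the list name and a sequence number.
    subject = msg["Subject"] or ""
    del msg["Subject"]
    msg["Subject"] = f"[example {seq}] {subject}"

    return msg.as_bytes()


incoming = b"From: a@example.org\r\nSubject: hello\r\n\r\nbody text\r\n"
outgoing = add_list_headers(incoming, 7)
```

Note that the caller only ever sees bytes at the edges; whether the message object keeps strings or bytes inside is invisible at this level, which is why the use cases rather than the wire format should drive the choice.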