[Email-SIG] fixing the current email module

Thu Oct 8 04:05:08 CEST 2009

On Oct 6, 2009, at 10:18 AM, Stephen J. Turnbull wrote:

> In the following I use Python 3 terminology: strings are Python
> Unicode objects, and bytes are Python bytes objects.

Exactly.  8-bit strings are dead to us.

>> for the internal form makes it quick and easy to produce a complete
>> message when called for,
>
> Only for certain kinds of messages, such as automated forwards and
> signed MIME parts, and cron's messages.  For those, there are great
> advantages to spewing things verbatim as you got them off the wire or
> the disk.  But even there, as long as we use the natural embedding of
> bytes in Unicode (i.e., interpret bytes as ISO 8859-1) it's easy and not
> particularly inefficient to use strings.
>
> For anything else, storing in wire format is going to require checking
> format (of the stored data if the format is variable, and always of
> the requesting API) on all attribute accesses, and conversions on
> many, even most attribute accesses.

I think that's going to be the case either way.  Some applications are
going to want bytes, others strings, so there need to be APIs for both.
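
To make the two points above concrete, here is a minimal sketch, assuming a
Python 3 email package that provides email.message_from_bytes (the addresses
and the sample message are made up for illustration).  It shows that the
ISO 8859-1 embedding of bytes in strings round-trips without loss, and that
one parsed message can hand back strings for headers and bytes for the body.

```python
import email

# Hypothetical wire-format input with one non-ASCII byte (0xE9) in the body.
raw = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: test\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)

# The "natural embedding": each byte maps to the code point with the same
# value, so decoding and re-encoding as ISO 8859-1 loses nothing.
as_text = raw.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# One message object, two views: headers as strings, body as bytes.
msg = email.message_from_bytes(raw)
subject = msg["Subject"]                      # a str
payload_bytes = msg.get_payload(decode=True)  # a bytes object
```

This is only one way to expose both views; it says nothing about which form
the message object should use internally.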

> So the question to me is "what are the primary use cases for the email
> module, and how do they affect the choice of internal representation?"
> I can't claim special expertise on "how", I'll leave that up to
> Barry.  Here are some use cases I can think of.
>
> 1.  Debugging programs using the email module.  Maybe that's a +1 for
>    internally storing textual data in string form.
>
> 2.  MUA #1: Composition.  Input will be strings and multimedia file
>    names, output will be bytes.  Will attributes of message objects
>    be manipulated?  Not in a conventional MUA, but an email-based MUA
>    might find uses for that.
>
> 3.  MUA #2: Reading.  Input will often be bytes (spool files, IMAP
>    data).  Could be strings, though, depending on the internal format
>    of folders.  Output will be strings and multimedia objects.  Lots
>    of string processing, especially generating folder directory
>    displays from message headers.
>
> 4.  Mailing list processor.  Message input will be bytes.
>    Configuration input, including heading and footer texts that may
>    be added, is likely to be strings.  Header manipulation (adding
>    topics, sequence numbers, RFC 2369 headers) most conveniently done
>    with strings.  Output will be bytes.
>
> 5.  Mailing list archiver.  Input will be bytes or message objects,
>    output will be strings (typically HTML documents or XML
>    fragments).
>
> 6.  Spam/virus detection.  Input may be bytes or message objects.
>    Lots of internal string processing; in most cases the text/* parts
>    need to be converted to strings before grepping; in some cases
>    even images or executables may be reconstituted to look for
>    malware signatures.  Output may be a flag or signal, or the
>    message itself may be edited (typically to provide headers
>    recording degree of spamminess, trace headers, maybe a body
>    heading; in some cases, a new message may be generated with the
>    suspected spam as a message/rfc822 MIME body part).
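
Use case 4 (bytes in, string manipulation of headers, bytes out) could be
sketched roughly as follows.  The function name process_for_list, its
parameters, and the example addresses are hypothetical, and the sketch
assumes a Python 3 email package with message_from_bytes and
Message.as_bytes; it is not how any particular list manager does it.

```python
from email import message_from_bytes


def process_for_list(raw: bytes, list_addr: str, prefix: str) -> bytes:
    """Parse wire bytes, edit headers as strings, return wire bytes."""
    msg = message_from_bytes(raw)

    # Header manipulation is done with ordinary strings.
    subject = msg.get("Subject", "")
    if prefix not in subject:
        # Assigning to a header appends, so delete the old one first.
        del msg["Subject"]
        msg["Subject"] = f"{prefix} {subject}".strip()

    # An RFC 2369 header telling MUAs where to post to the list.
    msg["List-Post"] = f"<mailto:{list_addr}>"

    return msg.as_bytes()


incoming = b"From: member@example.org\nSubject: hi\n\nHello, list.\n"
outgoing = process_for_list(incoming, "email-sig@example.org", "[Email-SIG]")
```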

I think this is a very good list.  The key thing from an application's  
point of view is that sometimes messages are parsed and sometimes they  
are crafted.  When parsed, the raw input can come from a completely  
unknown and untrusted source such as the puking mouth of an MTA.   
Other times it comes from a big blob of string in a doctest.  When  
crafted, it's almost always a program building up a message tree from  
scratch, or possibly the manipulation of an existing message (e.g.  
MIME filter).
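
The parsed-versus-crafted split might look like this in practice.  This is a
sketch only, assuming a recent Python 3 email package (message_from_bytes,
message_from_string, and email.message.EmailMessage); the sample inputs,
addresses, and the X-Filtered header are invented for illustration.

```python
from email import message_from_bytes, message_from_string
from email.message import EmailMessage

# Parsed, case 1: bytes from an unknown and untrusted source.  With the
# default policy the parser does not raise on malformed input; it records
# what it found in the message's list of defects.
untrusted = b"\xff\xfe this is not a well-formed message\n"
suspect = message_from_bytes(untrusted)
problems = suspect.defects

# Parsed, case 2: a blob of string, as in a doctest.
from_doctest = message_from_string("Subject: hello\n\nbody\n")

# Manipulating an existing message, as a MIME filter might.
from_doctest["X-Filtered"] = "yes"

# Crafted: a program builds the message from scratch, then serializes it.
crafted = EmailMessage()
crafted["From"] = "list@example.com"
crafted["To"] = "member@example.com"
crafted["Subject"] = "Digest"
crafted.set_content("Hello from a program.\n")
wire = crafted.as_bytes()
```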

-Barry
