[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Tue Oct 6 11:28:41 CEST 2009
On approximately 10/3/2009 10:09 AM, came the following characters from
the keyboard of Timothy Farrell:
> I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.
>
> Forgive my ignorance...why does converting bytes to strings have to be a mess? Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise? If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.
>
> If providing the default encoding, no such range check is needed.
>
> ----- Original Message -----
> From: "Stephen J. Turnbull" <stephen at xemacs.org>
> To: "Barry Warsaw" <barry at python.org>
> Cc: "Timothy Farrell" <tfarrell at owassobible.org>, email-sig at python.org
> Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
> Subject: Re: [Email-SIG] fixing the current email module
>
> Barry Warsaw writes:
>
> > So the basic model is: accept strings or bytes at the edges,
> > process everything internally as bytes, output strings and bytes at
> > the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.
Email messages are bytes. Usually restricted to bytes in the range
32-127, but sometimes permitted to be 0-255 (8bit encoding).
Email messages carry sufficient information to convert bytes to strings
(usually; and sufficient defaults to cover the other cases adequately,
even if not with 100% certainty).
So if Barry is considering that the internal form is bytes, particularly
bytes encoded via email RFCs, then I can't argue with that being a
reasonable internal form.... except for one problem, 2 paragraphs below.
The only mess that I can see Stephen referring to is the fact that the
email RFCs define rather messy encoding formats and character set
specifications. There isn't much cure for this, AFAICS, other than
perhaps keeping the bytes in segmented structures, with cached metadata
to speed repeated references. Using any format other than email format
means knowing how to translate that format to/from email format, and
to/from API format; this means coding two translation routines instead
of one.
The choice of email RFC byte formats for the internal form makes it
quick and easy to produce a complete message when called for, and to
defer interpretation when a message is fed in.... sometimes, and herein
lies the catch....
One problem with storing messages in bytes format: it seems to me that
the choice of which of several legal email bytes formats to represent
various email parts (texts and attachments) is problematic for using
email format bytes as the internal storage format. An unsophisticated
email library could assume that the transfer encoding is always 7bit,
and that should be acceptable in all circumstances. A more
sophisticated email library would provide support for either 7bit or
8bit transfer encodings.... but the choice of the bytes formats, and
MIME type encodings of various message parts to support that difference
would be significant. It seems that the present email lib provides a
way to create only a 7bit or 8bit message (and apparently not binary
encoding), meaning that the whole message assembly process has to be
done after initiating a connection with the SMTP server, to determine
whether it supports 8bit (or binary) encoding or not. A more abstract
internal format could defer that choice to the generate step, keeping
items as str or binary blobs prior to that step.
IIUC, 7bit requires that text and binary be encoded to remove
"difficult" byte values from the byte stream, so choosing quopri or
base64 is appropriate at MIME part definition time to make that choice
(although an optimal sized choice could be made based on the data), in
the event that generate requests 7bit.
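That "optimal sized choice" is easy to make mechanically; a minimal sketch (the helper name is my own invention) comparing the two 7bit-safe encodings using the standard library:

```python
import base64
import quopri

def smaller_encoding(payload: bytes) -> str:
    """Pick quoted-printable or base64 by which produces the smaller
    encoded body -- the size-based choice described above (a sketch)."""
    qp_len = len(quopri.encodestring(payload))
    b64_len = len(base64.encodebytes(payload))
    return "quoted-printable" if qp_len <= b64_len else "base64"
```

Mostly-ASCII text comes out smaller as quoted-printable, while arbitrary binary data comes out smaller as base64.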
8bit, in contrast, has no such requirement: it declares that there are
no difficult characters except NULL, CR and LF. However, because no 8bit
encodings are defined, the (inefficient, 7-bit) quopri or base64 may
still have to be used to avoid lines that are too long, and to encode
NULL, CR and LF. 8-bit or UTF-8 text containing no NULL characters and
no long lines would qualify without encoding.
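Those qualification rules can be checked directly; a sketch (helper name invented here), assuming the 998-octet line limit from the RFCs:

```python
def qualifies_for_8bit(payload: bytes) -> bool:
    """Check whether a body can travel as 8bit without re-encoding:
    no NULL bytes, no bare CR or LF outside of CRLF line endings,
    and no line longer than 998 octets."""
    if b"\x00" in payload:
        return False
    for line in payload.split(b"\r\n"):
        # Any leftover CR or LF here is a bare (unpaired) one.
        if b"\r" in line or b"\n" in line:
            return False
        if len(line) > 998:
            return False
    return True
```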
Finally, binary declares that there are no difficult characters at all.
Therefore, the quopri or base64 choice could be ignored, and the raw
data passed through.
Choosing a particular Content-Transfer-Encoding as the internal storage
format forces transcoding to the other Content-Transfer-Encoding values
on the fly after connecting to the SMTP server (using an apparently
non-existent parameter to the generate method); not supporting
on-the-fly transcoding would force the user to choose a particular
Content-Transfer-Encoding up front, requiring the connection to the
SMTP server even earlier in the process.
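Whether a given server offers 8bit or binary transport can at least be discovered from its EHLO response; a sketch using smtplib's has_extn (the function name pick_transfer_encoding is made up):

```python
import smtplib

def pick_transfer_encoding(server: smtplib.SMTP) -> str:
    """Choose a Content-Transfer-Encoding based on what the connected
    server advertises; server.ehlo() must already have been called.
    BINARYMIME (RFC 3030) is the rarely-offered binary extension."""
    if server.has_extn("binarymime"):
        return "binary"
    if server.has_extn("8bitmime"):
        return "8bit"
    return "7bit"
```

The point of the proposal is that this call happens after connecting, which is exactly when the current library has already forced the encoding choice.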
I observe that most of my SMTP providers do not support binary
transport, but it seems that MS Exchange does.
I observe that binary transport is more efficient than 7bit or 8bit.
I observe that even with binary transport, the MIME headers must still
be in US-ASCII, by definition, so the headers need not be generated
differently for different transports... only the
Content-Transfer-Encoding, and the content itself, would be affected by
deferring that choice to generate time.
Perhaps binary transport, with meta-data indicating whether the user
prefers quopri or base64 for parts that must be encoded for 7bit or 8bit
transport, would be an appropriate storage format for the email
library. This would allow the quopri or base64 encodings to be
performed on-the-fly, only if needed, by adding a new parameter to
generate that specifies the Content-Transfer-Encoding (which should
default to 7bit for maximal server compatibility, or 8bit if the user
specified that along the way so that backwards compatibility is preserved).
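As a sketch of that storage scheme, with every name hypothetical: each part is held as raw binary plus the user's preferred-encoding metadata, and the wire body is produced only at generate time, once the target transport is known:

```python
import base64
import quopri

def encode_part(raw: bytes, target_cte: str,
                preferred: str = "base64") -> tuple:
    """Hypothetical generate-time step: return (body, cte-header-value)
    for one part, deferring the encoding choice until the transport
    supported by the server is known."""
    if target_cte == "binary":
        # Binary transport: pass the raw data through untouched.
        return raw, "binary"
    # 7bit/8bit transport: fall back to the user's preferred encoding.
    if preferred == "quoted-printable":
        return quopri.encodestring(raw), "quoted-printable"
    return base64.encodebytes(raw), "base64"
```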
N.B. I note that the documentation for the 2.6.3 section 19.1.3
MIMEText class (reproduced below) is confusing:
class email.mime.text.MIMEText(_text[, _subtype[, _charset]])
<http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText>

    Module: email.mime.text

    A subclass of MIMENonMultipart, the MIMEText class is used to
    create MIME objects of major type text. _text is the string for
    the payload. _subtype is the minor type and defaults to plain.
    _charset is the character set of the text and is passed as a
    parameter to the MIMENonMultipart constructor; it defaults to
    us-ascii. No guessing or encoding is performed on the text data.

    Changed in version 2.4: The previously deprecated _encoding
    argument has been removed. Encoding happens implicitly based on
    the _charset argument.
The confusion is that it states there is no encoding performed, and then
it states that encoding is implicit. It is not clear what it actually
does, if anything. The 3.2a0 documentation further muddies the water by
removing the last paragraph.
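For what it's worth, exercising MIMEText (here under a current Python 3, where the behavior appears to be the same) suggests the "implicit" encoding is driven by _charset: a us-ascii body stays 7bit, while a utf-8 body gets base64-encoded:

```python
from email.mime.text import MIMEText

msg_ascii = MIMEText("hello", "plain")            # default us-ascii
msg_utf8 = MIMEText("h\u00e9llo", "plain", "utf-8")  # non-ASCII text

# "No guessing or encoding" applies to the default charset; a utf-8
# charset implicitly selects a base64 Content-Transfer-Encoding.
print(msg_ascii["Content-Transfer-Encoding"])  # 7bit
print(msg_utf8["Content-Transfer-Encoding"])   # base64
```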
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking