[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Tue Oct 6 11:28:41 CEST 2009
On approximately 10/3/2009 10:09 AM, came the following characters from
the keyboard of Timothy Farrell:
> I agree with Barry insofar as accepting bytes or strings on the input with internal processing in bytes and output bytes or strings depending on the content parsed.
>
> Forgive my ignorance...why does converting bytes to strings have to be a mess? Rather than having two Feedparsers, can't we just pass a default encoding when instantiating a feedparser and have it read from the MIME headers otherwise? If no encoding is passed and one can't be determined, simply output as bytes or try a default and raise an exception if it fails.
>
> If providing the default encoding, no such range check is needed.
>
> ----- Original Message -----
> From: "Stephen J. Turnbull" <stephen at xemacs.org>
> To: "Barry Warsaw" <barry at python.org>
> Cc: "Timothy Farrell" <tfarrell at owassobible.org>, email-sig at python.org
> Sent: Saturday, October 3, 2009 10:41:48 AM GMT -06:00 US/Canada Central
> Subject: Re: [Email-SIG] fixing the current email module
>
> Barry Warsaw writes:
>
> > So the basic model is: accept strings or bytes at the edges,
> > process everything internally as bytes, output strings and bytes at
> > the edges.
>
> In a certain pedantic sense, that can't be right, because bytes alone
> can't represent strings.
>
> Practically, you are going to need to say how a bytes or bytearray is to
> be interpreted as a string, and that is going to be one big mess.
> (MIME?)
>
> Going the other way around you have no such problem, or rather the
> trivial embedding works fine, except that you have to do a range check
> at some point before you convert to bytes.
Email messages are bytes. Usually restricted to bytes in the range
32-127, but sometimes permitted to be 0-255 (8bit encoding).
Email messages carry sufficient information to convert bytes to strings
(usually; and sufficient defaults to cover the other cases adequately,
even if not with 100% certainty).
So if Barry is considering that the internal form is bytes, particularly
bytes encoded via email RFCs, then I can't argue with that being a
reasonable internal form.... except for one problem, 2 paragraphs below.
The only mess that I can see Stephen referring to is the fact that the
email RFCs define rather messy encoding formats and character set
specifications. There isn't much cure for this, AFAICS, other than
perhaps keeping the bytes in segmented structures, with cached metadata
to speed repeated references. Using any format other than email format
means knowing how to translate that format to/from email format, and
to/from API format; this means coding two translation routines instead
of one.
The choice of email RFC byte formats for the internal form makes it
quick and easy to produce a complete message when called for, and to
defer interpretation when a message is fed in.... sometimes, and herein
lies the catch....
One problem with storing messages in bytes format: it seems to me that
the choice of which of several legal email bytes formats to represent
various email parts (texts and attachments) is problematic for using
email format bytes as the internal storage format. An unsophisticated
email library could assume that the transfer encoding is always 7bit,
and that should be acceptable in all circumstances. A more
sophisticated email library would provide support for either 7bit or
8bit transfer encodings.... but the choice of the bytes formats, and
MIME type encodings of various message parts to support that difference
would be significant. It seems that the present email lib provides a
way to create only a 7bit or 8bit message (and apparently not binary
encoding), meaning that the whole message assembly process has to be
done after initiating a connection with the SMTP server, to determine
whether it supports 8bit (or binary) encoding or not. A more abstract
internal format could defer that choice to the generate step, keeping
items as str or binary blobs prior to that step.
IIUC, 7bit requires that text and binary be encoded to remove
"difficult" byte values from the byte stream, so choosing quopri or
base64 is appropriate at MIME part definition time to make that choice
(although an optimal sized choice could be made based on the data), in
the event that generate requests 7bit.
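That "optimal sized choice" is easy to make mechanically; a minimal sketch (the helper name is my own invention) comparing the two 7bit-safe encodings using the standard library:

```python
import base64
import quopri

def smaller_encoding(payload: bytes) -> str:
    """Pick quoted-printable or base64 by which produces the smaller
    encoded body -- the size-based choice described above (a sketch)."""
    qp_len = len(quopri.encodestring(payload))
    b64_len = len(base64.encodebytes(payload))
    return "quoted-printable" if qp_len <= b64_len else "base64"
```

Mostly-ASCII text comes out smaller as quoted-printable, while arbitrary binary data comes out smaller as base64.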
8bit, in contrast, has no such requirement: it declares that there are
no difficult characters except NULL, CR and LF. However, because no 8bit
encodings are defined, the (inefficient, 7-bit) quopri or base64 may
still have to be used to avoid lines that are too long, and to encode
NULL, CR and LF. 8-bit or UTF-8 text containing no NULL characters and
no long lines would qualify without encoding.
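Those qualification rules can be checked directly; a sketch (helper name invented here), assuming the 998-octet line limit from the RFCs:

```python
def qualifies_for_8bit(payload: bytes) -> bool:
    """Check whether a body can travel as 8bit without re-encoding:
    no NULL bytes, no bare CR or LF outside of CRLF line endings,
    and no line longer than 998 octets."""
    if b"\x00" in payload:
        return False
    for line in payload.split(b"\r\n"):
        # Any leftover CR or LF here is a bare (unpaired) one.
        if b"\r" in line or b"\n" in line:
            return False
        if len(line) > 998:
            return False
    return True
```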
Finally, binary declares that there are no difficult characters at all.
Therefore, the quopri or base64 choice could be ignored, and the raw
data passed through.
Choosing a particular Content-Transfer-Encoding as the internal storage
format forces transcoding to the other Content-Transfer-Encoding values
on the fly after connecting to the SMTP server (using an apparently
non-existent parameter to the generate method); not supporting
on-the-fly transcoding would force the user to choose a particular
Content-Transfer-Encoding up front, requiring the connection to the
SMTP server even earlier in the process.
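Whether a given server offers 8bit or binary transport can at least be discovered from its EHLO response; a sketch using smtplib's has_extn (the function name pick_transfer_encoding is made up):

```python
import smtplib

def pick_transfer_encoding(server: smtplib.SMTP) -> str:
    """Choose a Content-Transfer-Encoding based on what the connected
    server advertises; server.ehlo() must already have been called.
    BINARYMIME (RFC 3030) is the rarely-offered binary extension."""
    if server.has_extn("binarymime"):
        return "binary"
    if server.has_extn("8bitmime"):
        return "8bit"
    return "7bit"
```

The point of the proposal is that this call happens after connecting, which is exactly when the current library has already forced the encoding choice.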
I observe that most of my SMTP providers do not support binary
transport, but it seems that MS Exchange does.
I observe that binary transport is more efficient than 7bit or 8bit.
I observe that even with binary transport, the MIME headers must still
be in US-ASCII, by definition, so the headers need not be generated
differently for different transports... only the
Content-Transfer-Encoding, and the content itself, would be affected by
deferring that choice to generate time.
Perhaps binary transport, with meta-data indicating whether the user
prefers quopri or base64 for parts that must be encoded for 7bit or 8bit
transport, would be an appropriate storage format for the email
library. This would allow the quopri or base64 encodings to be
performed on-the-fly, only if needed, by adding a new parameter to
generate that specifies the Content-Transfer-Encoding (which should
default to 7bit for maximal server compatibility, or 8bit if the user
specified that along the way so that backwards compatibility is preserved).
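As a sketch of that storage scheme, with every name hypothetical: each part is held as raw binary plus the user's preferred-encoding metadata, and the wire body is produced only at generate time, once the target transport is known:

```python
import base64
import quopri

def encode_part(raw: bytes, target_cte: str,
                preferred: str = "base64") -> tuple:
    """Hypothetical generate-time step: return (body, cte-header-value)
    for one part, deferring the encoding choice until the transport
    supported by the server is known."""
    if target_cte == "binary":
        # Binary transport: pass the raw data through untouched.
        return raw, "binary"
    # 7bit/8bit transport: fall back to the user's preferred encoding.
    if preferred == "quoted-printable":
        return quopri.encodestring(raw), "quoted-printable"
    return base64.encodebytes(raw), "base64"
```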
N.B. I note that the documentation for the 2.6.3 section 19.1.3
MIMEText class (reproduced below) is confusing:
class email.mime.text.MIMEText(_text[, _subtype[, _charset]])
<http://docs.python.org/library/email.mime.html#email.mime.text.MIMEText>

    Module: email.mime.text

    A subclass of MIMENonMultipart, the MIMEText class is used to
    create MIME objects of major type text. _text is the string for
    the payload. _subtype is the minor type and defaults to plain.
    _charset is the character set of the text and is passed as a
    parameter to the MIMENonMultipart constructor; it defaults to
    us-ascii. No guessing or encoding is performed on the text data.

    Changed in version 2.4: The previously deprecated _encoding
    argument has been removed. Encoding happens implicitly based on
    the _charset argument.
The confusion is that it states there is no encoding performed, and then
it states that encoding is implicit. It is not clear what it actually
does, if anything. The 3.2a0 documentation further muddies the water by
removing the last paragraph.
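For what it's worth, exercising MIMEText (here under a current Python 3, where the behavior appears to be the same) suggests the "implicit" encoding is driven by _charset: a us-ascii body stays 7bit, while a utf-8 body gets base64-encoded:

```python
from email.mime.text import MIMEText

msg_ascii = MIMEText("hello", "plain")            # default us-ascii
msg_utf8 = MIMEText("h\u00e9llo", "plain", "utf-8")  # non-ASCII text

# "No guessing or encoding" applies to the default charset; a utf-8
# charset implicitly selects a base64 Content-Transfer-Encoding.
print(msg_ascii["Content-Transfer-Encoding"])  # 7bit
print(msg_utf8["Content-Transfer-Encoding"])   # base64
```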
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking