[Python-Dev] Completing the email6 API changes.

Mon Sep 2 00:10:39 CEST 2013

On Sat, 31 Aug 2013 18:57:56 +0900, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
> R. David Murray writes:
> 
>  > But I would certainly appreciate review from anyone so moved, since I
>  > haven't gotten any yet.
> 
> I'll try to make time for a serious (but obviously partial) review by
> Monday.
> 
> I don't know if this is "serious" bikeshedding, but I have a comment
> or two on the example:
> 
>  >     from email.message import MIMEMessage
>  >     from email.headerregistry import Address
>  >     fullmsg = MIMEMessage()
>  >     fullmsg['To'] = Address('FoÅ‘ Bar', 'fbar at example.com')
>  >     fullmsg['From'] = "mÃ©Â¨ <me at example.com>"
>  >     fullmsg['Subject'] = "j'ai un problÃ©me de python."
> 
> This is very nice!  *I* *love* it.
> 
> But (sorry!) I worry that it's not obvious to "naive" users.  Maybe it
> would be useful to have a Message() factory which has one semantic
> difference from MIMEMessage: it "requires" RFC 822-required headers
> (or optionally RFC 1036 for news).  Eg:
> 
>     # This message will be posted and mailed
>     # These would conform to the latest Draft Standards
>     # and be DKIM-signed
>     fullmsg = Message('rfc822', 'rfc1036', 'dmarc')
> 
> I'm not sure how "required" would be implemented (perhaps through a
> .validate() method).  So the signature of the API suggested above is
> Message(*validators, **kw).

Adding new constructor arguments to the existing Message class is
possible.  However, given the new architecture, the more logical way
to do this is to put it in the policy.  So currently the idea would be
for this to be spelled like this:

    fullmsg = Message(policy=policy.SMTP+policy.strict)

Then what would happen is that when the message is serialized (be it
via str(), bytes(), by passing it to smtplib.sendmail or
smtplib.sendmessage, or by an explicit call to a Generator), an
error would be raised if the minimum required headers are not
present.

As I said in an earlier message, currently there's no extensibility
mechanism for the validation.  If the parser recognizes a defect, whether
or not an error is raised is controlled by the policy.  But there's
no mechanism for adding new defect checking that the parser doesn't
already know about, or for issues that are not parse-time defects.
(There is currently one non-parsing defect for which there is a custom
control: the maximum number of headers of a given type that are allowed
to be added to a Message object.)

So we need some way to add additional constraints as well.  Probably a
list of validation functions that take a Message/MIMEPart as the
argument and do a raise if they want to reject the message.

The tricky bit is that currently raise_on_defect means you get an error
as soon as a (parsing) defect is discovered.  Likewise, if max_count
is being enforced for headers, the error is raised as soon as the
duplicate header is added.

Generating errors early when building messages was one of or original
design goals, and *only* detecting problems via validators runs counter
to that unless all the validators are called every time an operation
is performed that modifies a message.  Maybe that would be OK, but it
feels ugly.

For the missing header problem, the custom solution could be to add a
'headers' argument to Message that would allow you to write:

     fullmsg = Message(header=(
                        Header('Date', email.utils.localtime()),
                        Header('To', Address('Fred', 'abc at xyz.com')),
                        Header('From', Address('Sally, 'fgd at xyz.com')),
                        Header('Subject', 'Foo'),
                        ),
                    policy=policy.SMTP+policy.Strict)

This call could then immediately raise an error if not all of the
required headers are present.  (Header is unfortunately not a good
choice of name here because we already have a 'Header' class that has a
different API).

Aside: I could also imagine adding a 'content' argument that would let
you generate a simple text message via a single call...which means you
could also extend this model to specifying the entire message in a single
call, if you wrote a suitable content manager function for tuples:

     fullmsg = Message(
                policy=policy.SMTP+policy.Strict,
                header=(
                   Header('Date', datetime.datetime.now()),
                   Header('To', Address('Fred', 'abc at xyz.com')),
                   Header('From', Address('Sally, 'fgd at xyz.com')),
                   Header('Subject', 'Foo'),
                   ),
                content=(
                        (
                            'This is the text part',
                            (
                              '<p>Here is the html</p><img src="image1" \>',
                              {'image1': b'image data'},
                              ),
                        ),
                    b'attachment data',
                    )

But that is probably a little bit crazy...easier to just write a custom
function for your application that calls the message building APIs.

Well, anyway, coming back from that little tangent, it seems to me in
summary that if we want to raise errors as soon as possible, we need
to add custom mechanisms for the error detection as we figure out each
class of error we want to handle, so that the validation gets done right
away without adding too much overhead.  For parsing defects, that means
if you want to control which ones are raised you hook handle_defect
and make your decision there (there's no way to *add* parsing defects
via that hook).  For duplicate headers, you hook header_max_count (or
set max_count on your custom header classes).  For missing headers we
*could* introduce something like the above headers argument to Message,
which would be required in 'strict' mode.

But clearly we should also have a general "validation function" list on
the policy, where all the functions would be called before serialization
and have an opportunity to reject the message by raising errors.
That would provide a generalized, if heavier handed, hook to use when
there is no specific hook that provides extendability.

For missing headers I'm inclined to use a serialization-time check
rather than the Message constructor check I speculated about above.
My logic is that the message isn't complete until you are ready to send
it, so generating an error earlier probably isn't wanted in many cases,
nor will generating it at serialization time loose you much debugging
information.  So I'm inclined to implement it as a default validator in
the proposed list of serialization time validators.

It is still probably worthwhile providing a utility function for creating
a message in a single call.  I think there's even an open issue in the
tracker for that.  But I'm inclined to postpone that until 3.5.

> For MIMEMessage, I think I prefer the name "MIMEPart".  To naive
> users, the idea of MIMEMessages containing MIMEMessages is a bit
> disconcerting, except in the case of multipart/digest, I think.

Or message/rfc822:

    [...]
    try:
        smtplib.sendmessage(msg)
    except Exception as exc:
        errormsg = MIMEMEssage()
        errormsg['To'] = msg['sender'] or msg['from']
        errormsg['From'] = 'robot at mydomain.org'
        errormsg.set_content("I'm sorry, sending failed: {}".format(exc))
        errormsg.add_attachment(orig_msg, disposition='inline')
        smtplib.sendmessage(errormsg)

(Terrible code, but you get the idea :)

>  >     fullmsg.set_content("et la il est montÃ© sur moi et il commence"
>  >                        " a m'ÃƒÂ©touffer.")
>  >     htmlmsg = MIMEMessage()
>  >     htmlmsg.set_content("<p>et la il est montÃ© sur moi et il commence"
>  >                         " a m'Ã©touffer.</p><img src='image1' />",
>  >                         subtype='html')
> 
> I think I'd like to express the suite above as
> 
>     fullmsg.payload.add_alternative(...)
>     fullmsg.payload.add_alternative(..., subtype='html')
> 
> This would automatically convert the MIME type of fullmsg to
> 'multipart/alternative', and .payload to a list where necessary.
> .set_content() would be available but it's "dangerous" (it could
> replace an arbitrary multipart -- this would be useful operation to
> replace it with a textual URL or external-body part).

Having an attribute with methods that affect the parent object is not a
natural act in Python.  (Is it in OO in general?)  But aside from that,
I'm not sure I see the point of a two level name here.  You are still
talking about mutating the message model object, which is exactly what
is going on in my example: if you call add_alternative on a text part,
it gets turned into a multipart/alternative with the first part being
the text part and the second part being the thing you just added.
That's in the docs but I didn't refer to it explicitly in my example.
Does that make things clearer?

It is a good point about unintentionally overwriting the existing
contents.  That would be simple to fix: make it an error to call
set_content if payload is not None.  Possibly a 'clear' method would
then be useful, especially since so many other Python datatypes have one.

> Aside: it occurs to me that the .payload attribute (and other such
> attributes) could be avoided by the device of using names prefixed by
> ":" such as ":payload" as keys: "fullmsg[':payload']" since colon is
> illegal in field names (cf RFC 5322).  Probably I've just been writing
> too much Common Lisp, though.<wink/>

I think maybe so :).  I'd rather write 'msg.payload' than 'msg[:payload'].
So I don't have a motivation to "avoid" those attributes.  (Nor would
I find msg[':payload'].sommethod mutating the parent object any less
surprising.)

> I'm not sure whether "payload" is a better name than "content" for
> that attribute.
> 
> Now the suite
> 
>  >     with open('python.jpg', 'rb') as python:
>  >         htmlmsg.add_related(python.read(), 'image', 'jpg', cid='image1'
>  >                             disposition='inline')
>  >     fullmsg.make_alternative()
>  >     fullmsg.attach(htmlmsg)
> 
> becomes just
> 
>     with open('python.jpg', 'rb') as python:
>         fullmsg.payload['text/html'].add_related(...)

This doesn't work, though, because you could (although you usually
won't) have more than one 'text/html' part in a single multipart.

The reason behind the structure of my example is to avoid having
to think about indexing into already existing parts.  It *could*
have been written:

    fullmsg.add_alternative(...)
    fullmsg.add_alternative(...., subtype='html')
    with open('python.jpg', 'rb') as python:
        fullmsg.get_payload()[1].add_related(...)

But that means you have to think about payload lists and order of
parts and figure out the right index to use.  But perhaps some people
will find that more congenial than the make_alternative() dance, and I
should probably make sure that I include that alternative in the examples
(that I haven't yet written for the docs).  (I dislike the 'get_payload'
API, by the way.)

The awkwardness of the code in my example arises from the fact that
a Message object is a valid thing to want to use as content (see
message/rfc822 above), so I can't have the code just assume that:

    fullmsg.add_alternative(htmlmsg)

means "make fullmsg into a multipart/alternative and attach htmlmsg",
since in the logic of my API what it naturally means is "make_fullmsg
into a multipart/alternative and attach a new message/rfc822 part
containing htmlmsg".  Thus the make/attach dance instead.

However, what I really want, as I mentioned in my original proposal but
have dropped from my 3.4 proposed code addition, is:

    with open('python.jpg', 'rb') as python:
        fullmsg.add_alternative(
            Webpage('<p>example<\p><img src=image1 \>',
                    {'image1': python.read()}
                   ))

I dropped it from the code proposal because it seems to me that we
should give the community an opportunity to experiment with the content
manager interface before we decide what the best stuff is to include in
the stdlib.

> At this point, "fullmsg.add_related()" without the .payload attribute
> would be an error, unless a "insertPart=True" keyword argument were
> present.  With "insertPart=True", a new top-level multipart/related
> would be interposed with the existing multipart/alternative as its
> first child, and the argument of add_related as the second.  Maybe
> that's too complicated, but I suspect it's harder for people who think
> of MIME messages as trees, than for people who think of messages as
> documents and don't want to hear about mimes other than Marcel
> Marceau.<wink/>

I *think* I see what you are suggesting, but I don't see that it is
easier for the non-tree thinker than my proposal.  My proposal doesn't
even let you make the conversion your insertPart=True would produce.
(That could indeed be a bug, so we may want a "I know what I'm doing"
keyword for 'make_related' and friends).

> The indexing of the .payload attribute by part type is perhaps too
> smart for my own good, haven't thought carefully about it.  It's
> plausible, though, since a message with multiple parts of the same
> type can only have one displayed -- normally that shouldn't happen.
> OTOH, this wouldn't work without modification for multipart/mixed or
> multipart/related.  Could use Yet Another Keyword Argument, maybe.

Ah, good point.  But the mechanism doesn't generalize to other types (eg:
you can have any number of 'image/jpg'), which would make the API quite
inconsistent (you can index by 'text/html' but not by 'image/jpeg'? Wat?)

> (BTW, it's really annoying when the text/plain part refers to images
> etc that are attached only to the text/html part.  AFAICT from RFC
> 2387 it ought to be possible to put the multipart/related part at the
> top so both text/html and text/plain can refer to it.)

Is there a text part format that can actually refer to related parts?
text/plain isn't one, as far as I know.  As I said in another message, it
is worth thinking about supporting more unusual structures.  The tradeoff
is not getting an error when you do something that is in most cases
a mistake.  Absent information about a reasonably common text format
that actually embeds images, I'm inclined to keep the restriction.
We can always remove it later, but we can't add it back later.

>  >     with open('police-report.txt') as report:
>  >         fullmsg.add_attachment(report.read(), filename='pÃƒÂ¶lice-report.txt',
>  >                                params=dict(wrap='flow'), headers=(
>  >                                     'X-Secret-Level: top',
>  >                                     'X-Authorization: Monty'))
> 
> I can't find an RFC that specifies a "wrap" parameter to text/plain.
> Do you mean RFC 2646 'format="flowed"' here?

Yes, I did.  I just didn't bother to look it up (my apologies), since the
point of the example was that you could add arbitrary extra parameters.

> (A "validate" method could raise a warning on unregistered parameters.)

Yes, which if we want it to happen at the time the parameter is set, means
we probably want a custom (extendible) mechanism for this in the policy.

>  > (Hmm.  Looking at that I see I didn't fully fix a bug I had meant to fix:
>  > some of the parts have a MIME-Version header that don't need it.)
> 
> Another reason why the top-level part should be treated differently in
> the API.

True.  And if we add MIMEPart but keep MIMEMessage as the top level part
rather than re-using the existing Message, we can make the MIMEMessage
policy strict by default.

--David