[Email-SIG] API for Header objects [was: Dropping bytes "support" in json]

Thu Apr 16 21:42:07 CEST 2009

On Thu, 16 Apr 2009 at 14:08, Tony Nelson wrote:
> At 23:02 +1000 04/16/2009, Steven D'Aprano wrote:
>> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
>>
>>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
>>> at all, so there would be no loss of consistency.
>>
>> That's ... different.

Indeed.

>>> Messages need
>>> flattening to bytes, but there is no use for converting individual
>>> header fields into bytes or strings, outside of a message.
>>
>> Of course there is. You create each header individually, so you should
>> be able to extract each header individually. Here, for example, is a
>> use-case: I want to send postmaster a copy of the X-Spam-Evidence
>> header so she can see why a particular piece of ham got wrongly flagged
>> as spam, or vice versa:
>>
>> X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;
>>  'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split':
>>  0.05; ...
>>
>> I need to be able to extract just that one header, and while some
>> applications (mail client?) may choose to give me the entire message as
>> text and expect me to manually hunt for the relevant line and
>> copy-and-paste it, other applications may wish to automatically extract
>> the appropriate header and email it to postmaster@localhost. Or write
>> it to a log file, or whatever. Whatever they do, they probably need it
>> as a string (of characters or bytes), not a binary blob.
>
> This example seems tortured and contrived.  Custom code to extract a single
> header one time to send to someone?  Just hit "reply" and trim it yourself.
> If you must, you can use .get_header('X-Spam-Evidence').flatten().  I doubt
> that anyone would actually do that, outside of a debugging session.
>
> Any automatic process for sending reflected spam should include more of the
> message, using the relevant MIME type message/partial (or message/rfc822).

Have you written a user interface using the email package?  I have.
In that user interface, I most definitely want to turn individual headers
into strings.  Specifically, this is a usenet news reader, and when
presenting messages I want to display _only_ the Date and From headers.
You will note that 'From' is an address header, and in this particular
use case I want to use "str(message['From'])", and I don't care two
hoots that the thing is properly a list of friendly-name address pairs.

That is not a contrived example, that's _production code_ that I
use every day.
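
For illustration, a minimal sketch of that kind of display code.  It is
written against the email package as it ships in current Python 3, which
is an assumption: the Header API discussed in this thread had not been
designed yet.  The sample message and the summary_lines() helper are
hypothetical.

```python
from email import message_from_string

# Hypothetical sample article; only Date and From will be displayed.
RAW = """\
From: John Smith <john@foo.com>
Date: Thu, 16 Apr 2009 21:42:07 +0200
Subject: API for Header objects
Newsgroups: comp.lang.python

Body text.
"""


def summary_lines(raw_text):
    """Return 'Name: value' display lines for the Date and From headers."""
    msg = message_from_string(raw_text)
    # str() is all the UI needs here; the address structure is ignored.
    return ["%s: %s" % (name, str(msg[name])) for name in ("Date", "From")]


lines = summary_lines(RAW)
print("\n".join(lines))
```

The point is only that the UI layer asks for a string and gets one,
without caring how the header is structured internally.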

Nor is the quoted example all that contrived.  After reading it I was
considering whether it would be useful to run a program over my incoming
mail to extract the X-Spam-Evidence headers and a couple of other headers
and email them to me in a daily report.  It's not useful enough that I'll
write the code (I've too many other priorities), but it's potentially
useful enough (for tuning my spam filters) that I don't consider it a
contrived use case.  And if the spam gets worse I may just come back
to that idea.
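
A rough sketch of what such a report script could look like, again
assuming the email package in current Python 3.  The spam_report()
helper, the choice of headers, and the sample message are hypothetical;
a real script would read the messages from a mailbox.

```python
from email import message_from_string


def spam_report(raw_messages, names=("Subject", "X-Spam-Evidence")):
    """Collect the named headers from each message as one text report."""
    report = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        for name in names:
            value = msg[name]
            if value is not None:
                # Collapse folded continuation lines into a single line.
                report.append("%s: %s" % (name, " ".join(str(value).split())))
    return "\n".join(report)


# Hypothetical message with a folded X-Spam-Evidence header.
RAW = (
    "Subject: hello\n"
    "X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00;\n"
    " 'attribute': 0.04; 'objects': 0.04\n"
    "\n"
    "body\n"
)

report = spam_report([RAW])
print(report)
```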

>>> Some
>>> header field data /is/ strings, some is lists of address pairs, and
>>> so on.
>>
>> But "lists of address pairs" themselves are strings.
>
> Wrong!  They are *lists* (or at least sequences) of address pairs of
> friendly name, email address.  Just as bytes are not strings, and dicts are
> not strings, and JPEG images are not strings, lists are not strings.  For
> a better understanding of what an Address is, see RFC 5322 (the current
> incarnation of RFC x822), section 3.4, which describes both the best way
> and current or obsolete practice.

I suspect that most or all of us do understand the RFC.

When Steve says 'but lists of address pairs are themselves strings' I hear
him saying that each element of the pair is a string.  I think you would
have to agree with that.  Unless you want them to remain as byte strings?
Or, as I would prefer, make them into Address objects with appropriate
methods and an appropriate str.  But even then, the friendly name and
address data elements of the Address should be unicode strings.
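
As a point of comparison, the standard library in current Python 3 has
an Address class in email.headerregistry that behaves roughly this way.
Whether it matches the Address object proposed here is an assumption;
the sketch only shows string-valued parts plus a str() form.

```python
from email.headerregistry import Address

# Structured address: the parts are ordinary (unicode) strings.
addr = Address(display_name="John Smith", username="john", domain="foo.com")

friendly_name = addr.display_name   # the friendly name, a str
mailbox = addr.addr_spec            # username and domain joined with "@"
as_text = str(addr)                 # display form of the whole address

print(friendly_name, mailbox, as_text, sep="\n")
```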

>>> If the data for a header field is not properly a string,
>>
>> But it always is.
>
> No.  This is important, and you will not understand RFC x822 email until
> you understand this:  email messages are not character strings.  They are
> byte sequences.  This confusion pervades the email package only because in
> Python before 3.x, bytes were represented as strings.

A header always has a string representation, though.  It's the one a
dumb-text UI would present to the user.  IMO the email package needs to
support building such UIs.  The string representation is also useful
for debugging (as is the bytes representation).  I see no reason
it should not be accessible through the normal Python 'str' method.
Why obfuscate access to it?

>> Even badly formatted emails with corrupt headers containing binary
>> characters are strings -- they're just byte (non-Unicode) strings
>> containing binary characters. Your mail server might not accept it as
>> part of a valid header, but it's a valid byte string.
>
> Strings are not bytes.  Sequences of bytes are not strings.  Converting
> between them demands an encoding.  Sometimes the encoding exists, sometimes
> it mostly exists, and sometimes there is no such encoding, as for a JPEG
> image, which is a structured byte sequence.

I agree with you that Unicode strings are not bytes, and that email is
encoded as (ASCII) bytes.
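
The distinction is easy to see in Python 3.  This small sketch uses a
made-up address and the first four bytes of a typical JPEG file; both
are illustrative values only.

```python
# Text becomes bytes only through an explicit encoding.
text = "Jos\u00e9 <jose@example.com>"
wire = text.encode("utf-8")
assert wire.decode("utf-8") == text  # round trip with the same encoding

# Arbitrary binary data, such as the start of a JPEG, has no such encoding.
jpeg_magic = b"\xff\xd8\xff\xe0"
try:
    jpeg_magic.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(wire, decodable)
```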

As for the JPEG, sure, there's no encoding in the Unicode sense.  There
certainly is an encoding, though: the JPEG data wrapped up in the
appropriate MIME type and transfer encoding.
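
A sketch of that wrapping, assuming the EmailMessage API from current
Python 3.  The byte string below is a stand-in, not a real image.

```python
from email.message import EmailMessage

# Stand-in for image data: JPEG-like leading bytes plus arbitrary binary.
fake_jpeg = b"\xff\xd8\xff\xe0" + bytes(range(16))

msg = EmailMessage()
# Binary content is base64-encoded by default when set this way.
msg.set_content(fake_jpeg, maintype="image", subtype="jpeg")

wire = msg.as_bytes()          # what would go out on the wire
round_trip = msg.get_content() # decoded back to the original bytes

print(msg["Content-Type"], msg["Content-Transfer-Encoding"])
```

The serialized message is plain ASCII even though the payload is not
text in any character-set sense.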

>>> a means to get it as one is wrong.

IMO it is always appropriate to be able to get a header body as a string.
It may not be a meaningful format in which to _manipulate_ the header
body information (which is why I think message's __getitem__ needs
to return a Header object), but it is a legitimate representation for
user consumption.

>> Email *is* text. It's built on top of a restricted range of ASCII bytes,
>> which we can legitimately call "text" because it is a subset of Unicode
>> text. Even if a particular header contains binary data, it must be
>> encoded as ASCII text before it can be placed into the header.
> ...
>
> No, email is not text.  Email message bodies and some header fields may
> represent text.  An email message is a byte sequence.  One really needs to
> understand this in order to work with email at a low level.  When one does
> not understand, then the email package should lead the user in the right
> direction.

You and Steve are defining terms differently here, I think, but other
than that I suspect you are not that far apart on this particular point.

What I want the email package to do is make it easy to pass text in
and have the email package create the syntactically correct bytes
representation to go out on the wire.  I'm visualizing building the
'From' header, for example, something like this:

     message['From'] = AddressHeader(Address('John Smith', 'john@foo.com'))

and have it default to UTF-8 encoding... or maybe the encoding gets
specified when I say message.serialize('utf-8').  But as I said, I
haven't actually written code that builds messages yet.
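
For comparison only, here is roughly how the same idea reads with the
email package in current Python 3.  This is not the AddressHeader or
serialize() API imagined above (those names are from this message and
are not used below); it is a sketch using EmailMessage and
email.headerregistry.Address, with made-up addresses.

```python
from email.message import EmailMessage
from email.headerregistry import Address

# Pass structured text in; let the package produce the wire bytes.
msg = EmailMessage()
msg["From"] = Address("John Smith", "john", "foo.com")
msg["Subject"] = "API for Header objects"
msg.set_content("Hello.\n")
wire = bytes(msg)

# A non-ASCII friendly name is encoded so the header stays ASCII on the wire.
intl = EmailMessage()
intl["From"] = Address("Jos\u00e9 Smith", "jose", "foo.com")
intl.set_content("Hello.\n")
intl_wire = bytes(intl)

print(wire.decode("ascii"))
```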

Note that while I want to be able to do str(someHeader) to get a
string representation of a header body, I'm not so enamored of being
able to do

     message['From'] = 'John Smith <john@foo.com>'

and have it get turned into a Header or AddressHeader object.
Frankly, that looks too magical to me.

--David