[Email-SIG] API for Header objects [was: Dropping bytes "support" in json]

Thu Apr 16 15:02:13 CEST 2009

On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:

> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
> at all, so there would be no loss of consistency.

That's ... different.

> Messages need 
> flattening to bytes, but there is no use for converting individual
> header fields into bytes or strings, outside of a message.

Of course there is. You create each header individually, so you should 
be able to extract each header individually. Here, for example, is a 
use-case: I want to send postmaster a copy of the X-Spam-Evidence 
header so she can see why a particular piece of ham got wrongly flagged 
as spam, or vice versa:

X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;
  'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split':
  0.05; ... 

I need to be able to extract just that one header, and while some 
applications (mail client?) may choose to give me the entire message as 
text and expect me to manually hunt for the relevant line and 
copy-and-paste it, other applications may wish to automatically extract 
the appropriate header and email it to postmaster@localhost. Or write
it to a log file, or whatever. Whatever they do, they probably need it 
as a string (of characters or bytes), not a binary blob.
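A rough sketch of that use-case, using the standard library email
package as it exists in current Python 3 (not any API proposed in this
thread). The message text and addresses are made up for illustration:

```python
from email import message_from_string

# Hypothetical message; only the X-Spam-Evidence header matters here.
raw = (
    "From: user@example.com\n"
    "To: someone@example.com\n"
    "Subject: test\n"
    "X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;\n"
    "  'attribute': 0.04; 'objects': 0.04\n"
    "\n"
    "Body text.\n"
)

msg = message_from_string(raw)

# Pull out just the one header, as a string of characters.
evidence = msg["X-Spam-Evidence"]

# Collapse the folded continuation lines into a single line.
unfolded = " ".join(evidence.split())

# The same header value as a string of bytes, e.g. for a log file.
evidence_bytes = unfolded.encode("ascii")
```

Either form (unfolded or evidence_bytes) can then be mailed or logged
without touching the rest of the message.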

> Some 
> header field data /is/ strings, some is lists of address pairs, and
> so on.

But "lists of address pairs" themselves are strings.

> If the data for a header field is not properly a string, 

But it always is. 

Even badly formatted emails with corrupt headers containing binary 
characters are strings -- they're just byte (non-Unicode) strings 
containing binary characters. Your mail server might not accept it as 
part of a valid header, but it's a valid byte string.

> a  
> means to get it as one is wrong.

Email *is* text. It's built on top of a restricted range of ASCII bytes, 
which we can legitimately call "text" because it is a subset of Unicode 
text. Even if a particular header contains binary data, it must be 
encoded as ASCII text before it can be placed into the header.

X-Some-Header: \0\0\01\0\xff3G\04 

(where \0 means byte('\0') etc) is not a valid email header -- the 
binary data must be encoded as ASCII text first. So any valid header 
must have a bytes form and a Unicode form (since the restricted range
of allowed bytes is always valid Unicode as well). Corrupted headers
may not have a valid Unicode form, but they will always have a byte 
form -- after all, the header eventually must be written to disk in 
some mail box somewhere, and it can only do so as bytes. 

So for any header, there is always a way of writing it as bytes, and
nearly always a way of writing it as characters. There is a valid text
version of any valid header.
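To make the bytes/characters distinction concrete, here is a small
sketch in current Python 3. The helper name as_text is hypothetical,
and the corrupt header is the invalid example from above:

```python
import base64

valid = b"X-Some-Header: AB34F8702D6"
corrupt = b"X-Some-Header: \x00\x00\x01\x00\xff3G\x04"


def as_text(header_bytes):
    """Return the header as characters, or None if it is not ASCII."""
    try:
        return header_bytes.decode("ascii")
    except UnicodeDecodeError:
        return None


# A valid header has both a bytes form and a character form.
valid_text = as_text(valid)

# The corrupt header still exists as bytes, but has no ASCII text form.
corrupt_text = as_text(corrupt)

# Binary data becomes legal header content once encoded as ASCII text
# (base64 is used here only as one possible encoding).
blob = b"\x00\x00\x01\x00\xff3G\x04"
encoded = base64.b64encode(blob).decode("ascii")
fixed_text = as_text(b"X-Some-Header: " + encoded.encode("ascii"))
```

Here valid_text and fixed_text are ordinary strings, while corrupt_text
is None even though the corrupt bytes can still be written to disk.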

Furthermore, in general for arbitrary headers, we can't tell what the 
header means *except* as a text string:

X-Some-Header: AB34F8702D6

We have no way of telling whether the payload "AB34F8702D6" is a string 
of characters meaningful to some application just as they are, or 
whether it is a string encoded from binary data. We might *guess* that 
the encoding *could be* some known encoding (quoted-printable, base64, 
etc) but we can't tell unless it is a known standard header.
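The ambiguity is easy to demonstrate. In this sketch the payload is a
hypothetical variant of the one above, extended by one character so
that it happens to be well-formed under more than one encoding:

```python
import base64

# Hypothetical payload: "AB34F8702D6" plus a trailing "E".
payload = "AB34F8702D6E"

# Reading 1: the characters are meaningful exactly as they are.
as_is = payload

# Reading 2: the characters are hexadecimal digits for binary data.
from_hex = bytes.fromhex(payload)

# Reading 3: the characters are base64 for different binary data.
from_b64 = base64.b64decode(payload, validate=True)

# All three readings succeed, and nothing in the header says which
# one the sender intended.
readings = {"text": as_is, "hex": from_hex, "base64": from_b64}
```

All three readings succeed without error and give different results,
so the header alone cannot tell us which one was meant.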

> I want to be able to get and put
> the proper type of data for a particular header field, and to be told
> when I did it wrong, rather than just get a corrupt message.

But in general, you can't know what the "proper type of data" is for 
arbitrary headers.

What are valid data for X-policyd-weight headers? What about 
X-Some-Random-Header-I-Just-Made-Up?

-- 
Steven D'Aprano