[Email-SIG] API for Header objects [was: Dropping bytes "support" in json]
Steven D'Aprano
steve at pearwood.info
Thu Apr 16 15:02:13 CEST 2009
On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
> at all, so there would be no loss of consistency.
That's ... different.
> Messages need
> flattening to bytes, but there is no use for converting individual
> header fields into bytes or strings, outside of a message.
Of course there is. You create each header individually, so you should
be able to extract each header individually. Here, for example, is a
use-case: I want to send postmaster a copy of the X-Spam-Evidence
header so she can see why a particular piece of ham got wrongly flagged
as spam, or vice versa:
    X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;
        'attribute': 0.04; 'objects': 0.04; 'returns': 0.05; 'split':
        0.05; ...
I need to be able to extract just that one header, and while some
applications (mail client?) may choose to give me the entire message as
text and expect me to manually hunt for the relevant line and
copy-and-paste it, other applications may wish to automatically extract
the appropriate header and email it to postmaster at localhost. Or write
it to a log file, or whatever. Whatever they do, they probably need it
as a string (of characters or bytes), not a binary blob.
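To make that use-case concrete, here is a rough sketch using the parser
in Python's standard email package. The message text, the addresses and
the helper name extract_header are made up for illustration; this is
not a proposal for what the new API should look like.

```python
import email

# Hypothetical message, invented purely for illustration.
RAW = (
    "From: someone@example.com\n"
    "To: me@example.com\n"
    "Subject: test\n"
    "X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; '(which': 0.03;\n"
    " 'attribute': 0.04; 'objects': 0.04\n"
    "\n"
    "Body text.\n"
)


def extract_header(raw_message, name):
    """Return one header's value as a single unfolded string, or None."""
    msg = email.message_from_string(raw_message)
    value = msg[name]  # None if the header is absent
    if value is None:
        return None
    # Collapse the folding whitespace of continuation lines.
    return " ".join(value.split())


evidence = extract_header(RAW, "X-Spam-Evidence")
print(evidence)
```

The result is an ordinary string, ready to be mailed to postmaster or
written to a log file.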
> Some
> header field data /is/ strings, some is lists of address pairs, and
> so on.
But "lists of address pairs" themselves are strings.
> If the data for a header field is not properly a string,
But it always is.
Even badly formatted emails with corrupt headers containing binary
characters are strings -- they're just byte (non-Unicode) strings
containing binary characters. Your mail server might not accept it as
part of a valid header, but it's a valid byte string.
> a
> means to get it as one is wrong.
Email *is* text. It's built on top of a restricted range of ASCII bytes,
which we can legitimately call "text" because it is a subset of Unicode
text. Even if a particular header contains binary data, it must be
encoded as ASCII text before it can be placed into the header.
X-Some-Header: \0\0\01\0\xff3G\04
(where \0 means byte('\0') etc) is not a valid email header -- the
binary data must be encoded as ASCII text first. So any valid header
must have a bytes form and a Unicode form (since the restricted range
of allowed bytes is always valid Unicode as well). Corrupted headers
may not have a valid Unicode form, but they will always have a byte
form -- after all, the header eventually must be written to disk in
some mail box somewhere, and it can only do so as bytes.
So for any header, there is always a way of writing it in bytes, and
nearly always a way of writing it as characters. In particular, there is
a valid text version of any valid header.
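A small sketch of that distinction, assuming we treat "has a text form"
as "decodes as ASCII". The function name header_forms is hypothetical,
and the corrupt example reuses the bytes from the invalid header shown
earlier.

```python
def header_forms(raw_line):
    """Return (bytes_form, text_form); text_form is None if not ASCII."""
    try:
        text = raw_line.decode("ascii")
    except UnicodeDecodeError:
        # Corrupt header: it still exists as bytes, but has no text form.
        text = None
    return raw_line, text


valid = b"X-Some-Header: AB34F8702D6"
corrupt = b"X-Some-Header: \x00\x00\x01\x00\xff3G\x04"

valid_bytes, valid_text = header_forms(valid)
corrupt_bytes, corrupt_text = header_forms(corrupt)
```

Note that it is the \xff byte that makes the second header undecodable;
NUL and the other control bytes are technically ASCII, even though they
are not allowed in a header.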
Furthermore, in general for arbitrary headers, we can't tell what the
header means *except* as a text string:
X-Some-Header: AB34F8702D6
We have no way of telling whether the payload "AB34F8702D6" is a string
of characters meaningful to some application just as they are, or
whether it is a string encoded from binary data. We might *guess* that
the encoding *could be* some known encoding (quoted-printable, base64,
etc) but we can't tell unless it is a known standard header.
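To illustrate the ambiguity, here is a sketch that tries a few decodings
on a made-up payload. The payload "DEADBEEF" and the function name
plausible_readings are hypothetical; the point is only that more than
one reading can succeed, and nothing in the header says which was meant.

```python
import base64
import binascii


def plausible_readings(payload):
    """Return every interpretation of payload that decodes cleanly."""
    readings = {"text": payload.encode("ascii")}  # taken literally
    try:
        readings["hex"] = bytes.fromhex(payload)
    except ValueError:
        pass  # not valid hexadecimal
    try:
        readings["base64"] = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        pass  # not valid base64
    return readings


readings = plausible_readings("DEADBEEF")
for name, value in readings.items():
    print(name, value)
```

All three readings succeed here and give different bytes, so the only
thing we can say for certain about the payload is what it is as text.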
> I want to be able to get and put
> the proper type of data for a particular header field, and to be told
> when I did it wrong, rather than just get a corrupt message.
But in general, you can't know what the "proper type of data" is for
arbitrary headers.
What are valid data for X-policyd-weight headers? What about
X-Some-Random-Header-I-Just-Made-Up?
--
Steven D'Aprano