[Email-SIG] API for Header objects [was: Dropping bytes "support" in json]

Thu Apr 16 22:44:14 CEST 2009

On approximately 4/16/2009 6:02 AM, came the following characters from 
the keyboard of Steven D'Aprano:
> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
>   
>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
>> at all, so there would be no loss of consistency.
>>     
>
> That's ... different.
>   
>> If the data for a header field is not properly a string, 
>>     
> But it always is. 
>
> Even badly formatted emails with corrupt headers containing binary 
> characters are strings -- they're just byte (non-Unicode) strings 
> containing binary characters. Your mail server might not accept it as 
> part of a valid header, but it's a valid byte string.
>   

Wire format email headers are composed of a subset of ASCII text.  There 
should be a way to obtain them, either as bytes, or via the trivial str 
conversion of those bytes to Unicode.  Even corrupt headers containing 
binary characters should be obtainable that way.  There are no header 
encoding or decoding algorithms that cannot be reworked to function 
properly on either the raw_bytes or raw_str version of a header, since 
the numeric values and sequence of all binary octets would be preserved 
via both raw_bytes and raw_str.  *The key is to know what is in hand.*  
For both raw_bytes and raw_str, all characters would be in the range 0 - 
0xFF.  This is simple transliteration, not interpretation or parsing.  A 
non-corrupt header would have a smaller range, roughly 0x20 - 0x7E (0x7F 
is the DEL control character).  Any header 
should be obtainable or settable in this form, using either bytes or str 
parameters/results.  Yes, it should be possible to create corrupt 
headers in this manner.  That is useful mostly for testing, or for 
passing a message through unchanged (idempotency, which I also call 
GIGO).
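
To make the transliteration concrete, here is a minimal sketch, assuming 
Python 3 and assuming that Latin-1 is the mapping meant by "trivial str 
conversion" (it maps each octet to the code point with the same numeric 
value).  The function names are hypothetical and are not part of the 
existing email package.

```python
# Hypothetical helpers illustrating the raw_bytes <-> raw_str idea.
# Assumption: "trivial conversion" means Latin-1, where octet N maps to
# code point N for every N in 0..0xFF.

def raw_bytes_to_raw_str(raw: bytes) -> str:
    """Transliterate octets to code points of the same value."""
    return raw.decode("latin-1")


def raw_str_to_raw_bytes(raw: str) -> bytes:
    """Inverse mapping; raises UnicodeEncodeError above U+00FF."""
    return raw.encode("latin-1")


# A deliberately corrupt header containing binary octets.
corrupt = b"Subject: caf\xe9 \xff\x00 test"
as_str = raw_bytes_to_raw_str(corrupt)

# Numeric values and sequence are preserved in both directions.
assert [ord(c) for c in as_str] == list(corrupt)
assert raw_str_to_raw_bytes(as_str) == corrupt
```

Nothing here interprets the header; an RFC 2047 decoder could be written 
against either form because both carry the same octet values.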

However, obtaining headers in that way should be "hard", but only in the 
sense of having to type more, because it is part of a lower-level 
interface, not the primary APIs... like  msg['tag'].raw_bytes or 
msg['tag'].raw_str.  Implementation-wise it is actually the easiest way 
to obtain a copy of the data... but that copy may not be as useful as 
one might like.

str(msg['tag'])  or  msg['tag'].str   (or some such spelling[s]) should 
always produce a displayable form of the header.  If it is a known, 
standardized header that may contain data that was encoded for 
transmission, such encodings should be reversed, and Unicode characters 
outside the range of U+0020 - U+007F may be included.  Remember the goal 
here is "displayable".  So if the encoding is bad for a standard header, 
or a standard header is corrupt, or a non-standard header contains what 
is apparently binary gibberish, and non-displayable Unicode control 
characters are generated, each one should be escaped as 7 ASCII 
characters representing its Unicode code point, for example "\U+0017".  
All such display strings 
must always have "\" converted to "\\" so that there is no ambiguity 
when interpreting strings that may contain text that looks like one of 
the escape strings.
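
The escaping rule can be sketched as follows.  This is only an 
illustration under assumptions: the function name `displayable` is 
hypothetical, and "non-displayable control characters" is taken to mean 
Unicode general category Cc (U+0000 - U+001F and U+007F - U+009F), so 
that four hex digits always suffice and every escape is exactly 7 
characters long.

```python
import unicodedata


def displayable(text: str) -> str:
    """Return a display form of an already-decoded header value.

    Assumption: only category "Cc" (control) characters are escaped.
    """
    out = []
    for ch in text:
        if ch == "\\":
            # Double every backslash first so that literal text which
            # looks like an escape cannot be confused with a real one.
            out.append("\\\\")
        elif unicodedata.category(ch) == "Cc":
            # Backslash, "U+", and four hex digits: 7 ASCII characters.
            out.append("\\U+%04X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


print(displayable("ok\x17 caf\xe9"))
print(displayable("literal \\U+0017"))
```

Because backslashes are doubled, a header whose text literally contains 
\U+0017 displays with two leading backslashes, while a real control 
character displays with one.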

Known standard headers should have additional APIs (these already exist 
for the most useful ones) to obtain the interesting subcomponents 
(encodings, names, addresses, MIME types, etc.).  These should have str 
parameters and results interfaces only, and specification of an encoding 
can be optional, defaulting to UTF-8 (or possibly defaulting to a 
Message-level encoding specification, which in turn may default to 
UTF-8), overridable in some of the APIs via optional parameters (some, 
because overloaded assignment APIs may not have room for such overrides, 
not having optional parameters).
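
The defaulting chain for encodings might look like the sketch below.  
Everything in it is hypothetical (the class `SketchMessage`, the methods 
`set_header` and `header_encoding`); it only shows the order of 
precedence described above: per-call override, then Message-level 
setting, then UTF-8.

```python
from typing import Dict, Optional, Tuple

DEFAULT_ENCODING = "utf-8"


class SketchMessage:
    """Hypothetical message object, not the email package's Message."""

    def __init__(self, encoding: Optional[str] = None) -> None:
        self.encoding = encoding  # Message-level default, may be None
        self._headers: Dict[str, Tuple[str, str]] = {}

    def _resolve(self, override: Optional[str]) -> str:
        # Precedence: per-call override, Message-level, then UTF-8.
        return override or self.encoding or DEFAULT_ENCODING

    def set_header(self, name: str, value: str,
                   encoding: Optional[str] = None) -> None:
        # An explicit method has room for the optional override.
        self._headers[name] = (value, self._resolve(encoding))

    def __setitem__(self, name: str, value: str) -> None:
        # Overloaded assignment cannot take an extra parameter, so it
        # always falls back to the defaults.
        self.set_header(name, value)

    def header_encoding(self, name: str) -> str:
        return self._headers[name][1]


msg = SketchMessage(encoding="iso-8859-1")
msg["Subject"] = "hello"
msg.set_header("X-Note", "privet", encoding="koi8-r")
```

Here msg["Subject"] picks up the Message-level encoding, and only the 
explicit set_header call can override it.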

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking