[Email-SIG] API for Header objects [was: Dropping bytes "support" in json]
Glenn Linderman
v+python at g.nevcal.com
Thu Apr 16 22:44:14 CEST 2009
On approximately 4/16/2009 6:02 AM, came the following characters from
the keyboard of Steven D'Aprano:
> On Thu, 16 Apr 2009 10:39:52 am Tony Nelson wrote:
>
>> I don't want there to be any "str(msg['tag'])" or "bytes(msg['tag'])"
>> at all, so there would be no loss of consistency.
>>
>
> That's ... different.
>
>> If the data for a header field is not properly a string,
>>
> But it always is.
>
> Even badly formatted emails with corrupt headers containing binary
> characters are strings -- they're just byte (non-Unicode) strings
> containing binary characters. Your mail server might not accept it as
> part of a valid header, but it's a valid byte string.
>
Wire format email headers are composed of a subset of ASCII text. There
should be a way to obtain them, either as bytes, or via the trivial str
conversion of those bytes to Unicode. Even corrupt headers containing
binary characters should be obtainable that way. There are no header
encoding or decoding algorithms that cannot be reworked to function
properly on either the raw_bytes or raw_str version of a header, since
the numeric values and sequence of all binary octets would be preserved
via both raw_bytes and raw_str. *The key is to know what is in hand.*
For both raw_bytes and raw_str, all characters would be in the range 0 -
0xFF. This is simple transliteration, not interpretation or parsing. A
non-corrupt header would have a smaller range, 0x20 - 0x7F. Any header
should be obtainable or settable in this form, using either bytes or str
parameters/results. Yes, it should be possible to create corrupt
headers in this manner. Useful mostly for testing, or for idempotency
(which I also call GIGO).
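
A minimal sketch of that transliteration, assuming Latin-1 is used as the
"trivial" bytes/str mapping (it maps each octet 0x00 - 0xFF to the code
point with the same value). The helper names are hypothetical and are not
part of any existing email package API:

```python
# Hypothetical helpers illustrating raw_bytes <-> raw_str transliteration.
# Assumption: Latin-1 is the "trivial" conversion, since it maps every
# octet to the code point with the same numeric value.

def raw_bytes_to_raw_str(raw: bytes) -> str:
    """Return a str whose code points equal the octet values, in order."""
    return raw.decode("latin-1")


def raw_str_to_raw_bytes(raw: str) -> bytes:
    """Inverse mapping; raises UnicodeEncodeError for anything above U+00FF."""
    return raw.encode("latin-1")


# A deliberately corrupt header containing binary octets.
corrupt = b"Subject: caf\xe9 \x00\xff"
as_str = raw_bytes_to_raw_str(corrupt)

# Numeric values and sequence are preserved in both directions.
assert [ord(ch) for ch in as_str] == list(corrupt)
assert raw_str_to_raw_bytes(as_str) == corrupt
```

No interpretation happens here: 0xE9 is not treated as "é" in any
meaningful sense, it is just carried through as code point U+00E9.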
However, obtaining headers in that way should be "hard", but only in the
sense of having to type more, because it is part of a lower-level
interface rather than the primary APIs... like msg['tag'].raw_bytes or
msg['tag'].raw_str... because it is actually the easiest way
(implementation-wise) to obtain a copy of the data... but that copy may
not be as useful as one might like.
str(msg['tag']) or msg['tag'].str (or some such spelling[s]) should
always produce a displayable form of the header. If it is a known,
standardized header that may contain data that was encoded for
transmission, such encodings should be reversed, and Unicode characters
outside the range of U+0020 - U+007F may be included. Remember the goal
here is "displayable". So if the encoding is bad for a standard header,
or a standard header is corrupt, or a non-standard header contains what
is apparently binary gibberish, and non-displayable Unicode control
characters are generated, they should be escaped as 7 ASCII characters
representing a Unicode code point "\U+0017". All such display strings
must always have "\" converted to "\\" so that there is no ambiguity
when interpreting strings that may contain text that looks like one of
the escape strings.
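
The escaping rule could be sketched roughly as follows. This is only an
illustration: display_form is a hypothetical name, and treating every
code point in a Unicode "C" (control/other) general category as
non-displayable is an assumption, not something specified above:

```python
import unicodedata


def display_form(text: str) -> str:
    """Escape backslashes and non-displayable characters for display.

    Assumption: "non-displayable" means any code point whose Unicode
    general category starts with "C" (Cc, Cf, Cs, Co, Cn).
    """
    out = []
    for ch in text:
        if ch == "\\":
            # Double every backslash so escapes are unambiguous.
            out.append("\\\\")
        elif unicodedata.category(ch).startswith("C"):
            # Seven ASCII characters for code points up to U+FFFF;
            # larger code points produce more hex digits.
            out.append("\\U+%04X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


assert display_form("a\x17b") == "a\\U+0017b"
# Literal text that merely looks like an escape stays distinguishable.
assert display_form("a\\U+0017b") == "a\\\\U+0017b"
```

Because the backslash is always doubled, a single backslash followed by
"U+" in the output can only have come from an escaped character.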
Known standard headers should have additional APIs (these already exist
for the most useful ones) to obtain the interesting subcomponents
(encodings, names, addresses, MIME types, etc.). These should accept and
return str only. Specifying an encoding can be optional, defaulting to
UTF-8 (or possibly to a Message-level encoding specification, which in
turn may default to UTF-8), and overridable in some of the APIs via
optional parameters (only some, because overloaded assignment APIs have
no room for optional parameters).
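
One way to picture that chain of defaults, as a sketch only. SketchMessage
and resolve_encoding are hypothetical names invented for illustration and
do not correspond to anything in the existing email package:

```python
DEFAULT_ENCODING = "utf-8"


class SketchMessage:
    """Hypothetical message object carrying an optional default encoding."""

    def __init__(self, encoding=None):
        self.encoding = encoding


def resolve_encoding(message, override=None):
    """Pick an encoding: per-call override, then message level, then UTF-8."""
    if override is not None:
        return override
    if message.encoding is not None:
        return message.encoding
    return DEFAULT_ENCODING


plain = SketchMessage()
latin = SketchMessage(encoding="iso-8859-1")

assert resolve_encoding(plain) == "utf-8"
assert resolve_encoding(latin) == "iso-8859-1"
assert resolve_encoding(latin, override="koi8-r") == "koi8-r"
```

An assignment-style API such as msg['tag'] = value has nowhere to pass
override, so it would always fall through to the message-level or UTF-8
default.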
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking