[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Fri Oct 9 21:40:33 CEST 2009
On approximately 10/9/2009 5:23 AM, came the following characters from
the keyboard of Barry Warsaw:
> On Oct 8, 2009, at 6:50 PM, Glenn Linderman wrote:
>
>> On approximately 10/8/2009 4:40 AM, came the following characters
>> from the keyboard of Stephen J. Turnbull:
>>> Glenn Linderman writes:
>>>
>>> > > > If conversions are avoided, then octets are unlikely to be
>>> > > > out of range?
>>> > >
>>> > > Haven't looked in your spam bucket recently, I guess. Spammers
>>> > > regularly put 8 bit characters into headers (and into bodies in
>>> > > messages without a Content-Type header), for one thing.
>>> >
>>> > I'm aware of that, but if conversions are not done, octets are
>>> > unlikely to be _reported_ to be out of range....
>>>
>>> Conversions will eventually be done. "Best it were done quickly."
>>>
>>
>> Disagree. Deferring the conversions defers failure issues to the
>> point where the code (hopefully) somewhat understands the type of
>> data being manipulated, and can then handle it appropriately.
>> Converting up front causes errors in things that may never be touched
>> or needed, so the error detection and handling is wasteful.
>
> I'm with Stephen here. Remember, we're saying the parser should never
> throw an exception, so any such conversion exception happens when you
> manipulate the model directly. That /has/ to error early because
> otherwise it is impossible to debug.
I suspect we are talking with different terminology somehow, here. At
least it seems that way, between myself and Stephen. So let me return
to ground zero, and ask some very basic questions, to see what, if
anything, I am missing in my understanding of Stephen's and perhaps
your, terminology.
Let me speak in terms of parsing incoming wire-format messages, because
the creation of a valid email from API calls should be straightforward.
I see the necessary job of the parser as being to receive chunks of the
message, parse the headers into individual headers (based mostly on CR LF
TAB detection), and find the end of the headers. Then, in order to properly
handle the body, it needs to find several specific headers, or supply
defaults for them if lacking. This includes validating MIME-Version
and determining Content-Type and Content-Transfer-Encoding. Other
headers do not need to be decoded at
parse time, if I understand things, just parsed into buckets (a list to
preserve order, with possibly an index of some sort for performance if
necessary). The 3 headers mentioned should be fully validated and
decoded, so that parsing the body can proceed. Parsing the body finds
one or more MIME parts, and for each part, a list of its headers should
be created. Content-Type and Content-Transfer-Encoding should again be
fully validated and decoded, so that parsing the body of each part can
proceed recursively. The leaf MIME parts should have their wire format
data stored also.
Do you agree with that minimal functionality of message parsing?
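
To make the question concrete, here is a minimal Python sketch of the
parse-time work described above. It is an illustration only, not the
email package's API: the names split_headers, structural, and STRUCTURAL
are hypothetical, and it assumes the whole header block is already in
memory as bytes. Every header is kept as undecoded bytes in an ordered
list, and only the three structural headers are picked out.

```python
# Hypothetical sketch: split a raw header block into ordered buckets
# without decoding any header values.
STRUCTURAL = (b"mime-version", b"content-type", b"content-transfer-encoding")


def split_headers(raw):
    """Return (headers, body); headers is a list of (name, value) bytes."""
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = []
    for line in head.split(b"\r\n"):
        if line[:1] in (b" ", b"\t") and headers:
            # Continuation line: attach it to the previous header, still raw.
            name, value = headers[-1]
            headers[-1] = (name, value + b"\r\n" + line)
        elif b":" in line:
            name, _, value = line.partition(b":")
            headers.append((name, value.lstrip(b" ")))
        # Anything else is skipped here; a real parser would record a defect.
    return headers, body


def structural(headers):
    """Pick out only the headers needed to parse the body; the rest stay raw."""
    found = {}
    for name, value in headers:
        key = name.lower()
        if key in STRUCTURAL and key not in found:
            found[key] = value
    return found
```

Note that a Subject header containing stray 8-bit bytes passes through
split_headers untouched, since nothing here attempts a conversion.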
If content boundaries cannot be found, then the parsing will fail, and a
defect report will be generated for that part, and any higher-level parts that
include it, because they will also be incomplete. That is just a
parse-error flag, in the tree of MIME parts, AFAICT.
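
A rough sketch of such a flag, assuming a hypothetical Part class for
the nodes of the MIME tree (none of these names come from the email
package): the defect is recorded on the part where parsing failed, and
every enclosing part is marked incomplete.

```python
class Part:
    """Hypothetical MIME tree node that carries parse-defect flags."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.defects = []        # defects found in this part itself
        self.incomplete = False  # True if this part or a descendant failed
        if parent is not None:
            parent.children.append(self)

    def add_defect(self, description):
        """Record a defect here and mark every enclosing part incomplete."""
        self.defects.append(description)
        node = self
        while node is not None:
            node.incomplete = True
            node = node.parent
```

No exception is raised at any point; the client inspects the flags when
and if it cares about that part.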
I see the further validation and decoding of the MIME tree for the
message to be all based on API calls by the application to manipulate
the model, which should be able to raise exceptions as needed, and could
have fully Pythonic interfaces.
If the client wishes to have all headers, header values, and charset
decoding validated before doing model manipulations, then it should call
email package APIs that are provided to do that individually, per MIME
part, or recursively over the model (and which might raise exceptions).
If the client wishes to have all leaf MIME parts decoded from wire
format to "raw payload" or "decoded payload", before manipulating the
model, then it should call the email package APIs that are provided to
do that individually, per MIME part, or recursively over the model (and
which might raise exceptions).
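
As an illustration of the kind of on-demand call meant here, the sketch
below decodes leaf payloads either one at a time or recursively, and
raises when decoding is impossible. The names PayloadError,
decode_payload, and decode_tree are hypothetical, as is the nested-dict
representation of the tree; only the base64, binascii, and quopri calls
are real standard-library functions.

```python
import base64
import binascii
import quopri


class PayloadError(ValueError):
    """Hypothetical exception for a leaf part that cannot be decoded."""


def decode_payload(wire, cte):
    """Decode one leaf part's wire-format bytes; raise if that is impossible."""
    cte = cte.strip().lower()
    if cte in ("7bit", "8bit", "binary"):
        return wire
    if cte == "quoted-printable":
        return quopri.decodestring(wire)
    if cte == "base64":
        try:
            # Drop line breaks first, since validate=True rejects them.
            return base64.b64decode(b"".join(wire.split()), validate=True)
        except binascii.Error as exc:
            raise PayloadError(str(exc)) from exc
    raise PayloadError("unknown Content-Transfer-Encoding: %r" % cte)


def decode_tree(part):
    """Recursively decode every leaf of a (hypothetical) nested-dict tree."""
    if part.get("children"):
        return [decode_tree(child) for child in part["children"]]
    return decode_payload(part["wire"], part.get("cte", "7bit"))
```

A client that never calls decode_tree never pays for, or trips over, a
broken part it was not going to touch.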
Is there any other functionality that should be performed? If so, why?
It seems that Stephen is perhaps saying that the functionality in the
above two paragraphs should be performed during parsing. Is that what is
being said? I can hardly believe it, if so. Since there are multiple
ways to interpret not-quite-perfect data, application guidance is
required for those choices, and the creation of defect reports along the
way would be a bookkeeping headache.
>> So for headers, which are supposed to be ASCII, or encoded via RFC
>> rules to ASCII (no 8-bit chars), then the discovery of an 8-bit char
>> should produce a defect report, but it is then simply converted to
>> Unicode as if it were Latin-1 (since there is no other knowledge
>> available that could produce a better conversion). And if the result
>> of that is not expected by the client (your definition), then the
>> client should either notice the defect report and reject it based on
>> that, or attempt to parse it, and reject it if it encounters
>> unexpected syntax. After all, this is, for that client, "raw user
>> input" (albeit from a remote source) so fully error checking the
>> input is appropriate.
>
> Sure, but I can also think of lots of other things the client might
> do, including blowing away the header value and substituting their
> own, doing the moral equivalent of a str.replace(), etc. etc. It's
> not our job to decide. It's our job to provide the highest fidelity
> information we can and the best APIs for clients to do what they want.
Exactly. So if the client is going to blow away the header value, no
point to validate and decode it.
If the client is going to send it on, the client can choose to validate
before sending, or just send what was received, whether or not it was
valid. This depends on the purpose and functionality of the client.
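
For reference, the Latin-1 fallback proposed in the quoted text above
could look roughly like this when a client does ask for a header as
text. The function name header_to_text and the defect string are
hypothetical. The sketch relies on the fact that Latin-1 assigns a code
point to every byte value, so the decode step itself cannot fail.

```python
def header_to_text(raw):
    """Decode raw header bytes to str without ever raising.

    Returns (text, defects). Bytes >= 0x80 are reported as a defect and
    then mapped through Latin-1, which has a code point for every byte.
    """
    defects = []
    if any(b >= 0x80 for b in raw):
        defects.append("8-bit data in header")
    return raw.decode("latin-1"), defects
```

A client that intends to replace the header value simply never calls
this, and so never sees the defect.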
>> The problem with the APIs that are spelled __str__ and __bytes__ is
>> that there is no other way to return errors other than exceptions....
>> the Python way. Since the email library is trying to avoid raising
>> exceptions in large blocks of its code, it is non-Pythonic (which is
>> what Oleg is probably complaining about, in part). But because it
>> needs to avoid exceptions, and is therefore non-Pythonic, it may be
>> inappropriate to spell very many of its APIs __str__ and __bytes__,
>> because that is Pythonic, and requires exceptions. Once you become
>> non-Pythonic in one area, you may have to also be non-Pythonic in
>> some other areas...
>
> As was pointed out in a previous message, we shouldn't be too
> concerned with __str__ and __bytes__ right now. We'll design
> non-magical APIs for everything and they'll do the right thing. We'll
> then alias what seems appropriate as __str__ and __bytes__ and they'll
> be as Pythonic as makes sense. When I say that, I'm thinking about
> the semantic differences Message objects currently have in their
> dict-like-plus API (which I still think makes perfect practical sense).
OK, it seems we all understand the limitations of the __str__,
__bytes__, and assignment type APIs: they must either succeed, or raise
exceptions. Can we agree that clients should only use such APIs when
success is assured, or raising exceptions is acceptable? And that if a
client complains about an exception in a case they thought success
should have been assured, that it is not a bug if they misunderstood?
Clearly the email package should document the conditions for which
success can be assured, if there are any... and that it is fair game to
raise exceptions if those conditions are not met.
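
A small sketch of that split, assuming a hypothetical HeaderValue
wrapper (not an existing email package class): the non-magical decode()
method reports problems as data, while __str__ is a thin alias that can
only succeed or raise.

```python
class HeaderValue:
    """Hypothetical header wrapper: a non-raising API plus a raising alias."""

    def __init__(self, raw):
        self.raw = raw

    def decode(self):
        """Non-magical API: return (text, defects) and never raise."""
        try:
            return self.raw.decode("ascii"), []
        except UnicodeDecodeError as exc:
            return self.raw.decode("ascii", "replace"), [str(exc)]

    def __str__(self):
        """Pythonic alias: succeed cleanly or raise, with no third option."""
        text, defects = self.decode()
        if defects:
            raise ValueError("header is not valid ASCII: " + defects[0])
        return text
```

Here the documented condition for success would be "the raw value is
pure ASCII"; a client that cannot guarantee that should call decode()
instead of str().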
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking