[Email-SIG] fixing the current email module
Stephen J. Turnbull
stephen at xemacs.org
Sat Oct 10 15:59:03 CEST 2009
I'm running out of time to work on this (yeah, I know it's the
weekend, but my life is like that lately). I think we're converging,
though, so I'd like to try and tie some of those ends together.
Glenn Linderman writes:
> On approximately 10/9/2009 8:10 AM, came the following characters from
> the keyboard of Stephen J. Turnbull:
> > Actually, I would say you are emitting leniently, in violation of the
> > Postel principle.
>
> You can say that, but I don't have to believe it. I'm talking about
> accepting; the message has arrived, it is here, the client is trying to
> look at it, and I'm talking about ways the client can look at
> not-quite-perfect data, knowing that it is not quite perfect, but still
> being able to see it. I'm not at all talking about emitting data.
It would indeed be emitting, if the corrupt data is stored in the place where
correctly decoded data normally is stored, and is accessible in the
same way. But I gather that's not what you were talking about, my
mistake.
> You seem to be calling the email package helping the client to
> accept not-quite-perfect data, as a form of emitting data. It is
> not.
No, I was confused by the way you wrote. Saving the data *somewhere*
is absolutely necessary; not losing data is the #1 commandment of
low-level mail processing. Surely the email module is subject to that
commandment. *Nobody* is talking about losing any data yet, except
Barry indirectly when he says that some people are thinking of giving
up on invertibility (often called "idempotency"), and even he is quite
adamant that he's not going to give up on that.
So when you wrote about saving and converting to text form, without
mentioning the specific APIs, I assumed you meant the "mainline"
APIs for parsing and accessing parts of a correctly formatted message.
> The email package cannot police the client... if it chooses to "eat it
> in a single gulp without looking at it" then it may get indigestion. I
> never suggested that "converting to Unicode as if it were Latin-1"
> should be done without informing the client, or being requested by the
> client to do that via a special API call...
Well, maybe I misread it, but it certainly looked like that to me. I
would not object to that special API call defaulting to ISO 8859/1.
> If you ignore defect reports, you are ignorant (blunt, but not intended
> to be offensive).
What I worried about is that if defect reports are present, *but
displayable data is also present*, programmers *will* simply display
it, for example in producing a prototype program. It will be
impossible to determine without very close analysis of that program
that an early version became a production version without adding
appropriate checks. In practice, this bug will be discovered when
some end user's installation breaks.
It seems that you agree with this, and because the special API call is
necessary, it will be easy to identify whether proper care is being
taken or not. Right?
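
To make the distinction concrete, here is a minimal sketch of a "mainline"
accessor next to a separate lenient call. Every name in it (HeaderValue,
DefectiveDataError, text, text_lenient) is hypothetical and is not part of
the email package; the ISO 8859/1 default is the one suggested above.

```python
class DefectiveDataError(Exception):
    """Raised when the normal accessor is asked for data that has defects."""


class HeaderValue:
    """Hypothetical wrapper: keeps the raw bytes and records decode defects."""

    def __init__(self, raw: bytes, charset: str = "ascii"):
        self.raw = raw          # the original bytes are never discarded
        self.charset = charset
        self.defects = []       # defect reports collected while decoding
        try:
            self._text = raw.decode(charset)
        except (UnicodeDecodeError, LookupError) as exc:
            self._text = None
            self.defects.append(exc)

    @property
    def text(self) -> str:
        """Mainline accessor: only returns data that decoded cleanly."""
        if self.defects:
            raise DefectiveDataError(self.defects)
        return self._text

    def text_lenient(self, fallback: str = "iso-8859-1") -> str:
        """Special call: the client explicitly asks for a best-effort view."""
        return self.raw.decode(fallback, errors="replace")


good = HeaderValue(b"hello")
bad = HeaderValue(b"caf\xe9")   # a Latin-1 byte in a value declared as ASCII
```

With this shape, a prototype that only ever touches `text` fails loudly on
`bad`, and any use of `text_lenient` is easy to find when auditing a
program.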
> > > It is still raw user input, and should still be checked for proper
> > > syntax by the client,
> >
> > Nonsense. The email module had better know a lot more about syntax
> > than the client. If it doesn't, whack it with a 2x4 until it learns!
>
> I think we are talking at cross purposes here. I find it quite
> difficult to follow where you cross the boundary between talking about
> one sort of email package client, and then switch to another type, or
> switch to the responsibilities of the email package.
Excuse me? The "raw user input" you referred to above is material
that the client software receives from the email package. The email
package should give it to the client in the "normal" (convenient) way
only if it can certify that it conforms to the appropriate standard.
That standard should be specified in the API documentation. Any more
detailed structure, of course, is the responsibility of the client.
> An application which is using email as a transport, has specific goals,
> which require specific content. You were mentioning clients.
I've already said that when I speak of an MUA, I write "MUA". In
speaking of the calling program, which might even be a user running
the module via the Python interpreter, I write "client". It's a very
convenient way to describe the user of an API, in contrast to the
provider of the API (the implementation).
> If such a client doesn't validate the syntax of that content, it
> isn't much of an application.
If that MUA or email application uses RFC 822 addresses, it should be
able to rely on the email module to parse those addresses correctly,
or provide a defect report. One might even go so far as to suggest
that it be able to parse the (non-RFC, but very common) "+" notation
for separating the "mailbox" from "additional data" used for VERP and
challenge-response applications. That would have to be documented,
but if so documented, client applications like the MUA should be able
to rely on it (and you can bet many will).
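
As a rough illustration of the "+" convention, the sketch below uses the
existing email.utils.parseaddr and then splits the local part by hand.
The helper split_plus_address is hypothetical, and it assumes the simple
case where "+" is not inside a quoted local part.

```python
from email.utils import parseaddr


def split_plus_address(header_value: str):
    """Return (display_name, mailbox, extra, domain), or None if unparsable.

    'extra' is the text after the first "+" in the local part, or None
    when the address carries no additional data.
    """
    display_name, addr = parseaddr(header_value)
    if not addr or "@" not in addr:
        return None
    local, _, domain = addr.rpartition("@")
    mailbox, sep, extra = local.partition("+")
    return display_name, mailbox, (extra if sep else None), domain


verp = split_plus_address(
    "List Bounces <list-bounces+user=example.org@lists.example.com>"
)
plain = split_plus_address("plain@example.com")
```

A real implementation would also need to report a defect rather than
return None, so that the client can tell "no address" from "bad address".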
Application domain syntax of course is not the email module's problem
whether it arrives by email or Pony Express, and I'm really confused
why you're going so far afield.
> > No, they cannot just be raised. If you just raise the error, then the
> > next time you try to access unparsed data, you'll hit the error
> > again. If you use the same handler you did before, you're in an
> > infloop. So you need a second handler to do things differently this
> > time or a flag ... but it's unclear to me that that flag can be a
> > boolean. So you may as well store the defect list and information
> > about where to restart.
>
> From the point of view of the email package, the errors can just be
> raised. Then the client can make choices, and use other APIs or other
> parameters to the API to direct the email package to attempt a different
> technique to access the data.
The problem is that by this point some of the state of the parse may
be lost. We can't say "just raise", we need to say "interrupt the
parse, preserve state, and then raise". Python does absolutely
nothing to help with the problem of preserving the state. We also
need to determine just what state to preserve.
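
One possible shape for "interrupt the parse, preserve state, and then
raise" is an exception that carries the state itself. This is only a
sketch: ParseInterrupted and parse_header_lines are hypothetical names,
the "state" is reduced to a byte offset plus the lines parsed so far, and
the second attempt is selected by a boolean flag, which may well be too
simple for the real parser.

```python
class ParseInterrupted(Exception):
    """Carries the defect list and enough state to restart the parse."""

    def __init__(self, defects, offset, parsed):
        super().__init__(f"parse interrupted at byte {offset}")
        self.defects = defects  # what went wrong
        self.offset = offset    # where the offending line starts
        self.parsed = parsed    # lines successfully parsed so far


def parse_header_lines(raw: bytes, start: int = 0, parsed=None, lenient=False):
    """Decode LF-terminated lines as ASCII, starting at byte 'start'."""
    parsed = [] if parsed is None else parsed
    offset = start
    while offset < len(raw):
        end = raw.find(b"\n", offset)
        end = len(raw) if end == -1 else end + 1
        line = raw[offset:end]
        try:
            text = line.decode("ascii")
        except UnicodeDecodeError as exc:
            if not lenient:
                # Preserve state first, then raise.
                raise ParseInterrupted([exc], offset, parsed) from exc
            text = line.decode("iso-8859-1")  # explicit best-effort fallback
        parsed.append(text.rstrip("\r\n"))
        offset = end
    return parsed


raw = b"Subject: ok\nX-Note: caf\xe9\nTo: a@example.com\n"
try:
    result = parse_header_lines(raw)
except ParseInterrupted as stop:
    resume_at = stop.offset
    # Resuming with the same strict settings would hit the same error
    # again, so the second attempt must behave differently.
    result = parse_header_lines(raw, stop.offset, stop.parsed, lenient=True)
```

The client decides what to do after the interruption, but it can only do
so because the exception hands back the offset and the partial result.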
> Yes, I have learned that in my 34 years of programming. I agree.
>
> > So it's OK to write a lazy parser, but it must retain enough state so
> > that it can work forward until the end. [...]
>
> Are you speaking about parsing the message into MIME parts, or parsing a
> particular MIME part contained within the message, or both?
Both. I *believe* (but it needs to be checked) that in a correctly
formed multipart MIME object (message or part), any internal structure
is context-free within the MIME boundaries. If that is so, then
individual parts of the object can be stored in raw form and parsed
lazily.
Similarly, for any MIME or RFC 822 object, the object can be parsed
into header section and body section, and each can be stored and
parsed lazily, subject to the condition that the header section must
be sufficiently parsed to identify all headers that might affect
parsing the body part before the body part is parsed. That
"condition" is the context.
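
The ordering constraint can be sketched as follows. LazyMessage is a
hypothetical class, not the email package's design, and it makes strong
simplifying assumptions: LF line endings, no folded headers, ASCII header
names, and Content-Transfer-Encoding as the only header that affects the
body.

```python
import base64


class LazyMessage:
    """Keeps the raw bytes; parses headers and body only on first access."""

    def __init__(self, raw: bytes):
        self.raw = raw
        self._headers = None
        self._body = None

    def _split(self):
        # The header section ends at the first blank line.
        head, _, body = self.raw.partition(b"\n\n")
        return head, body

    @property
    def headers(self):
        if self._headers is None:
            head, _ = self._split()
            self._headers = {}
            for line in head.split(b"\n"):
                name, _, value = line.partition(b":")
                key = name.strip().lower().decode("ascii")
                self._headers[key] = value.strip()
        return self._headers

    @property
    def body(self):
        if self._body is None:
            # Going through self.headers forces the header parse first:
            # this is the "context" the body parse depends on.
            cte = self.headers.get("content-transfer-encoding", b"7bit").lower()
            _, raw_body = self._split()
            if cte == b"base64":
                self._body = base64.b64decode(raw_body)
            else:
                self._body = raw_body
        return self._body


msg = LazyMessage(b"Content-Transfer-Encoding: base64\n\naGVsbG8=\n")
headers_parsed_before = msg._headers is not None
decoded = msg.body
headers_parsed_after = msg._headers is not None
```

Because the raw bytes are retained, the same approach could be applied
recursively to the parts of a multipart object, if the context-free
assumption above holds.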