[Email-SIG] fixing the current email module
Glenn Linderman
v+python at g.nevcal.com
Fri Oct 9 22:26:19 CEST 2009
On approximately 10/9/2009 8:10 AM, came the following characters from
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
> > Emacs is different than email. Either you can read a file to edit it,
> > or you can't.
>
> *sigh* Emacs is as powerful a programming environment as Python, and
> applications regularly deal with network streams (HTTP, NNTP, and SMTP
> most commonly, but also raw X protocol and any kind of socket
> supported by the platform). So, yes, it's different from email,
> because it's *far* more general. That's precisely why I appreciate
> Bill's concerns about non-email usage.
>
OK, yes, Emacs is an operating system. I am an Emacs user. And yes, I
know Emacs can read email (I used it to read and write email, but found
it seriously lacking for the way I handle email, and annoying that the
email buffers and edit buffers were all in the same buffer pool, and I
quit using it for email). And I know it can be programmed, and I've
done a little of that, but I hate Lisp, so I mostly Google for the
packages that do what I need, and don't try to create my own.
> > The Postel principle for email says to try to do the best you can,
> > for as much as you can.
>
> Actually, it doesn't. It says be lenient in what you accept, strict
> in what you emit. You accept it ... but you don't have to do
> anything with it except preserve it verbatim for whoever wants it.
>
Yes, that is what it says, I agree. But unless you do the best you can,
for as much as you can, no one is going to want it, so they are
basically the same.
> > > > produce a defect report, but then simply converted to Unicode as if it
> > > > were Latin-1 (since there is no other knowledge available that could
> > > > produce a better conversion).
> > >
> > > No, that is already corruption. Most clients will assume that string
> > > is valid as a header, because it's valid as a string.
> >
> > Sure it is corruption. That's why there is a defect report. But
> > the conversion technique is appropriate, per the Postel principle.
>
> Actually, I would say you are emitting leniently, in violation of the
> Postel principle.
You can say that, but I don't have to believe it. I'm talking about
accepting; the message has arrived, it is here, the client is trying to
look at it, and I'm talking about ways the client can look at
not-quite-perfect data, knowing that it is not quite perfect, but still
being able to see it. I'm not at all talking about emitting data. You
seem to be treating the email package's help in accepting
not-quite-perfect data as a form of emitting data. It is not.
> You don't know what the client will do, they may
> eat it in a single gulp without looking at it. Thus you should avoid
> converting anything that you don't know what it is (unless
> specifically asked to do your best).
>
The email package cannot police the client... if it chooses to "eat it
in a single gulp without looking at it" then it may get indigestion. I
never suggested that "converting to Unicode as if it were Latin-1"
should be done without informing the client, or being requested by the
client to do that via a special API call... I was only talking about an
appropriate method of doing conversions in the presence of
not-quite-perfect data input, so that the client, and possibly even a
human, can try to make some sense out of the not-quite-perfect data.
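As a rough illustration of that kind of conversion (a sketch only;
DecodedHeader and decode_header_bytes are hypothetical names, not part of
the email package), a "do your best" decode that the client must request
explicitly might fall back to Latin-1 and record a defect alongside the
result:

```python
from dataclasses import dataclass, field


@dataclass
class DecodedHeader:
    """Hypothetical result type: decoded text plus any defects noticed."""
    text: str
    defects: list = field(default_factory=list)


def decode_header_bytes(raw: bytes, declared_charset: str = "ascii") -> DecodedHeader:
    """Decode raw header bytes, falling back to Latin-1 on failure.

    The fallback cannot fail, because every byte value maps to a code
    point in Latin-1, and the original bytes stay recoverable by
    re-encoding the text as Latin-1.
    """
    try:
        return DecodedHeader(raw.decode(declared_charset))
    except (UnicodeDecodeError, LookupError) as exc:
        result = DecodedHeader(raw.decode("latin-1"))
        # The defect travels with the text, so the conversion is not silent.
        result.defects.append(f"undecodable as {declared_charset}: {exc}")
        return result
```

The client sees both the best-effort text and the defect, and decides what
to do with them.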
> > Again, I mentioned producing a defect report. That is not passing
> > an error silently.
>
> But if I access that Unicode object without looking at the defect
> report, you *will* pass the error silently. OTOH, if I look at the
> defect report, I won't access the Unicode object.
>
If those are the only two choices you see, then you are not doing your
whole job.
If you ignore defect reports, you are ignorant (blunt, but not intended
to be offensive).
If you treat all defect reports as fatal errors, then you are not being
lenient in what you accept (non-Postel).
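A minimal sketch of that middle ground, using the defects list that the
email package's parser attaches to message objects (read_leniently and the
sample message are made up for illustration): the client looks at the
defects, keeps them as warnings, and still uses whatever data it can get.

```python
import email

# A message that claims to be multipart but defines no boundary.
RAW = (
    "Content-Type: multipart/mixed\n"
    "Subject: demo\n"
    "\n"
    "body text with no boundary\n"
)


def read_leniently(raw_text):
    """Parse a message; return its subject, payload, and defect names."""
    msg = email.message_from_string(raw_text)
    # Neither ignore the defects nor treat them as fatal: report them
    # alongside the data so the caller can decide.
    warnings = [type(defect).__name__ for defect in msg.defects]
    return msg["Subject"], msg.get_payload(), warnings


subject, payload, warnings = read_leniently(RAW)
```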
> > It is still raw user input, and should still be checked for proper
> > syntax by the client,
>
> Nonsense. The email module had better know a lot more about syntax
> than the client. If it doesn't, whack it with a 2x4 until it learns!
>
I think we are talking at cross purposes here. I find it quite
difficult to follow where you cross the boundary between talking about
one sort of email package client, and then switch to another type, or
switch to the responsibilities of the email package.
A client which is an MUA is just going to present the best possible data
to a human user, and is done. A client which is an email archiver
preserves the data for presenting via other MUAs.
An application which is using email as a transport, has specific goals,
which require specific content. You were mentioning clients. It is
this sort of client I thought you were talking about, and to which I
responded. If such a client doesn't validate the syntax of that
content, it isn't much of an application. The email module does not,
and cannot, understand the application domain; it can only validate that
the message has proper (or improper) structure. The transported content
is fully the responsibility of the application to validate, parse, and
manipulate. The email module may detect if the transport caused garbling
in the structure of the message, and may be able to warn the application
about such garbling. But that may not prevent the application from
finding its content within even a garbled email, and so it may still be
able to validate, parse, and manipulate that content. Such clients may
transfer content either in headers or in MIME parts... in any case,
whatever client specific content is expected in those headers or MIME
parts should be validated by the client.
> > produces no defect report. If you don't want to check proper syntax in
> > your program inputs, I don't want to use your programs, they will be
> > insecure.
>
> So you're saying that every program that uses the email module should
> reproduce 100% of the functionality of the email module's parser, or
> it's insecure. And you imply that's an excuse for passing corrupt
> data to any client that asks for it.
>
> I disagree.
>
I'm glad you disagree with what you thought I was saying, because that
isn't what I was saying, and I also disagree with your paraphrase of
what I was saying. The email package should parse email. Where it
finds not-quite-perfect data, the client may get involved to choose a
path for interpreting the not-quite-perfect data... or to reject the
not-quite-perfect data.
Once the data from the email is discovered, then the client must operate
on the data. An MUA would simply display it to a human. Other clients
would attempt to interpret the content. The interpretation of the
content requires the client to parse, validate the syntax of, and
manipulate the content. An example would be a program that does
appointments via email. If it finds an appointment in a known format,
it enters it into the calendar. The email package knows nothing about
appointments or calendars (of the sort that hold appointments). It
cannot help, only the client can do that part of the job.
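To make the division of labor concrete, here is a sketch of such a client.
The "APPT <date> <time> <title>" line format and the function name are
invented for this example; the email package only hands over the body, and
the client does all of the appointment-specific parsing and validation.

```python
import email
import re
from datetime import datetime

# Hypothetical application format: "APPT YYYY-MM-DD HH:MM title"
APPT_RE = re.compile(r"^APPT (\d{4}-\d{2}-\d{2} \d{2}:\d{2}) (.+)$", re.MULTILINE)


def extract_appointments(raw_text):
    """Return (datetime, title) pairs for every valid appointment line."""
    msg = email.message_from_string(raw_text)
    body = msg.get_payload()
    if not isinstance(body, str):
        return []  # multipart handling is left out of this sketch
    found = []
    for stamp, title in APPT_RE.findall(body):
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        except ValueError:
            # Looks like an appointment but is not a real date: reject it.
            continue
        found.append((when, title.strip()))
    return found
```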
> > So there seem to be two techniques:
>
> Whatever gave you that idea?
>
I'm not sure what you are asking here.
> > 2) Store the data, and convert only if the data is accessed.
>
> > With technique 2, little effort is required to store the data,
> > create a state variable to indicate whether it has been converted
>
> Why do that? It's always "False" in technique 2.
>
The first time it is always false. Subsequent requests can leverage the
work done by the first request, if results were created and cached.
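A small sketch of that caching, assuming a hypothetical LazyPart class
(nothing like it is claimed to exist in the email package): the flag starts
out false, the first access does the conversion, and later accesses reuse
the stored result.

```python
class LazyPart:
    """Hypothetical container that converts its raw bytes on first access."""

    def __init__(self, raw: bytes, charset: str = "utf-8"):
        self._raw = raw
        self._charset = charset
        self._converted = False  # always False until the first access
        self._text = None
        self.conversions = 0     # counter kept only to show the caching

    @property
    def text(self) -> str:
        if not self._converted:
            # If this raises, nothing is cached and the flag stays False.
            self._text = self._raw.decode(self._charset)
            self._converted = True
            self.conversions += 1
        return self._text
```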
> > and parsed, or not, and then IF (and only IF) the data is accessed,
> > the conversion and parsing must be done on the first access, and
> > instead of creating and storing metainformation about the errors,
> > they could just be raised.
>
> No, they cannot just be raised. If you just raise the error, then the
> next time you try to access unparsed data, you'll hit the error
> again. If you use the same handler you did before, you're in an
> infloop. So you need a second handler to do things differently this
> time or a flag ... but it's unclear to me that that flag can be a
> boolean. So you may as well store the defect list and information
> about where to restart.
>
From the point of view of the email package, the errors can just be
raised. Then the client can make choices, and use other APIs or other
parameters to the API to direct the email package to attempt a different
technique to access the data. If the technique is successful, then
progress is made. If unsuccessful, another error is raised by the
different technique. If there are more techniques, repeat. When out of
techniques, and no success, then the client needs to remember (possibly
with the help of APIs of the email package) that it cannot interpret
this data in a useful manner. If it then continues to attempt to access
the data using failed techniques, and goes into an infinite loop, then
the client has a bug.
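The loop described above might look roughly like this on the client side.
This is only a sketch: the Client class is hypothetical, and plain
bytes.decode calls stand in for whatever alternative APIs or parameters the
email package would offer.

```python
class Client:
    """Hypothetical client that tries each technique once per item."""

    TECHNIQUES = ("ascii", "utf-8")  # stand-ins for alternative access APIs

    def __init__(self):
        # Keys of items that no technique could interpret.
        self.unreadable = set()

    def read(self, key, raw: bytes):
        if key in self.unreadable:
            return None  # remembered failure: do not retry and loop forever
        for charset in self.TECHNIQUES:
            try:
                return raw.decode(charset)
            except UnicodeDecodeError:
                continue  # the error was raised; try the next technique
        # Out of techniques: remember that this data is not usable.
        self.unreadable.add(key)
        return None
```

Each failed technique raises; the client, not the library, decides what to
try next and when to give up.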
> > So the Pythonic way, AFAIU, is that errors are returned out-of-band
> > via raised exceptions.
>
> Sure. But what you're missing is that "Neither rain, nor snow, nor
> dark of night may stop the Parser on her appointed rounds."
I haven't forgotten that, but clearly we haven't been communicating
effectively. That may be partly my fault, partly because I'm relatively
new to Python and to the email package (having only experimented with it
using Python 2.6, not coded inside it, to date), but I'm trying... I'm
hoping to write some email processing programs using the Python email
package, and so I do have a strong interest in this topic. I'm hoping I
don't have to start from scratch and write my own email package, because
Python's isn't functional enough, or doesn't perform well enough. Being
new to Python, I've chosen to focus on building my applications with
Python 3, understanding that there are fewer fully functional pieces in
that arena to date, and since email is one that has some rough edges
because of the Unicode strings, I'm trying to participate where I can.
> It is not
> easy to write parsers, but I'll tell you one thing: it's orders of
> magnitude harder to write a parser that starts in the middle and works
> outward, than one that starts at the beginning and works forward to
> the end.
>
Yes, I have learned that in my 34 years of programming. I agree.
> So it's OK to write a lazy parser, but it must retain enough state so
> that it can work forward until the end. Because you don't know that
> the client will not request the last character of the message, you
> need to be able to try to get it, no matter what happened to the first
> 10GB of the message. And if an exception occurs, it must be handled
> by the parser itself; if not, you put the poor thing in the position
> of starting over at the beginning (that way lies the madness of
> infloops), or trying to start a parse in the middle and work out.
>
Are you speaking about parsing the message into MIME parts, or parsing a
particular MIME part contained within the message, or both?
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking