[Email-SIG] fixing the current email module

Fri Oct 9 22:26:19 CEST 2009

On approximately 10/9/2009 8:10 AM, came the following characters from 
the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
>  > Emacs is different than email.  Either you can read a file to edit it, 
>  > or you can't.
>
> *sigh* Emacs is as powerful a programming environment as Python, and
> applications regularly deal with network streams (HTTP, NNTP, and SMTP
> most commonly, but also raw X protocol and any kind of socket
> supported by the platform).  So, yes, it's different from email,
> because it's *far* more general.  That's precisely why I appreciate
> Bill's concerns about non-email usage.
>   

OK, yes, Emacs is an operating system.  I am an Emacs user.  And yes, I 
know Emacs can read email (I used it to read and write email, but found 
it seriously lacking for the way I handle email, and annoying that the 
email buffers and edit buffers were all in the same buffer pool, and I 
quit using it for email).  And I know it can be programmed, and I've 
done a little of that, but I hate Lisp, so I mostly Google for the 
packages that do what I need, and don't try to create my own.

>  > The Postel principle for email says to try to do the best you can,
>  > for as much as you can.
>
> Actually, it doesn't.  It says be lenient in what you accept, strict
> in what you emit.  You accept it ... but you don't have to do
> anything with it except preserve it verbatim for whoever wants it.
>   

Yes, that is what it says, I agree.  But unless you do the best you can, 
for as much as you can, no one is going to want it, so they are 
basically the same.

>  > >  > produce a defect report, but then simply converted to Unicode as if it 
>  > >  > were Latin-1 (since there is no other knowledge available that could 
>  > >  > produce a better conversion).
>  > >
>  > > No, that is already corruption.  Most clients will assume that string
>  > > is valid as a header, because it's valid as a string.
>  > 
>  > Sure it is corruption.  That's why there is a defect report.  But
>  > the conversion technique is appropriate, per the Postel principle.
>
> Actually, I would say you are emitting leniently, in violation of the
> Postel principle.  

You can say that, but I don't have to believe it.  I'm talking about 
accepting; the message has arrived, it is here, the client is trying to 
look at it, and I'm talking about ways the client can look at 
not-quite-perfect data, knowing that it is not quite perfect, but still 
being able to see it.  I'm not at all talking about emitting data.  You 
seem to be treating the email package's help in letting the client 
accept not-quite-perfect data as a form of emitting data.  It is not.

> You don't know what the client will do, they may
> eat it in a single gulp without looking at it.  Thus you should avoid
> converting anything that you don't know what it is (unless
> specifically asked to do your best).
>   

The email package cannot police the client... if it chooses to "eat it 
in a single gulp without looking at it" then it may get indigestion.  I 
never suggested that "converting to Unicode as if it were Latin-1" 
should be done without informing the client, or being requested by the 
client to do that via a special API call... I was only talking about an 
appropriate method of doing conversions in the presence of 
not-quite-perfect data input, so that the client, and possibly even a 
human, can try to make some sense out of the not-quite-perfect data.
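
To make the "convert as if it were Latin-1, and say so" idea concrete, 
here is a minimal sketch.  The function name and the shape of the defect 
list are my own assumptions for illustration; nothing here is an 
existing email package API.

```python
def decode_header_bytes(raw: bytes):
    """Hypothetical helper: return (text, defects) for raw header bytes.

    Clean ASCII input yields an empty defect list. Anything else is
    decoded as Latin-1 (which cannot fail, since every byte maps to one
    code point) and a defect is recorded so the client knows the text
    is a best-effort guess.
    """
    defects = []
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as exc:
        defects.append(
            f"non-ASCII byte 0x{raw[exc.start]:02x} at offset {exc.start}"
        )
        text = raw.decode("latin-1")
    return text, defects


text, defects = decode_header_bytes(b"Subject: caf\xe9")
if defects:
    # The client decides what to do: show it anyway, log it, or reject it.
    print("best-effort text:", text, "| defects:", defects)
```

Because Latin-1 maps bytes to code points one for one, 
text.encode("latin-1") gives back the original bytes, so nothing is lost 
for a client that wants to try a different interpretation later.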

>  > Again, I mentioned producing a defect report.  That is not passing
>  > an error silently.
>
> But if I access that Unicode object without looking at the defect
> report, you *will* pass the error silently.  OTOH, if I look at the
> defect report, I won't access the Unicode object.
>   

If those are the only two choices you see, then you are not doing your 
whole job.

If you ignore defect reports, you are ignorant (blunt, but not intended 
to be offensive).
If you treat all defect reports as fatal errors, then you are not being 
lenient in what you accept (non-Postel).

>  > It is still raw user input, and should still be checked for proper 
>  > syntax by the client,
>
> Nonsense.  The email module had better know a lot more about syntax
> than the client.  If it doesn't, whack it with a 2x4 until it learns!
>   

I think we are talking at cross purposes here.  I find it quite 
difficult to follow where you cross the boundary between talking about 
one sort of email package client, and then switch to another type, or 
switch to the responsibilities of the email package.

A client which is an MUA is just going to present the best possible data 
to a human user, and is done.  A client which is an email archiver 
preserves the data for presenting via other MUAs.

An application which is using email as a transport, has specific goals, 
which require specific content.  You were mentioning clients.  It is 
this sort of client I thought you were talking about, and to which I 
was responding.  If such a client doesn't validate the syntax of that 
content, it isn't much of an application.  The email module does not, 
and cannot, understand the application domain; it can only validate that 
the message has proper (or improper) structure.  The transported content 
is fully the responsibility of the application to validate, parse, and 
manipulate.  The email module may detect if the transport caused garbling 
in the structure of the message, and may be able to warn the application 
about such garbling.  But that may not prevent the application from 
finding its content within even a garbled email, and so it may still be 
able to validate, parse, and manipulate that content.  Such clients may 
transfer content either in headers or in MIME parts... in any case, 
whatever client specific content is expected in those headers or MIME 
parts should be validated by the client.

>  > produces no defect report.  If you don't want to check proper syntax in 
>  > your program inputs, I don't want to use your programs, they will be 
>  > insecure.
>
> So you're saying that every program that uses the email module should
> reproduce 100% of the functionality of the email module's parser, or
> it's insecure.  And you imply that's an excuse for passing corrupt
> data to any client that asks for it.
>
> I disagree.
>   

I'm glad you disagree with what you thought I was saying, because that 
isn't what I was saying, and I also disagree with your paraphrase of 
what I was saying.  The email package should parse email.  Where it 
finds not-quite-perfect data, the client may get involved to choose a 
path for interpreting the not-quite-perfect data... or to reject the 
not-quite-perfect data.

Once the data from the email is discovered, then the client must operate 
on the data.  An MUA would simply display it to a human.  Other clients 
would attempt to interpret the content.  The interpretation of the 
content requires the client to parse, validate the syntax of, and 
manipulate the content.  An example would be a program that does 
appointments via email.  If it finds an appointment in a known format, 
it enters it into the calendar.  The email package knows nothing about 
appointments or calendars (of the sort that hold appointments). It 
cannot help, only the client can do that part of the job.
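
A rough sketch of that division of labor follows.  The email package 
calls (email.message_from_string, is_multipart, get_payload) are real; 
the "APPOINTMENT yyyy-mm-dd hh:mm" line format and the function name are 
invented purely for illustration.

```python
import email
from datetime import datetime

PREFIX = "APPOINTMENT "  # hypothetical application-level marker


def extract_appointment(message_text):
    """Return a datetime for the first valid appointment line, else None."""
    # The email package's part of the job: split headers from body.
    msg = email.message_from_string(message_text)
    if msg.is_multipart():
        return None  # this sketch only looks at simple single-part mail
    body = msg.get_payload()

    # The client's part of the job: find and validate its own content.
    for line in body.splitlines():
        if line.startswith(PREFIX):
            try:
                return datetime.strptime(
                    line[len(PREFIX):].strip(), "%Y-%m-%d %H:%M"
                )
            except ValueError:
                return None  # looks like an appointment, but malformed
    return None


good = "From: a@example.com\nSubject: meeting\n\nAPPOINTMENT 2009-10-12 14:00\n"
bad = "From: a@example.com\nSubject: meeting\n\nAPPOINTMENT next Tuesday\n"
print(extract_appointment(good))
print(extract_appointment(bad))
```

The email package never sees the word "appointment" as anything 
special; only the client knows what a well-formed one looks like.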

>  > So there seem to be two techniques:
>
> Whatever gave you that idea?
>   

I'm not sure what you are asking here.

>  > 2) Store the data, and convert only if the data is accessed.
>
>  > With technique 2, little effort is required to store the data,
>  > create a state variable to indicate whether it has been converted
>
> Why do that?  It's always "False" in technique 2.
>   

The first time it is always false.  Subsequent requests can leverage the 
work done by the first request, if results were created and cached.
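
Here is roughly what I mean by the state variable, as a minimal sketch. 
The class and attribute names are hypothetical, and the "conversion" is 
just an ASCII decode standing in for whatever parsing is really needed.

```python
class LazyHeader:
    """Hypothetical wrapper: store raw bytes, convert on first access only."""

    def __init__(self, raw: bytes):
        self._raw = raw
        self._converted = False  # the state variable under discussion
        self._text = None
        self.conversions = 0     # counter kept only to show the caching

    @property
    def text(self) -> str:
        if not self._converted:
            # First access: do the work and remember the result.
            self._text = self._raw.decode("ascii")
            self.conversions += 1
            self._converted = True
        # Later accesses reuse the cached result.
        return self._text


header = LazyHeader(b"To: someone@example.com")
first = header.text
second = header.text
print(header.conversions)
```

If the header is never accessed, no conversion work is done at all; if 
it is accessed many times, the work is done once.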

>  > and parsed, or not, and then IF (and only IF) the data is accessed,
>  > the conversion and parsing must be done on the first access, and
>  > instead of creating and storing metainformation about the errors,
>  > they could just be raised.
>
> No, they cannot just be raised.  If you just raise the error, then the
> next time you try to access unparsed data, you'll hit the error
> again.  If you use the same handler you did before, you're in an
> infloop.  So you need a second handler to do things differently this
> time or a flag ... but it's unclear to me that that flag can be a
> boolean.  So you may as well store the defect list and information
> about where to restart.
>   

From the point of view of the email package, the errors can just be 
raised.  Then the client can make choices, and use other APIs or other 
parameters to the API to direct the email package to attempt a different 
technique to access the data.  If the technique is successful, then 
progress is made.  If unsuccessful, another error is raised by the 
different technique.  If there are more techniques, repeat.  When out of 
techniques, and no success, then the client needs to remember (possibly 
with the help of APIs of the email package) that it cannot interpret 
this data in a useful manner.  If it then continues to attempt to access 
the data using failed techniques, and goes into an infinite loop, then 
the client has a bug.
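
In code, the client side of that might look something like the sketch 
below.  ConversionError, the three technique functions, and 
read_with_fallbacks are all names I made up for illustration; they are 
assumptions, not existing email package APIs.

```python
class ConversionError(Exception):
    """Hypothetical error raised when a technique cannot interpret the data."""


def strict_ascii(raw: bytes) -> str:
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        raise ConversionError(str(exc)) from exc


def strict_utf8(raw: bytes) -> str:
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ConversionError(str(exc)) from exc


def as_latin1(raw: bytes) -> str:
    return raw.decode("latin-1")  # last resort: never raises


def read_with_fallbacks(raw, techniques):
    """Try each technique once, in order; never retry one that failed."""
    failures = []
    for technique in techniques:
        try:
            return technique(raw), failures
        except ConversionError as exc:
            failures.append((technique.__name__, str(exc)))
    # Out of techniques: the client remembers that this data is unusable.
    return None, failures


text, failures = read_with_fallbacks(
    b"caf\xe9", [strict_ascii, strict_utf8, as_latin1]
)
print(text, [name for name, _ in failures])
```

Each technique is tried exactly once, so there is no infinite loop: the 
list of techniques is finite, and the client walks it from front to back.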

>  > So the Pythonic way, AFAIU, is that errors are returned out-of-band
>  > via raised exceptions.
>
> Sure.  But what you're missing is that "Neither rain, nor snow, nor
> dark of night may stop the Parser on her appointed rounds."  

I haven't forgotten that, but clearly we haven't been communicating 
effectively.  That may be partly my fault, partly because I'm relatively 
new to Python and to the email package (having only experimented with it 
using Python 2.6, not coded inside it, to date), but I'm trying...  I'm 
hoping to write some email processing programs using the Python email 
package, and so I do have a strong interest in this topic.  I'm hoping I 
don't have to start from scratch and write my own email package, because 
Python's isn't functional enough, or doesn't perform well enough.  Being 
new to Python, I've chosen to focus on building my applications with 
Python 3, understanding that there are fewer fully functional pieces in 
that arena to date, and since email is one that has some rough edges 
because of the Unicode strings, I'm trying to participate where I can.

> It is not
> easy to write parsers, but I'll tell you one thing: it's orders of
> magnitude harder to write a parser that starts in the middle and works
> outward, than one that starts at the beginning and works forward to
> the end.
>   

Yes, I have learned that in my 34 years of programming.  I agree.

> So it's OK to write a lazy parser, but it must retain enough state so
> that it can work forward until the end.  Because you don't know that
> the client will not request the last character of the message, you
> need to be able to try to get it, no matter what happened to the first
> 10GB of the message.  And if an exception occurs, it must be handled
> by the parser itself; if not, you put the poor thing in the position
> of starting over at the beginning (that way lies the madness of
> infloops), or trying to start a parse in the middle and work out.
>   

Are you speaking about parsing the message into MIME parts, or parsing a 
particular MIME part contained within the message, or both?

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking