[Web-SIG] Stuff left to be done on WSGI

Sat Aug 28 06:51:57 CEST 2004

Phillip J. Eby wrote:
>> I would hope that we can come to some consensus and produce something 
>> useable before 2.5, with the understanding that it will be included in 
>> 2.5.  I would kind of like to see a "web" package.
> 
> 
> I think we'll have better luck with a 'wsgi' package, but I could be 
> wrong.  'web' just seems like a nuisance attractor for all sorts of 
> unproductive bickering on so many levels.
> 
> On a more immediate practical level, we'd be crazy to try to claim 'web' 
> for a third-party package that we want to propose for the stdlib, but a 
> package named 'wsgi' would be more than fair game.

I would only want to use "web" if we could get agreement that it would 
be in 2.5 under that name.  I was thinking of it like a package for 
various Python web-related modules (the Next Generation; forgoing this 
current generation which is all in the root).

Almost all the modules in the root have issues.  Well, let's enumerate...

webbrowser: this seems like a totally weird module to me
cgi: ick ick ick.
cgitb: this is okay.
urllib: defunct?
urllib2: surprisingly hard to use in a number of ways.  There was some 
discussion about this early in Web-SIG.  I think the client stuff John 
Lee has done at: http://wwwsearch.sourceforge.net/ is better, and I 
think he's interested in that direction.  Probably not right now, but at 
some point this could well improve on urllib*
httplib: actually okay, kind of; needed for some things that urllib 
can't do.  But it also seems redundant in other ways.
urlparse: like os.path, this is a rather annoying module to use, though 
I guess it works fine.  I'd like to see something like Jason Orendorff's 
path module, but for URLs.
BaseHTTPServer, SimpleHTTPServer, CGIHTTPServer: it seems odd that this 
is three modules.  And none of the three actually claims to work that 
well.  It's wonky.  They're useful modules, but limited in scope.
Cookie: weird interface.  Has some insecure parts.  I think mod_python 
differs mostly in that it has secure alternatives.
xmlrpclib: a good module.
SimpleXMLRPCServer: like the HTTPServers, seems a little odd.
DocXMLRPCServer: what a weird module.
robotparser: never knew this existed.
HTMLParser: lives in the world between web and XML.  Some of the client 
tools in wwwsearch are very HTML-centric as well.  But it all fits together.
htmllib: deprecated, I think?  Or HTMLParser?  I don't know what's going 
on here.
htmlentitydefs: another odd little module.

Anyway, I think there's a case to be made for a new generation of web 
libraries, and a package to bring them together.

I don't know if we need deeper hierarchy than that.  E.g., 
web.wsgi.cgiadapter.  I don't think so.  I'd rather "WSGI" be a term 
only those in the know use -- it means nothing unless you expand the 
acronym, and even then it's pretty vague.  Ultimately I hope most web 
programmers just don't need to think about any of it.

>>> There's little harm in having a separate 'wsgi' distribution until 
>>> 2.5 rolls around.  I'm thinking the package should include:
>>>  * BaseHTTPServer-based WSGI server
>>>  * CGI-based WSGI gateway (run WSGI apps under CGI)
>>
>>
>> You've noted these are missing error handling.  What kind were you 
>> thinking of specifically?
>>
>> There's exception handling, which seems straightforward.
> 
> 
> Well, to be honest, I haven't a clue what one does about errors *after* 
> the headers are written.  You can't send anything useful to the client, 
> because the status is already set.
> 
> If you sent a Content-Length, you can break the connection before that 
> point, and it's a fair guess the client will know something's wrong.  If 
> you *didn't* send a content length and break the connection, the client 
> gets an incomplete file and maybe doesn't know it.  Sending an error 
> message once 'write()' has been called will garble the output.
> 
> All of these options are especially unsatisfactory when binary files are 
> involved, where "unsatisfactory" could mean anything from "annoying" to 
> "catastrophic" (e.g. garbling an executable).

Yes, you are right.  Which means the catcher has to keep track of the 
headers that were sent if it hopes to do anything.  In that case, it 
might check for text/html or text/plain; if the response is neither, 
just stop it short and log the error.  If it is one of those two, and 
the catcher is configured to show errors, then it could display them; 
cgitb goes to some length to make HTML render correctly.

That makes me think that wrapping start_response is more reasonable. 
Though it makes error resolution in servers more complex.
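
A minimal sketch of such a catcher, written as WSGI middleware for a 
current Python 3 and assuming the function being wrapped is the 
start_response callable.  The names ErrorCatcher, content_type, 
show_errors and log are hypothetical, and the optional third argument is 
simply passed through untouched.  It records the headers as they go by and 
only appends an error report when the response was declared as text/html 
or text/plain.

```python
import sys
import traceback
from html import escape

# Content types where an appended error report should not corrupt the output.
TEXT_TYPES = ("text/html", "text/plain")


def content_type(headers):
    """Return the bare media type from a list of (name, value) header pairs."""
    for name, value in headers:
        if name.lower() == "content-type":
            return value.split(";")[0].strip().lower()
    return None


class ErrorCatcher:
    """Hypothetical middleware that remembers which headers were sent."""

    def __init__(self, app, show_errors=False, log=sys.stderr):
        self.app = app
        self.show_errors = show_errors
        self.log = log

    def __call__(self, environ, start_response):
        sent = {}  # stays empty until the application calls start_response

        def recording_start_response(status, headers, *args):
            sent["status"] = status
            sent["headers"] = list(headers)
            return start_response(status, headers, *args)

        try:
            for chunk in self.app(environ, recording_start_response):
                yield chunk
        except Exception:
            report = traceback.format_exc()
            self.log.write(report)
            if not sent:
                # start_response was never called, so a clean error is possible.
                start_response("500 Internal Server Error",
                               [("Content-Type", "text/plain")])
                yield b"Internal Server Error\n"
            elif self.show_errors and content_type(sent["headers"]) in TEXT_TYPES:
                if content_type(sent["headers"]) == "text/html":
                    report = "<pre>%s</pre>" % escape(report)
                yield report.encode("utf-8", "replace")
            # Otherwise stop the response short; the error is already logged.
```

The sketch ignores Content-Length: if the application declared one, an 
appended report would still exceed it, so a real catcher would probably 
need to check for that header as well.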

>>   Spec compliance?  Certainly an anal version of these servers should 
>> be written, that checks every type passed around, looks for common 
>> mistakes, etc.  I don't know if the anal and the useable version need 
>> to be the same thing.
> 
> 
> I wasn't even addressing spec compliance, although test suites for all 
> the implementations, factored so that they could be used as a basis for 
> testing other implementations, would certainly be nice.

Yes, I've meant to work on this.  I have a simple "echo" application 
that sends results based on the query; throwing errors, displaying text, 
displaying the environ, etc.  I was thinking that, paired with a client, it 
could make a good structure for further testing.  Then the echo 
application could be coded in different styles of application as well -- 
for instance, jonpy, and the same tests run.  It would be useful for 
testing middleware as well.  I'll try to give it a go sometime soon.
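
A rough sketch of what such an echo application and a matching in-process 
client could look like, for a current Python 3.  The query parameters 
(text, status, environ, error) and the names echo_app and call_app are 
invented for illustration, and wsgiref is a later standard-library module 
used here only to fill in a plausible environ.

```python
from pprint import pformat
from urllib.parse import parse_qs
from wsgiref.util import setup_testing_defaults


def echo_app(environ, start_response):
    """Hypothetical test application whose behaviour is chosen by the query."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    if "error" in query:
        # ?error=message raises before any headers are sent
        raise RuntimeError(query["error"][0])
    if "environ" in query:
        # ?environ=1 dumps the environ as text
        body = pformat(environ)
    else:
        # ?text=... echoes the given text
        body = query.get("text", ["hello"])[0]
    data = body.encode("utf-8")
    status = query.get("status", ["200 OK"])[0]
    start_response(status, [("Content-Type", "text/plain; charset=utf-8"),
                            ("Content-Length", str(len(data)))])
    return [data]


def call_app(app, query_string=""):
    """Tiny in-process client: returns (status, headers, body)."""
    environ = {"QUERY_STRING": query_string}
    setup_testing_defaults(environ)  # fills in the remaining required keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Because call_app takes any WSGI callable, the same checks could be run 
against the echo application wrapped in middleware, or against a version 
of it written on top of another framework.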

>> Two models -- one that optimistically tries to load the cgi module in 
>> a fake environment (what I did), plus another that actually runs any 
>> CGI script.
> 
> I'm not following what the difference is, exactly, but I guess we'll 
> need to get into the design more.

One runner would actually fork a process and run the CGI script 
separately.  This would be useful for, say, implementing CGIHTTPServer 
in terms of WSGI.  It would always work, because it would actually run 
the script as a CGI script.
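
A sketch of that second runner, for a current Python 3, under some loud 
assumptions: the name cgi_script_app is made up, the script is launched 
through a Python interpreter rather than executed directly, the whole 
response is buffered in memory, and the header parsing is deliberately 
naive.

```python
import os
import re
import subprocess
import sys


def cgi_script_app(script_path, interpreter=sys.executable):
    """Return a WSGI app that runs script_path as a separate CGI process."""

    def app(environ, start_response):
        # CGI passes request metadata through the process environment, so
        # forward only string values and skip the dotted wsgi.* keys.
        env = dict(os.environ)
        env.update({k: v for k, v in environ.items()
                    if isinstance(v, str) and "." not in k})
        env.setdefault("GATEWAY_INTERFACE", "CGI/1.1")

        # The request body, if any, becomes the script's standard input.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        request_body = environ["wsgi.input"].read(length) if length else b""

        result = subprocess.run([interpreter, script_path], input=request_body,
                                stdout=subprocess.PIPE, env=env)

        # CGI output is a header block, a blank line, then the body.
        parts = re.split(rb"\r?\n\r?\n", result.stdout, maxsplit=1)
        body = parts[1] if len(parts) == 2 else b""

        status = "200 OK"
        headers = []
        for line in parts[0].decode("latin-1").splitlines():
            name, _, value = line.partition(":")
            if name.strip().lower() == "status":
                status = value.strip()  # CGI's way of setting the status line
            else:
                headers.append((name.strip(), value.strip()))

        start_response(status, headers)
        return [body]

    return app
```

The cost is one process per request, but nothing about the script has to 
be importable or well behaved inside the server process.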

>> I don't think the utility functions are a big deal at all, and I worry 
>> that there's some gotchas to email.Message, specifically where it is 
>> intended for email.  So I'm certainly not adamantly opposed to 
>> email.Message, but I'm not adamantly for it either.  I'd rather see a 
>> superclass of email.Message (such a superclass does not yet exist, but 
>> should be easy to write/extract) that is more minimal.
> 
> 
> Why don't you take a look at the code?  I have. 

Well good, now I don't need to ;)

> Here are the methods:
> 
> as_string, __str__ -- format the message as a string
> 
> is_multipart -- returns true if payload has been set to a list

Can you do this with HTTP?  I know some MIME stuff works (like 
content-disposition: attachment; filename=blah).  Would this work too? 
In a meaningful way?  The cgi module has some weird MIME stuff in it 
that I don't think any web client has ever exercised.

> get_unixfrom/set_unixfrom, add_payload/set_payload/get_payload/attach, 
> get_charsets, walk -- stuff for manipulating parts of the message we 
> don't care about.

Yes.  If these are used accidentally, will it affect the as_string 
representation?

> set_charset/get_charset -- sets the character set parameters of the 
> content-type, which is actually useful.  On the down side, setting the 
> character set sets MIME-Version, but it also sets the 
> Content-Transfer-Encoding, so it doesn't force the server to default one.

Would that start opening up the possibility of accepting Unicode to 
write()/app_iter?

> __len__, __getitem__, __setitem__, __delitem__, __contains__, has_key, 
> get, keys, values, items -- case-insensitive dictionary-like interface 
> (i.e., the stuff we mainly want)
> 
> get_all -- all values for a header name
> 
> add_header, replace_header -- more stuff we want

Very good, though not hard to reimplement.

> get_type, get_main_type, get_subtype, get_content_type, 
> get_content_maintype, get_content_subtype, 
> get_param, get_params, set_param, del_param, set_type, get_boundary, 
> set_boundary, get_content_charset -- miscellaneous content-type analysis 
> and manipulation.  Not necessarily very helpful, except maybe for 
> middleware.  But they hardly hurt.
> 
> get_filename -- extract filename from Content-Disposition if present.  
> Not particularly helpful, but also not damaging in any way.

Sure.

> 
> Perhaps more eyes should look at this, but I haven't found anything in 
> here that's damaging or even annoying apart from setting MIME-Version if 
> it's not there and the content-type is touched.

Okay, looking through the code briefly, I can't help but think that all 
the complex parts are parts we don't care about.  A case-insensitive 
dictionary that accepts multiple values for a key isn't hard to 
implement.  Certainly we could match the interface of email.Message 
where it applies.  If it ended up in the standard library, that's fine 
-- it's one of those things people keep reinventing anyway, so a 
canonical implementation would be good.
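
As an illustration of how little is involved, here is a sketch of such a 
dictionary for a current Python 3.  The class name Headers is hypothetical. 
Where a method shares a name with email.Message it tries to follow that 
class's behaviour: a missing header gives None, item assignment appends 
rather than replaces, and replace_header keeps the original position and 
spelling of the name.

```python
class Headers:
    """Hypothetical minimal header collection: case-insensitive, multi-valued."""

    def __init__(self, items=()):
        self._items = list(items)  # (name, value) pairs in insertion order

    def __len__(self):
        return len(self._items)

    def __contains__(self, name):
        name = name.lower()
        return any(k.lower() == name for k, _ in self._items)

    def __getitem__(self, name):
        # A missing header yields None instead of raising KeyError.
        return self.get(name)

    def __setitem__(self, name, value):
        # Assignment appends, so repeated headers such as Set-Cookie work.
        self._items.append((name, value))

    def __delitem__(self, name):
        # Removes every occurrence; silently does nothing if there is none.
        name = name.lower()
        self._items = [(k, v) for k, v in self._items if k.lower() != name]

    def get(self, name, default=None):
        # First value for the header, compared case-insensitively.
        name = name.lower()
        for k, v in self._items:
            if k.lower() == name:
                return v
        return default

    def get_all(self, name, default=None):
        # Every value for the header, in order, or the default if absent.
        name = name.lower()
        values = [v for k, v in self._items if k.lower() == name]
        return values or default

    def replace_header(self, name, value):
        # Replace the first match in place; KeyError if the header is absent.
        lowered = name.lower()
        for i, (k, _) in enumerate(self._items):
            if k.lower() == lowered:
                self._items[i] = (k, value)
                return
        raise KeyError(name)

    def items(self):
        return list(self._items)
```

Since items() returns plain (name, value) pairs, the result could be handed 
to a server as the header list without further conversion.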

>>> The only other thing that comes to mind is requiring servers to 
>>> support multiple 'start_response' calls in some way that makes sense 
>>> for exception handlers, while requiring it to still work in the case 
>>> where an extension API has already been used for output.
>>
>>
>> That seems too hard.
> 
> 
> Well, to some extent we have to look at the question of what should 
> happen in those circumstances anyway, whether we solve the problem in 
> that specific way or not.  Because if the application *does* call 
> start_response more than once, the server has to be able to handle it 
> *somehow*.  Really, the ultimate error handling *has* to be done by 
> servers, unless they want to take the route of crashing the entire 
> process when something bad happens.  :)

Good question.  I think servers should consider that an error, but they 
should handle that error gracefully.  Which probably means keeping a 
"has start_response already been called" flag.

Now, if I could get access to that flag from middleware... and maybe 
access to the headers and status that have already been sent... (and 
really, why not?  We aren't worried about streaming headers like we are 
about bodies)
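
A sketch of what that could look like on the server side, for a current 
Python 3.  Everything here is hypothetical and not part of any spec: the 
ResponseState class, the run_request helper, and the environ key 
"example.response_state" through which middleware could read the flag, 
status and headers.  A repeated call is treated as an error but handled 
gracefully by logging it and keeping the first status and headers.

```python
import sys


class ResponseState:
    """Hypothetical per-request record a server keeps about the response."""

    def __init__(self, log=sys.stderr):
        self.started = False  # has the response already been started?
        self.status = None
        self.headers = None
        self.chunks = []      # body data, buffered here for simplicity
        self.log = log

    def start_response(self, status, headers):
        if self.started:
            # An application error: log it, keep the first status and headers.
            self.log.write("response already started; ignoring %r\n" % (status,))
        else:
            self.started = True
            self.status = status
            self.headers = list(headers)
        return self.chunks.append  # stands in for the write() callable


def run_request(app, environ, log=sys.stderr):
    """Run one request and return the state that was recorded for it."""
    state = ResponseState(log)
    # Assumed extension key; middleware can inspect the state through it.
    environ["example.response_state"] = state
    result = app(environ, state.start_response)
    try:
        for chunk in result:
            state.chunks.append(chunk)
    finally:
        if hasattr(result, "close"):
            result.close()
    return state
```

Middleware further down the stack could then check 
environ["example.response_state"].started before deciding whether an error 
page is still possible.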

-- 
Ian Bicking  /  ianb at colorstudy.com  / http://blog.ianbicking.org