[Web-SIG] My experiences implement WSGI on java/j2ee/jython.
py-web-sig at xhaus.com
Mon Aug 30 02:32:23 CEST 2004
Firstly, I must say, I am totally impressed with the WSGI initiative. At
first it wasn't clear how such low level structures could improve the
fragmented situation with python-web frameworks. But now that I've spent
some time implementing a framework that complies with the spec, I
understand it a *lot* better, and can see a lot of its benefits.
Secondly, I must apologise in advance for the length of this post :-)
I decided to write a java/j2ee/jython framework which layers WSGI on top
of java servlets. I decided this for a number of reasons:
- Because I want WSGI to succeed, and in open-source chances of
success are greatly enhanced by running code.
- Because jython needs to be included in WSGI from the ground up.
- Because cpython and jython should be able to share web components.
- Because WSGI needs testing against as many server architectures as
possible.
- Because the best way to test the quality and usability of a spec is
to write software that implements it.
- Because I pray for the day when we can pick and mix capabilities
from the huge wealth of python web frameworks out there.
- Because J2EE (i.e. traditional servlets) are sometimes far too
restrictive, in terms of the way they handle cookies, authorisation,
etc, and require configuring lots of XML files, which can be a pain: I
don't like coding in XML, I like coding in python, where I can keep my
configuration all in an appropriate format.
- Because I want cpythonistas to keep jython in mind.
- Because someone had to do it :-), and I do J2EE and jython stuff all
the time in my work
- Because WSGI was small enough to implement in a day or two.
- A load of other good reasons.
My code is not ready for release. I only spent yesterday writing it:
it's not big, approx 500 lines of java. But I haven't even compiled it
yet, so it's got loads of syntax errors, no comments, no documentation,
etc. I expect the compilation and debugging to take a day or two.
However, I'm ridiculously busy at the moment, and really can't spare
much time. The fact that I sacrificed my weekend to get jython WSGI
up-and-running quickly may give you an idea of how important I consider
the WSGI initiative. I promise I'll release my code by next weekend,
whatever state it's in. If it's not 100% running, it'll be 90+% running,
My design for the moment is really just to show a proof of concept, and
a bare-bones framework. The framework will simply allow, through
configuration, the user to map a URL to a python file, and to specify the
name of a callable object within that file, which will obviously be the
application. Application objects will be cached, based on the filename
they came from. The request will be dispatched to the application in a
WSGI compliant way. Simple. For the moment, I'm taking the easy way out,
in relation to things like threading guarantees. Anything that asks to
be single-threaded will still use a single instance, but calls from
multiple threads will be synchronized on that single object, which
wouldn't really work in a production framework. As WSGI evolves, I'll
make these kinds of facilities more robust and scalable.
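A minimal sketch of the design described above, written in present-day Python rather than the java of the actual framework. The class name AppCache and its methods are hypothetical, invented for illustration: application callables are cached by the filename they came from, and every call is serialised on a per-application lock, which is the "easy way out" for threading.

```python
import importlib.util
import threading


class AppCache:
    """Cache application callables, keyed by the file they were loaded from."""

    def __init__(self):
        self._apps = {}                 # filename -> (callable, lock)
        self._guard = threading.Lock()  # protects the cache dictionary

    def load(self, filename, callable_name):
        with self._guard:
            if filename not in self._apps:
                spec = importlib.util.spec_from_file_location(
                    "wsgi_app_module", filename)
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)
                app = getattr(module, callable_name)
                self._apps[filename] = (app, threading.Lock())
            return self._apps[filename]

    def dispatch(self, filename, callable_name, environ, start_response):
        app, lock = self.load(filename, callable_name)
        # Crude threading guarantee: one caller at a time per application.
        with lock:
            return list(app(environ, start_response))
```

Holding the lock for the whole call keeps the sketch simple, but it also shows why this would not scale in a production framework: one slow request blocks every other request for the same application.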
I don't see the point yet in trying to build any more facilities into my
framework, e.g. url->object mapping, session management, page-template
management components, authorization, etc. Hopefully, all of these
facilities will become available as WSGI middleware components, written
in nice python: not java, or nasty apache conf files, or servlet
container XML files, blah, blah, blah.
Anyway, while I was writing my thing (with printed WSGI spec in hand,
covered in annotations, tick marks and red ink :-), I came across a few
points in the spec that I'd like to raise about things that are either
observations, or things that are incompletely specified, or that induce
me to misunderstand, or seem just right or wrong.
Also, I've spent today catching up with the web-sig archives, to review
everyone's comments (now that I'm in a position to understand them), and
to make sure that I'm not trolling over old ground. So I've added one or
two points of my own, based on reading those archives. Hopefully some of
them will be useful.
Lastly, does anyone have any name suggestions for a
java/j2ee/jython WSGI-compliant framework? I've been thinking along the
lines of "modjy", but I'm open to better ones :-)
So on to the points/questions.
0. On choice of CGI as a basis.
My experience with J2EE has clearly demonstrated to me that CGI is the
right choice to base WSGI upon. The J2EE servlet spec has a specific
method to return every single CGI variable: the specs even mention "this
method returns the same as the CGI variable SCRIPT_NAME", etc. My job
as "translator" couldn't have been easier. I expect that many other
containers/frameworks will also support the CGI spec in this way.
1. Default values of environment variables when not present.
The spec says that compulsory environment variables, for example
"CONTENT_LENGTH" or "CONTENT_TYPE", must have a value, i.e. "must be
present, but may be an empty string, if there is no more appropriate
value for them". I read "empty string" to mean "".
There are obviously two different choices for how to represent values
for headers/env-vars that are not present in the request, i.e. 1. an
empty string as described above or 2. as a python None value. It seems
more correct to me to use the latter option, None, for when the
header/env-var is not available, i.e. the client did not send it. This
allows the use of the "" value to indicate (the admittedly rare and
malformed case) that the client sent the header name, but did not
specify a header value. If WSGI uses the empty string for both cases,
then we lose the ability to distinguish between when the header was sent
with no value, and when it wasn't sent at all.
I don't think it's a big deal losing that ability, but I could imagine
that there might be, for example, some security application that might
like to have access to that information.
For simplicity of the spec, and robustness of servers/apps running on
WSGI, I understand why it is a good thing to make the default values as
robust as possible, i.e. in case some app author tries to use a header
value without checking if it is None first.
I suppose I'm really pointing out a possible wording difficulty in the
spec, which says "may be an empty string, if there is no more
appropriate value". To me None is "a more appropriate value" sometimes,
so I suppose I could legitimately interpret that to mean that I can use
None values in my WSGI-compliant framework, because my server
infrastructure allows me to detect their absence or lack of value.
So perhaps either the wording of the spec needs to be tightened up to
exclude this? Or the default environment values need to be more clearly
specified? Or perhaps a discussion of None vs. empty string needs to
be added to the Q&A at the end?
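To make the ambiguity concrete, here is a small sketch. The function build_environ and its use_none_for_missing flag are hypothetical, not part of the spec; they only illustrate the two candidate conventions for a header the client did not send.

```python
def build_environ(request_headers, use_none_for_missing=False):
    """Build the two compulsory variables from a dict of request headers."""
    missing = None if use_none_for_missing else ""
    return {
        "CONTENT_TYPE": request_headers.get("Content-Type", missing),
        "CONTENT_LENGTH": request_headers.get("Content-Length", missing),
    }


sent_but_empty = {"Content-Type": ""}   # header name sent, no value
not_sent = {}                           # header absent altogether

# Empty-string convention: the two requests look identical to the app.
same = (build_environ(sent_but_empty)["CONTENT_TYPE"]
        == build_environ(not_sent)["CONTENT_TYPE"])

# None convention: the app can tell them apart, but must check for None.
differ = (build_environ(sent_but_empty, True)["CONTENT_TYPE"]
          != build_environ(not_sent, True)["CONTENT_TYPE"])
```

The price of the None convention is visible in the last comment: application code such as environ["CONTENT_TYPE"].lower() would fail on None, which is the robustness argument for the empty string.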
2. The SCRIPT_NAME variable.
At first I was a little wary of the SCRIPT_NAME variable, and how I
would construct it, until I realised that the beginning of the
URL->Callable mapping is outside the scope of WSGI: it is in the control
of whichever program/process/container is receiving HTTP requests
through sockets from the client, and resolving/dispatching them
according to its configuration files: in my case that was a J2EE
container, e.g. Tomcat.
The J2EE call that returns a value equivalent to the CGI SCRIPT_NAME
variable is the HttpServletRequest.getServletPath() method. There is an
interesting note on it, which says that "This method will return an empty
string ("") if the servlet used to process this request was matched
using the "/*" pattern." Which seems a little odd, until you realise
that the SCRIPT_NAME = "" case is when the application object is
responsible for dealing with the entire URL space. Maybe it's worth
adding a note to this effect in the WSGI spec as well? It helped me
understand things better.
An idea occurs to me for a nice little reusable WSGI middleware
component which is a URI mapper, with functionality akin to apache
mod_rewrite, resolving URIs to python callables. A lot of frameworks
like to do things with URL rewriting and mapping, in order to present a
nice clean URL interface to a tree of objects. Quixote is one such
framework that likes to have crisp URLs. But much of the time installing
such frameworks requires configuring apache and invoking mod_rewrite and
its "cool voodoo" to get the job done. Which can be difficult to debug
and get working, and scares newbies. (On re-reading the spec, and the
mailing list, I see I'm not the only one to have thought of such a uri
mapping component :-)
If I wrote such a reusable mapping component, I could then simply
configure my entire "container", e.g. Apache, Tomcat, etc, etc, to
simply resolve all requests for a URL hierarchy to my python component,
and nice-n-easy python code takes care of it from there, no mod_rewrite
rules, no complex java servlets mapping algorithm: just python. A big
win in terms of both installation simplicity and portability, since that
standard component could then be used across all WSGI frameworks and the
containers in which they live. I like this WSGI idea :-)
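A rough sketch of what such a mapping component might look like, assuming the (environ, start_response) calling convention and simple prefix matching. The class name URIMapper is hypothetical, and a real component would want patterns closer to mod_rewrite than plain prefixes.

```python
class URIMapper:
    """Middleware that dispatches on the longest matching URL prefix."""

    def __init__(self, routes):
        # routes: dict of prefix (e.g. "/blog") -> application callable.
        # Longest prefix first, so "/blog/admin" wins over "/blog".
        self.routes = sorted(routes.items(),
                             key=lambda item: len(item[0]), reverse=True)

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        for prefix, app in self.routes:
            if path == prefix or path.startswith(prefix + "/"):
                child = dict(environ)
                # Move the matched prefix from PATH_INFO to SCRIPT_NAME.
                child["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", "") + prefix
                child["PATH_INFO"] = path[len(prefix):]
                return app(child, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]
```

With the container configured to hand the whole URL hierarchy to this one object, the outer SCRIPT_NAME is "" and the mapper fills it in as it descends.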
3. Status code and message.
The WSGI spec states that the status value passed to start_response
should be of the form "999 Message here". That's fine, I can parse up
the string easily enough to get the java data types I need to send to
the container. However, J2EE does not allow me to set the message
string: I can only set the status code, and that must have an integer
So, in terms of compliance with WSGI, am I in violation of the WSGI spec
by not transmitting the actual textual status message specified by the
application? If that's a problem, there's nothing I can do about it.
I wonder how often this will be the case with other server/container
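For reference, splitting the status string is a one-liner in python. A minimal sketch follows; the function name split_status is hypothetical. On a platform that only accepts the integer, the reason phrase is simply dropped after this step.

```python
def split_status(status):
    """Split '404 Not Found' into (404, 'Not Found')."""
    code_text, _, reason = status.partition(" ")
    if len(code_text) != 3 or not code_text.isdigit():
        raise ValueError("malformed status string: %r" % (status,))
    return int(code_text), reason


code, reason = split_status("404 Not Found")
# code is the integer a servlet-style API wants; reason may have to be
# discarded if the container offers no way to set it.
```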
4. Binary vs. textual writing.
Normally, when python opens a file in text mode, line-ending translation
takes place on all python strings written to the file, changing '\n' to
whatever is the appropriate local line-ending. This is not noticeable on
*nix, since *nix uses the same line-ending character as python, '\n', so
no translation is necessary. This means that people running python on
*nix can write binary data through channels opened in text mode. On
other platforms though, namely Windows and MacOS, different line-endings
are used, and python's '\n' gets translated to '\r\n' and '\r'
respectively. Which corrupts binary files, e.g. .jpg, .gif, if they
contain '\n'. So Windows and MacOS python users must open files
explicitly in binary mode if they want to avoid this translation.
It is a fundamental requirement (to me at least) that WSGI be able to
handle writing of binary data. And I'm fairly sure the intention for the
write() callable in WSGI is that it take python "strings", which
includes strings of binary data. But perhaps it needs to be made
explicitly clear in the WSGI spec that the write() callable writes in
binary mode, i.e. that no translation is taking place on byte strings
passed to it, and the application/user is responsible for all encoding
concerns relating to byte strings passed to the write() callable.
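A small illustration of the problem and of a write() callable that avoids it. This is a sketch in present-day Python, where byte strings are a separate type; the helper names text_mode_translate and make_write are hypothetical, and the translation is simulated rather than performed by a real text-mode file.

```python
import io

payload = b"GIF89a\n\x00\x01"   # binary data that happens to contain '\n'


def text_mode_translate(data, line_ending=b"\r\n"):
    """Simulate what a text-mode channel does to '\\n' on such a platform."""
    return data.replace(b"\n", line_ending)


def make_write(binary_stream):
    """Return a write() callable that passes byte strings through untouched."""
    def write(data):
        if not isinstance(data, bytes):
            raise TypeError("write() expects a byte string")
        binary_stream.write(data)
    return write


out = io.BytesIO()
write = make_write(out)
write(payload)
# out.getvalue() is identical to payload; the translated copy is one byte
# longer and no longer a valid image.
corrupted = text_mode_translate(payload)
```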
5A. Python 2.1 vs. python 2.2: iterators and generators.
The WSGI spec says that python 2.2 features are required to be
compliant. However, it appears to me that the only python 2.2 features
in use are iterators and generators, used when the application object
returns an iterator. In fact, it's just that the example in the WSGI
spec uses a generator (and its corresponding 'yield' keyword): actual
applications are not required to use a generator: they can also return
an object that implements the iterator protocol. Which means returning
an object with a .next() method when the .__iter__() method is called.
The iterator.next() method keeps returning values, until the iterator
runs out, in which case it raises StopIteration. Like generators, the
iterator protocol was also introduced in python 2.2, but they are two
However, even though jython is based on python 2.1, and thus doesn't
have built-in support for either iterators or generators, I have still
implemented the iterator protocol in my java/jython framework, by simply
invoking the .__iter__() and .next() methods on application objects, and
catching StopIteration exceptions. So I can support components and
applications returning iterators, and I'm thus compliant with the spec,
even though I'm running on 2.1. (This is only possible because I'm
embedding: it is still not possible to support the iterator protocol in,
say, jython for-loops)
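The manual approach can be sketched in a few lines. This version is written for present-day Python, where the method is spelled __next__() rather than the .next() of python 2.2; the sketch accepts either spelling. The names Countdown and drain are hypothetical.

```python
class Countdown:
    """An application result that implements the iterator protocol by hand."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return ("chunk %d\n" % self.n).encode("ascii")

    next = __next__   # the python 2.2 spelling discussed in this post


def drain(result, write):
    """Drive an iterator without a for-loop, as an embedding server might."""
    iterator = result.__iter__()
    step = getattr(iterator, "__next__", None) or iterator.next
    while True:
        try:
            chunk = step()
        except StopIteration:
            break
        write(chunk)
```

Nothing here needs language-level support for iterators: the server only calls two methods and catches one exception, which is why the same trick works from java.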
Does the spec need to be changed to reflect this iterators/versioning
issue? Or to more clearly define the difference between iterators and
It's conceivable that even a python 1.5 framework could be programmed to
support the iterator protocol: it's *very* easy to implement.
5B. A "python.version" WSGI variable?
Of course, it will be the case that some middleware and applications will
need to use more advanced and recent (2.2, 2.3, 2.4) language
features, such as generators, generator expressions, decorators, etc.
But such components and applications will not be usable under jython,
which is 2.1. It would be nice for components and applications to have a
way of knowing what version of python they are running under. Similarly,
there will be jython components and applications that require java
libraries, and thus won't be usable on cpython of any version.
Would it be useful to define a WSGI variable "python.version", similar
to "wsgi.version", which gives the python version in effect? In most
cases under jython, it wouldn't help, because its 2.1 compiler would
choke when loading python files with newer python syntax anyway, giving
syntax errors. But it might be useful in some circumstances, perhaps for
sophisticated dispatchers with the requisite meta-data available to
them? I'm not sure on this one. Maybe the values of sys.platform and
os.name give enough information to deal with this problem?
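A sketch of how such a variable might be used, assuming the server puts a version tuple in the environment under the proposed "python.version" key. That key is only the suggestion made above, not part of the spec, and the helper names are hypothetical.

```python
import sys


def add_python_version(environ):
    """Server side: advertise the running python version (proposed key)."""
    environ["python.version"] = tuple(sys.version_info[:3])
    return environ


def requires_python(minimum):
    """Middleware factory: refuse to run the app on an older python."""
    def wrap(app):
        def checked(environ, start_response):
            if environ.get("python.version", (0,)) < minimum:
                start_response("500 Internal Server Error",
                               [("Content-Type", "text/plain")])
                return [b"This component needs a newer python"]
            return app(environ, start_response)
        return checked
    return wrap
```

As noted, this only helps when the component can be imported at all; a file using newer syntax would fail at compile time before any such check runs.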
6. Streaming and flushing.
I see there has been discussion on the list about streaming output and
flushing. In one message, Philip said "I'm suggesting that write()
should be guaranteed to either:
1) Flush all output before returning, or
2) Put data in a buffer that will be emptied by another thread or by
To be a conforming implementation, a server/gateway must do one or the
In the J2EE case (and I'm sure with Apache CGI), that's very simple to
deal with, since the container will do its own buffering completely
outside your control, and send the pieces with chunked-transfer encoding
if necessary. So even if I put a flush on the output channel in my
framework, I'm only flushing it to the container's buffer: it's still
not guaranteed to send output back down the return socket to the client.
Just a datapoint.
I read some discussion in the lists on how to handle container specific
facilities, e.g. Apache/mod_python's ability to internally redirect a
J2EE offers the same capabilities, to internally redirect a request,
without sending a response back to the client. It happens in a slightly
different way, because you first ask your container for a dispatcher,
based on a url, and then call that dispatcher to redirect to the URL.
And the client may not see any redirect HTTP responses: it's all
internal to the container.
I see the solution to this redirect platform-dependence problem in the
implementation of a platform-independent WSGI middleware component that
takes all responsibility for redirects. This component examines the
wsgi.environment present, seeking hints for the optimal way to redirect
the request: if mod_python is available, use the mod_python API call:
if modjy is available, use the getDispatcher(uri).redirect() dance, etc.
If none of these platform specific techniques are available, it can fall
back to sending a 302 or 307 response back to the client, and let the
client re-request the new URL.
If the platform specific techniques are available, their availability
will be signalled in wsgi.envvars by the presence of variables such as
"mod_python.request" or "modjy.servlet_context", etc. So one
ultraportable component could do it all (albeit chock full of special
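A sketch of the selection logic such a component might use. The environment keys are the ones named above; the function names and the internal_handlers argument are hypothetical. The platform-specific calls themselves are deliberately left as pluggable handlers, since their exact APIs are outside the scope of this sketch; only the portable HTTP fallback is implemented.

```python
def choose_redirect_strategy(environ):
    """Pick a redirect technique from hints in the environment."""
    if "mod_python.request" in environ:
        return "mod_python"
    if "modjy.servlet_context" in environ:
        return "modjy"
    return "http"


def redirect(environ, start_response, location, internal_handlers=None):
    """Redirect internally when a handler is registered, else send a 302."""
    strategy = choose_redirect_strategy(environ)
    handler = (internal_handlers or {}).get(strategy)
    if handler is not None:
        # Platform-specific internal redirect, supplied by the caller.
        return handler(environ, start_response, location)
    start_response("302 Found",
                   [("Location", location), ("Content-Length", "0")])
    return [b""]
```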
8. Write callable and fileno()
It is a good idea to check for the fileno() attribute on the write
callable, since many platforms/frameworks have high-performance ways of
transferring file contents to sockets, for example. Java 1.4 nio has
this capability, through the use of directBuffers, memory-mapped files,
and special natively implemented methods to transfer between the two.
I'd be surprised if containers like Apache don't support something
similar. This can drastically improve throughput on static files.
Java objects have "channel"s, or "outputStream"s not "fileno"s. But
that's an easy problem to fix.
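A sketch of the check itself. Which object carries fileno() is an assumption here: the sketch inspects the body object handed to the server. The function name send_body is hypothetical, and shutil.copyfileobj merely stands in for whatever high-performance transfer the platform really offers.

```python
import shutil


def send_body(body, out):
    """Use a file-level copy when body has a usable fileno(), else iterate."""
    fileno = getattr(body, "fileno", None)
    if callable(fileno):
        try:
            fileno()   # some file-like objects have the method but no descriptor
        except (OSError, ValueError):
            pass
        else:
            # Placeholder for a native file-to-socket transfer.
            shutil.copyfileobj(body, out)
            return "file"
    for chunk in body:
        out.write(chunk)
    return "iterated"
```

The return value exists only so the chosen path can be observed; a real server would not need it.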
9. Server-detected headers.
I can see the reason for servers/containers intercepting client headers
and translating/augmenting/deleting them. However, do we need a
specification of what to do with certain specified headers? As with
CGI, should I recognise the "Status: " header or the "Location: "
header, and translate it to the relevant status code, or do a redirect,
respectively? If I don't do those translations, won't I be breaking
reams of python CGI code out there that relies on Apache doing this?
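If the translation were wanted, it could live in a small helper or middleware rather than in every server. A sketch, assuming CGI-style behaviour where a "Status" header sets the response status and a "Location" header without an explicit status implies a 302. The function name translate_cgi_headers is hypothetical.

```python
def translate_cgi_headers(headers):
    """Turn CGI-style output headers into (status, remaining_headers)."""
    status = "200 OK"
    explicit_status = False
    location_seen = False
    remaining = []
    for name, value in headers:
        lowered = name.lower()
        if lowered == "status":
            # "Status:" is a CGI convention, not a real response header.
            status = value.strip()
            explicit_status = True
            continue
        if lowered == "location":
            location_seen = True
        remaining.append((name, value))
    if location_seen and not explicit_status:
        status = "302 Found"
    return status, remaining
```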
10. The "wsgi.errors" environment variable.
Under J2EE, setting the "wsgi.input" variable is easy, I just wrap the
HttpServletRequest.getInputStream() with an org.python.core.PyFile, and
However, the J2EE HttpServletRequest has no corresponding error stream,
nor does the corresponding HttpServletResponse paired with each request.
The only mechanism I can use to send error output is the "sendError(int,
message)" method of HttpServletResponse. Which allows me to send both an
integer status code and a textual message, which the J2EE docs say "The
server defaults to creating the response to look like an HTML-formatted
server error page containing the specified message, setting the content
type to "text/html", leaving cookies and other headers unmodified".
So I can't send error output this way without also knowing a status code
for it as well.
Which makes me wonder what the "wsgi.errors" variable is for? Yes, it's
for writing error data. But what do we expect to happen to data that
gets written to it? Will it be wrapped or translated in some way, and
used to construct an error response to the user? Or should it be
locally logged by the server?
I know that this is all J2EE specific stuff, as is confirmed by the rest
of the documentation sentence I quoted above: "If an error-page
declaration has been made for the web application corresponding to the
status code passed in [to the sendError method], it will be served back
in preference to the suggested msg parameter." WSGI (rightly) has no
concept of "configured error page declarations", so it would seem the
"sendError" method is not the right method to use to implement
So I'm going to have to treat the error output in some other way, which
means I need to know more about what it is. Before I can implement a
jython framework that is fully compliant with the WSGI spec, I need to
know what will happen to any output sent to "wsgi.errors", so that I can
code for whatever eventualities arise.
Or if it's always to be a framework specific thing, maybe I'll just
redirect all "wsgi.errors" output to /dev/null, for example? The J2EE
ServletContext for each servlet has a "log(message)" method. Maybe I
should just send error output there, in which case it will end in the
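The logging option can be sketched as a file-like object that forwards complete lines to a log. Python's logging module stands in here for the servlet log(message) method, and the class name LogErrorStream is hypothetical.

```python
import logging


class LogErrorStream:
    """File-like object for "wsgi.errors" that sends each line to a logger."""

    def __init__(self, logger):
        self.logger = logger
        self._pending = ""   # text received since the last newline

    def write(self, text):
        self._pending += text
        while "\n" in self._pending:
            line, self._pending = self._pending.split("\n", 1)
            self.logger.error(line)

    def writelines(self, lines):
        for line in lines:
            self.write(line)

    def flush(self):
        # Emit any partial line still waiting for its newline.
        if self._pending:
            self.logger.error(self._pending)
            self._pending = ""
```

A server would then set environ["wsgi.errors"] to an instance of this class, so applications can write to it like any other stream.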
That's all for now.