[Web-SIG] Python 3.0 and WSGI 1.0.

Mon May 4 17:02:48 CEST 2009

Hello everybody,

I recently started looking at supporting Python 3 in one of my libraries
(Werkzeug), mainly because the MoinMoin project, which uses the library in
question, is considering a move to Python 3.  Right now Werkzeug treats HTTP as
Unicode aware in the sense that everything that carries text data is decoded
from and encoded into a known encoding.

This is partially against the specification and not entirely correct, but it
works best with modern browsers and is also what Django and Paste are doing.

Basically, the incoming request data is .decode(encoding)d (usually utf-8)
before being passed to the user code, and unicode data is encoded back into the
same encoding before it's sent to the server.
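
As a minimal sketch of that behavior (the helper names and the "replace" error
handling are made up for illustration and are not Werkzeug's actual API):

```python
# Hypothetical helpers, not taken from Werkzeug, Django or Paste.
CHARSET = "utf-8"  # assumed application-wide encoding


def decode_incoming(raw, charset=CHARSET):
    """Decode raw request bytes before user code sees them."""
    # "replace" keeps malformed input from raising; this is an assumption.
    return raw.decode(charset, "replace")


def encode_outgoing(text, charset=CHARSET):
    """Encode unicode data back before it is handed to the server."""
    return text.encode(charset)


raw_query = b"name=J\xc3\xbcrgen"      # bytes as they arrive on the wire
text = decode_incoming(raw_query)      # what the user code works with
wire = encode_outgoing(text)           # what goes back to the server
```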

Now why is the current behavior of Python 3 a problem here?  The encode/decode
hack from above is obviously a solution for these kinds of applications, albeit
not a good one.  Interfaces like mod_wsgi already have the data as a bytestring
and would decode it from latin1 just so that the application can encode it back
and decode it as utf-8.  Not only is this slow, it also means that the code
does not survive a run through 2to3.
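
To make the round trip concrete, here is a small sketch in Python 3 (the path
value is just an example I made up):

```python
# UTF-8 encoded bytes as the server receives them.
raw = b"/caf\xc3\xa9"

# A str-based gateway decodes them as latin1, which never fails but
# yields a string that is not what the client meant.
as_server_str = raw.decode("latin-1")

# The application has to undo that and decode again with the real encoding.
recovered = as_server_str.encode("latin-1").decode("utf-8")
```

The latin1 step is lossless, so the original bytes can always be recovered,
but every value goes through two extra conversions.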

Now you could argue that the libraries were wrong in the first place and should
support unicode strings that were encoded from latin1 and decoded, but it seems
like very few libraries support that.

Now which strings carry data that could contain non-ascii characters from a
source with an unknown encoding?  Right now these are the following:

  * PATH_INFO
  * SCRIPT_NAME
  * QUERY_STRING
  * CONTENT_TYPE
  * HTTP_*

This also covers all headers that carry non-integer values (like
HTTP_CONTENT_TYPE and CONTENT_TYPE).  Now it's true that the headers should not
contain non-latin1 values, but reality shows that they do.  Cookies are
transmitted as headers as well, and no browser complains if you put utf-8
encoded stuff into them.  It may be the case that for the browser this looks
like latin1, but in the end the application decodes it from utf-8 and is happy.

Data sent from the application can continue to work like it does currently.
Django, Werkzeug, Paste and many others that support unicode output will just
check if the output is unicode and, if that's the case, encode it to the
desired encoding.
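
Roughly like this (a sketch only; `encode_chunks` is a hypothetical name and
not how any of these libraries actually spell it):

```python
def encode_chunks(iterable, charset="utf-8"):
    """Yield bytes for every chunk, encoding unicode chunks on the way out."""
    for chunk in iterable:
        if isinstance(chunk, str):      # unicode output: encode it
            chunk = chunk.encode(charset)
        yield chunk                     # bytes pass through untouched


body = list(encode_chunks(["h\u00e9llo ", b"world"]))
```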

Also, people abuse middlewares a lot, and those deal with incoming and outgoing
data as well.  One can expect these middlewares to work on known encodings too,
so they would do the encode / decode dance as well.

If anyone knows the encoding of the environ, it is the webserver.  Apparently
there are issues getting the encoding of the environ, but those won't go away
by moving the problem to the web application.

Because of that I propose that Python 3.1 ship a version of wsgiref that uses
bytestrings for the headers in question, and that a section on Python 3
compatibility based on that be added to PEP 333.
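
To illustrate what that could mean for an application, here is a sketch under
the assumption that exactly the keys listed above arrive as bytes.  Nothing
like this exists in wsgiref today; `BYTES_KEYS`, `is_bytes_key` and
`decode_environ` are names invented for the example:

```python
# Hypothetical: keys from the list above that would carry bytestrings.
BYTES_KEYS = ("PATH_INFO", "SCRIPT_NAME", "QUERY_STRING", "CONTENT_TYPE")


def is_bytes_key(key):
    """True for environ keys that would hold bytes under this proposal."""
    return key in BYTES_KEYS or key.startswith("HTTP_")


def decode_environ(environ, charset="utf-8"):
    """Return a copy of environ with the bytes values decoded once."""
    decoded = {}
    for key, value in environ.items():
        if is_bytes_key(key) and isinstance(value, bytes):
            value = value.decode(charset, "replace")
        decoded[key] = value
    return decoded


environ = {
    "REQUEST_METHOD": "GET",
    "PATH_INFO": b"/caf\xc3\xa9",
    "HTTP_COOKIE": b"name=J\xc3\xbcrgen",
}
decoded = decode_environ(environ)
```

The application (or library) decodes each value exactly once with the
encoding it expects, and the latin1 detour disappears.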

I volunteer for writing a new section on Python 3 in PEP 333 :-)

Regards,
Armin