[Web-SIG] WSGI Amendments thoughts: the horror of charsets

Fri Nov 14 18:14:08 CET 2008

Ian Bicking wrote:

> As it is (in Python 2), you should do something like 
> environ['PATH_INFO'].decode('utf8') and it should work.

See the test cases in my original post: this doesn't work universally. 
On WinNT platforms PATH_INFO has already gone through a decode/encode 
cycle which almost always irretrievably mangles the value.
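As a rough illustration of why such a cycle is lossy, here is a minimal sketch. It assumes (this is not taken from the test cases) that the server holds the path as native Unicode and that it is then read back through an ANSI code page, with cp1252 picked purely as an example:

```python
# Hypothetical request path containing characters outside cp1252.
raw_bytes = "/\u65e5\u672c".encode("utf-8")   # what the browser sent, %-decoded

# Step 1 (assumed): the server decodes the bytes into a native Unicode string.
as_unicode = raw_bytes.decode("utf-8")

# Step 2 (assumed): the value is read back through an ANSI code page.
# Characters the code page cannot represent are replaced with '?'.
as_ansi = as_unicode.encode("cp1252", errors="replace")

# The original bytes can no longer be recovered from what the application sees.
recoverable = as_ansi.decode("cp1252").encode("utf-8") == raw_bytes
```

After the second step every unrepresentable character has collapsed to the same '?', so no later decode on the application side can restore the path.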

> My understanding of this suggestion is that latin-1 is a way of 
> representing bytes as unicode. In other words, the values will be 
> unicode, but that will simply be a lie.

Yes, that would be a sensible approach, but it is not what is actually 
happening in any WSGI environment I have tested. For example, 
wsgiref.simple_server decodes using UTF-8, not ISO-8859-1 (or would do, 
if it were working). It is currently broken in 3.0rc2; I put a hack in 
to get it running, but I'm not really sure what the current status of 
simple_server in 3.0 is.
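For reference, the "latin-1 as a carrier for bytes" idea can be sketched as follows. This only shows the round trip in isolation, in Python 3 syntax; it assumes a server that really does decode with ISO-8859-1, which is not what the servers tested here do:

```python
# Bytes as they would arrive after %-decoding a UTF-8 URI (example value).
raw_bytes = "/caf\u00e9".encode("utf-8")          # b'/caf\xc3\xa9'

# A server following the latin-1 convention would hand the app this string.
# It is unicode in type only: each code point stands for one original byte.
carrier = raw_bytes.decode("latin-1")

# The application undoes the "lie" and applies the encoding it really wants.
recovered = carrier.encode("latin-1").decode("utf-8")

# ISO-8859-1 maps all 256 byte values, so the round trip never loses data.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

The round trip is only safe if the server used ISO-8859-1. If it decoded with UTF-8 or a platform code page instead, the application's `encode("latin-1")` step either fails or produces the wrong bytes.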

> A lot of what you write about has to do with CGI, which is the only 
> place WSGI interacts with os.environ.  CGI is really an aspect of the 
> CGI to WSGI adapter (like wsgiref.handlers.CGIHandler), and not the WSGI 
> spec itself.

Indeed, but we naturally have to take into account implementability on 
CGI. If a WSGI spec *requires* PATH_INFO to have been obtained using 
8859-1 decoding — or UTF-8, which is the other sensible option given 
that most URIs today are UTF-8 — then there cannot be a fully-compliant 
CGI-to-WSGI wrapper. Perhaps it's not the big issue it was when WSGI was 
first getting off the ground, but IMO it's still important.

> Personally I'm more inclined to set up a policy on the WSGI server 
> itself with respect to the encoding, and then use real unicode 
> characters.

I think we are stuck with Unicode environ at this point, given the CGI 
issue. But applications do need to know about the encoding in use, 
because they will (typically) be generating their own links. So an 
optional way to get that information to the application would be 
advantageous.

I'm now of the opinion that the best way to do this is to standardise 
Apache's ‘REQUEST_URI’ as an optional environ item. This variable holds 
the request URI before URI-decoding, so it contains only %-sequences and 
no real high bytes, and it can be decoded to Unicode using any 
ASCII-compatible charset without worry.

An application wanting to support Unicode URIs (or encoded slashes in 
URIs*) could then sniff for REQUEST_URI and use it in preference to 
PATH_INFO where available. This is a bit more work for the application, 
but it should generally be handled transparently by a library/framework. 
Supporting PATH_INFO in a portable fashion already has warts thanks to 
IIS's bugs, so the situation is not much worse than it already is.
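A minimal sketch of that sniffing, in Python 3 syntax. The function name `request_path_segments` and the UTF-8 default are assumptions for illustration, not part of any spec. It also assumes REQUEST_URI is a path plus optional query string, so the returned segments include the script-name portion, which a real library would strip:

```python
from urllib.parse import unquote_to_bytes

def request_path_segments(environ, charset="utf-8"):
    """Return the request path as a list of decoded segments.

    Hypothetical helper: prefers REQUEST_URI (still %-encoded) and
    falls back to PATH_INFO as supplied by the server.
    """
    raw_uri = environ.get("REQUEST_URI")
    if raw_uri is None:
        # No REQUEST_URI: trust whatever decoding the server applied.
        return environ.get("PATH_INFO", "").split("/")

    # Drop the query string, if any.
    raw_path = raw_uri.split("?", 1)[0]

    # Split on literal slashes *before* unquoting, so an encoded
    # slash (%2F) stays inside its segment.
    return [
        unquote_to_bytes(segment).decode(charset, "replace")
        for segment in raw_path.split("/")
    ]

segments = request_path_segments(
    {"REQUEST_URI": "/app/a%2Fb/caf%C3%A9?x=1", "PATH_INFO": "/a/b/caf?"}
)
fallback = request_path_segments({"PATH_INFO": "/plain/path"})
```

Because the %-sequences are decoded by the application itself, the charset choice sits in one place and no longer depends on what the server or platform did to PATH_INFO.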

And of course we get support through mod_cgi and mod_wsgi automatically, 
so Graham doesn't have to do anything. :-)

Graham Dumpleton wrote:

> I can't really remember what the outcome of the discussion was.

Not too much outcome really, unfortunately! You concluded:

> there possibly still is an open question there on how
> encoding of non ascii characters works in practice. We just need to
> do some actual tests to see what happens and whether there is a problem. 

...to which the answer is — judging by the results posted — probably 
“yes”, I'm afraid!

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/