[Web-SIG] WSGI, Python 3 and Unicode

Andrew Clover and-py at doxdesk.com
Fri Dec 7 23:46:32 CET 2007


James Y Knight wrote:

> In addition, I know of nobody who actually implements RFC 2047
> decoding of http header values...nothing really uses it. (of
> course I don't know of all implementations out there.)

Certainly no browser supports it, which makes the point moot for WSGI. 
Most browsers, when quoting a header parameter, simply encode using the 
previous page's charset and put quotes around it... even if the 
parameter has a quote or control codes in it.

Ian wrote:

> Is this all compatible with os.environ in py3k?

In 3.0a2 os.environ has Unicode strings for both keys and values. This 
is correct for Windows where environment variables are explicitly 
Unicode, but questionable (IMO) for Unix where they're really bytes that 
may or may not represent decodeable Unicode strings.

>> SCRIPT_NAME/PATH_INFO

This already causes problems in Windows CGI applications! Because these 
are passed in environment variables, IIS* has to decode the submitted 
bytes to Unicode first. It seems always to choose UTF-8 for this job, 
which I suppose is the least bad guess, but hardly infallible.

(* - haven't tested this with Apache for Windows yet.)

In Python 2.x, where os.environ holds byte strings, Python/the C library 
then has to encode them back to bytes, which I believe ends up using the 
system codepage. Since the system codepage is never UTF-8 on Windows, 
this means not only that the bytes read back from e.g. PATH_INFO are not 
the same as the original bytes submitted to the web server, but also 
that any submitted characters outside the system codepage are 
unrecoverable.
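
To make the round trip concrete, here is a rough simulation in plain Python. It is a sketch, not a capture of what IIS or the C library actually does: the cp1252 codepage is an assumption, and so is the idea that unmappable characters turn into '?' (a real system might substitute something else).

```python
# Illustrative simulation only; no web server or Windows API is involved.

# Bytes as the browser might submit them: UTF-8 for "/café/日本".
raw_path = "/caf\u00e9/\u65e5\u672c".encode("utf-8")

# Step 1: the server decodes the bytes to Unicode, guessing UTF-8.
as_unicode = raw_path.decode("utf-8")

# Step 2: the value is encoded back to bytes in the system codepage.
# cp1252 is an assumed codepage; errors="replace" is an assumed policy.
system_codepage = "cp1252"
read_back = as_unicode.encode(system_codepage, errors="replace")

print(raw_path)    # the original bytes
print(read_back)   # what the application sees in PATH_INFO

# The bytes differ, and the two CJK characters cannot be recovered.
print(read_back == raw_path)
```

Under those assumptions the é survives as a different byte (0xE9 instead of 0xC3 0xA9) and the characters outside the codepage collapse to placeholders, so there is no way to get back to the submitted bytes.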

If os.environ remains Unicode on Unix and WSGI follows it (as it must if 
CGI-invoked WSGI is to continue working smoothly), webapps that try to 
allow for non-ASCII characters in URLs are likely to get some nasty 
deployment problems that depend on the system encoding setting, 
something that will be particularly troublesome for end-users to debug 
and fix.
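
A small sketch of how that dependence on the system encoding could show up. The helper decode_like_environ is hypothetical and only stands in for whatever decoding the runtime applies to environment variables; the three encodings are just examples of what a deployment might be configured with.

```python
# The same PATH_INFO bytes, decoded under different assumed system encodings.
raw_path_info = "/caf\u00e9".encode("utf-8")   # b'/caf\xc3\xa9'


def decode_like_environ(raw, system_encoding):
    """Hypothetical stand-in for the decoding done on the app's behalf."""
    return raw.decode(system_encoding)


# A UTF-8 system gives the string the user typed.
utf8_view = decode_like_environ(raw_path_info, "utf-8")

# An ISO-8859-1 system silently gives mojibake instead.
latin1_view = decode_like_environ(raw_path_info, "iso-8859-1")

# An ASCII ("C" locale) system cannot decode the value at all.
try:
    decode_like_environ(raw_path_info, "ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(utf8_view), repr(latin1_view), ascii_ok)
```

The application code is identical in all three cases; only the deployment setting changes, which is what makes this kind of failure hard for an end-user to track down.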

OTOH making the dictionaries reflect the underlying OS's conception of 
environment variables means users of os.environ and WSGI will have to be 
able to cope with both bytes and unicode, which would also be a big 
annoyance.
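
For a feel of that annoyance, this is roughly the shim every consumer would end up carrying. The name to_text, the UTF-8 default and the "replace" error policy are all assumptions made for illustration, not a proposal for what WSGI should do.

```python
def to_text(value, encoding="utf-8"):
    """Return text whether the platform handed us bytes or a Unicode string.

    Hypothetical helper: the default encoding and the "replace" policy
    are guesses an application would have to make for itself.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return value


# Bytes-style value, as a Unix-flavoured environ might supply.
print(to_text(b"/caf\xc3\xa9"))

# Unicode-style value, as a Windows-flavoured environ might supply.
print(to_text("/caf\u00e9"))
```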

In summary: urgh, this is all messy and 'orrible.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

