[Web-SIG] WSGI, Python 3 and Unicode
Andrew Clover
and-py at doxdesk.com
Fri Dec 7 23:46:32 CET 2007
James Y Knight wrote:
> In addition, I know of nobody who actually implements RFC 2047
> decoding of http header values...nothing really uses it. (of
> course I don't know of all implementations out there.)
Certainly no browser supports it, which makes the point moot for WSGI.
Most browsers, when quoting a header parameter, simply encode using the
previous page's charset and put quotes around it... even if the
parameter has a quote or control codes in it.
Ian wrote:
> Is this all compatible with os.environ in py3k?
In 3.0a2 os.environ has Unicode strings for both keys and values. This
is correct for Windows where environment variables are explicitly
Unicode, but questionable (IMO) for Unix where they're really bytes that
may or may not represent decodable Unicode strings.
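
To make the Unix concern concrete, here is a minimal sketch. The byte values and the try_decode helper are made up for illustration and do not come from any real environment. The same path is valid or invalid as UTF-8 depending on which encoding the sender happened to use:

```python
# Hypothetical raw values, as a Unix process might receive them in its
# environment block. Both spell "/café", in two different encodings.
raw_values = {
    "utf8": "/caf\u00e9".encode("utf-8"),      # b'/caf\xc3\xa9'
    "latin1": "/caf\u00e9".encode("latin-1"),  # b'/caf\xe9'
}


def try_decode(raw, encoding="utf-8"):
    """Return the decoded string, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decode every value as UTF-8, the way a Unicode-only os.environ would have to.
decoded = {name: try_decode(raw) for name, raw in raw_values.items()}
```

The Latin-1 bytes do not decode as UTF-8 at all, so a Unicode-only os.environ has to either fail or substitute something for them.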
>> SCRIPT_NAME/PATH_INFO
This already causes problems in Windows CGI applications! Because these
are passed in environment variables, IIS* has to decode the submitted
bytes to Unicode first. It seems always to choose UTF-8 for this job,
which I suppose is the least bad guess, but hardly infallible.
(* - haven't tested this with Apache for Windows yet.)
In Python 2.x, where os.environ holds byte strings, Python/the C library
then has to encode them back to bytes, which I believe ends up using the
system codepage. Since the system codepage is never UTF-8 on Windows,
this means not only that the bytes read back from e.g. PATH_INFO are not
the same as the original bytes submitted to the web server, but also
that any submitted characters outside the system codepage are
unrecoverable.
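
A rough sketch of that round trip, under stated assumptions: cp1252 stands in for the system codepage, the round_trip helper is hypothetical, and errors="replace" only approximates whatever substitution Windows really performs.

```python
def round_trip(submitted, system_codepage):
    """Simulate: server decodes the request bytes as UTF-8, Python 2.x re-encodes."""
    # Step 1: the web server decodes the submitted bytes to Unicode
    # so it can store them in an environment variable.
    text = submitted.decode("utf-8")
    # Step 2: reading the variable back as a byte string re-encodes it in
    # the system codepage. "replace" is an assumption standing in for the
    # platform's own handling of unencodable characters.
    return text.encode(system_codepage, errors="replace")


original = "/caf\u00e9".encode("utf-8")             # b'/caf\xc3\xa9'
in_codepage = round_trip(original, "cp1252")        # different bytes, same text
outside_codepage = round_trip("/\u65e5\u672c".encode("utf-8"), "cp1252")
```

In the first case the bytes change but the text could still be worked out if you know the codepage. In the second case the original characters are gone for good.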
If os.environ remains Unicode on Unix and WSGI follows it (as it must if
CGI-invoked WSGI is to continue working smoothly), webapps that try to
allow for non-ASCII characters in URLs are likely to get some nasty
deployment problems that depend on the system encoding setting,
something that will be particularly troublesome for end-users to debug
and fix.
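
As an illustration of the deployment problem, here is a small sketch. The path_seen_by_app helper is hypothetical, and the two encodings are just example system settings. The same request bytes reach the application as different strings depending on how the host is configured:

```python
def path_seen_by_app(raw_path, system_encoding):
    """Decode request bytes the way a Unicode environ would, using the system encoding."""
    return raw_path.decode(system_encoding)


submitted = "/caf\u00e9".encode("utf-8")  # what the browser actually sent

on_utf8_host = path_seen_by_app(submitted, "utf-8")      # the intended path
on_latin1_host = path_seen_by_app(submitted, "latin-1")  # mojibake, no error raised

# The application can only undo the damage if it knows which encoding the
# host used, which is exactly what end-users will struggle to find out.
recovered = on_latin1_host.encode("latin-1").decode("utf-8")
```

Nothing fails loudly on the Latin-1 host. The application just gets a wrong but valid-looking string, which is what makes this hard to debug.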
OTOH making the dictionaries reflect the underlying OS's conception of
environment variables means users of os.environ and WSGI will have to be
able to cope with both bytes and unicode, which would also be a big
annoyance.
In summary: urgh, this is all messy and 'orrible.
--
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/