[Web-SIG] Python 3.0 and WSGI 1.0.

Graham Dumpleton graham.dumpleton at gmail.com
Thu Apr 2 13:33:07 CEST 2009

2009/4/2 Graham Dumpleton <graham.dumpleton at gmail.com>:
> Is there going to be any simple answer to all of this? :-(

I am slowly working through what I think I at least need to do for
Apache/mod_wsgi. I'll give a summary of what I have worked out so far
based on the discussions and my own research.

Just so I have a list of things to check off, I include an example
WSGI environment from a request and make comments about each category
of things from it.

First off is CGI HTTP variables.

HTTP_ACCEPT: 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'
HTTP_ACCEPT_ENCODING: 'gzip, deflate'
HTTP_CONNECTION: 'keep-alive'
HTTP_HOST: 'home.dscpl.com.au'
HTTP_USER_AGENT: 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1

The rule here from WSGI 1.0 amendments page in relation to Python 3.0 is:

"""When running under Python 3, servers MUST provide CGI HTTP
variables as strings, decoded from the headers using HTTP standard
encodings (i.e. latin-1 + RFC 2047)"""

Which is fair enough and basically what the RFCs say. At the moment I
don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just
need to do that.
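A minimal sketch of that rule in Python 3, assuming the raw header
value arrives as bytes. The name decode_http_header is hypothetical
(it is not part of mod_wsgi or the WSGI specification); the RFC 2047
handling is borrowed from email.header.decode_header in the standard
library, which may or may not be what a server would want to use.

```python
from email.header import decode_header


def decode_http_header(raw):
    """Decode a raw header value: latin-1 first, then RFC 2047 words."""
    # latin-1 maps every byte to a code point, so this step cannot fail.
    text = raw.decode("latin-1")
    pieces = []
    for value, charset in decode_header(text):
        if isinstance(value, bytes):
            # An encoded word, or plain text that sits next to one.
            pieces.append(value.decode(charset or "latin-1", errors="replace"))
        else:
            # No encoded words at all: the str comes back unchanged.
            pieces.append(value)
    return "".join(pieces)
```

A header with no encoded words, such as b'gzip, deflate', passes
through unchanged. An encoded word naming a charset that Python does
not know would raise LookupError here, so a real server would need to
decide how to handle that case.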

An interesting one here to note is HTTP_HOST. The issue with this one
is what would happen for a unicode host name. For Apache an IDNA
(RFC3490) encoded host name has to be used to identify a site with
unicode host name. That is, one uses the IDNA name for ServerName or
ServerAlias directives.

When one gets a request one would actually see the IDNA name for
HTTP_HOST and that only uses latin-1 characters. For example:

  HTTP_HOST: 'xn--wgbe9chb01aytce.com'

These resolve in DNS okay:

  $ nslookup xn--wgbe9chb01aytce.com

  Non-authoritative answer:
  Name:	xn--wgbe9chb01aytce.com

Using HTTP live headers on Firefox can also confirm that that is what
would be sent:

  Host: xn--wgbe9chb01aytce.com

My understanding is that if an actual unicode string is given to a
browser, it should translate it to the IDNA name before use.
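As a rough illustration of that translation, Python's built-in 'idna'
codec (which implements RFC 3490) converts between the two forms. The
helper names and the host name 'bücher.example' are hypothetical and
used only for the example.

```python
def host_to_wire(host):
    """Return the IDNA (ASCII-only) form of a host name, as bytes."""
    return host.encode("idna")


def host_from_wire(raw):
    """Return the unicode form of an IDNA-encoded host name."""
    return raw.decode("idna")


wire = host_to_wire("bücher.example")
# The IDNA form is pure ASCII, so decoding it as latin-1 is always safe.
http_host = wire.decode("latin-1")
```

Since the IDNA form only ever contains ASCII, a server that decodes
HTTP_HOST as latin-1 loses nothing, and an application can apply the
reverse conversion if it wants the unicode name.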

Next HTTP header to worry about is HTTP_REFERER.

There would be two parts to this: the host name component and then
the path component.

We already know from above that for unicode host name it should be the
IDNA name.

For the path component, if the client follows the rules properly and
the path uses a non latin-1 encoding, then it should be using RFC
2047 to indicate this, so we shouldn't have to do anything different
and can use the same rule as for other HTTP headers. For this header
we are actually in a better situation than for the URL in the actual
HTTP request line, which isn't so specific about encodings.


Standard stuff which is always going to be latin-1, so encode as that.

REMOTE_PORT: '51378'

Again, latin-1 is okay.

SERVER_SOFTWARE: 'Apache/2.2.9 (Unix) mod_ssl/2.2.9 OpenSSL/0.9.7l
DAV/2 mod_wsgi/3.0-TRUNK Python/2.5.1'

Again, latin-1 is okay as Apache modules internally can only supply
normal C strings to add stuff to this.

SERVER_NAME: 'home.dscpl.com.au'

Same as HTTP_HOST: a unicode host name would be IDNA encoded, so
can use latin-1 okay.

SERVER_ADMIN: 'you at example.com'

This is set by the ServerAdmin directive. Because the Apache
configuration is effectively latin-1, one probably can't even define a
non latin-1 email address. The host part is probably IDNA encoded
anyway, so the latin-1 restriction is perhaps only pertinent to the
user part of the email address. So, latin-1 should be okay.


Depending on Apache configuration can be server name and version
information or server admin email address. All latin-1.
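For the variables that are always latin-1, the conversion is a plain
decode of each value. The sketch below assumes the server holds the
raw values as bytes in a dict; decode_cgi_variables is a hypothetical
helper name, not something from mod_wsgi.

```python
def decode_cgi_variables(raw_vars):
    """Decode every bytes value in a mapping as latin-1."""
    return {name: value.decode("latin-1") for name, value in raw_vars.items()}


raw = {
    "REMOTE_PORT": b"51378",
    "SERVER_NAME": b"home.dscpl.com.au",
}
environ = decode_cgi_variables(raw)
```

One useful property of latin-1 here is that the decode is lossless:
an application can call .encode('latin-1') on any such value and get
the original bytes back.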

DOCUMENT_ROOT: '/Library/WebServer/Documents'
SCRIPT_FILENAME: '/Users/grahamd/Sites/echo.wsgi'

These are file system paths, and since the Apache Runtime Library used
for Apache 2.X has a define for whether file system supports unicode,
can say:

#if APR_HAS_UNICODE_FS
        charset = "UTF-8";
#else
        charset = "ISO-8859-1";
#endif

For Apache 1.3, which doesn't have that define AFAIK, might just have
to assume latin-1, but possibly another way of doing it, or Apache 1.3
might have its own define for it.
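On the Python side, the same choice could look roughly like the
following. This is only a sketch: decode_fs_path and its unicode_fs
flag are hypothetical stand-ins for the APR_HAS_UNICODE_FS check, and
the use of the 'surrogateescape' error handler (so that bytes which
are not valid UTF-8 still round-trip) is an assumption on my part, not
something the amendments page requires.

```python
def decode_fs_path(raw, unicode_fs):
    """Decode a file system path using the charset the server would pick."""
    if unicode_fs:
        # Undecodable bytes are kept as lone surrogates rather than raising.
        return raw.decode("UTF-8", errors="surrogateescape")
    # latin-1 accepts every byte value, so no error handler is needed.
    return raw.decode("ISO-8859-1")


document_root = decode_fs_path(b"/Library/WebServer/Documents", True)
```

With the surrogateescape handler, encoding the result again with the
same charset and handler gives back the original bytes even when the
path was not valid UTF-8.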

PATH: '/usr/bin:/bin:/usr/sbin:/sbin'

Presume I can use APR_HAS_UNICODE_FS check again even though it is a
combination of paths.


Presume they will always use latin-1 for these.

All that is now left is the following, which we have already been discussing.

REQUEST_URI: '/~grahamd/echo.wsgi'
SCRIPT_NAME: '/~grahamd/echo.wsgi'

At least I am happy that, except for these four, there shouldn't
be any issues.

I'll keep watching what others come up with in respect of these and
see what consensus develops. :-)

