[Web-SIG] Python 3.0 and WSGI 1.0.
tseaver at palladion.com
Thu Apr 2 19:36:53 CEST 2009
Graham Dumpleton wrote:
> 2009/4/2 Graham Dumpleton <graham.dumpleton at gmail.com>:
>> Is there going to be any simple answer to all of this? :-(
> I am slowly working through what I think I at least need to do for
> Apache/mod_wsgi. I'll give a summary of what I have worked out so far
> based on the discussions and my own research.
> Just so I have a list of things to check off, I include an example
> WSGI environment from a request and make comments about each category
> of things from it.
> First off is CGI HTTP variables.
> HTTP_ACCEPT: 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'
> HTTP_ACCEPT_ENCODING: 'gzip, deflate'
> HTTP_ACCEPT_LANGUAGE: 'en-us'
> HTTP_CONNECTION: 'keep-alive'
> HTTP_HOST: 'home.dscpl.com.au'
> HTTP_USER_AGENT: 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
> en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1
> The rule here from the WSGI 1.0 amendments page in relation to Python 3.0 is:
> """When running under Python 3, servers MUST provide CGI HTTP
> variables as strings, decoded from the headers using HTTP standard
> encodings (i.e. latin-1 + RFC 2047)"""
> Which is fair enough and basically what the RFCs say. At the moment I
> don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just
> need to do that.
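A minimal sketch of that rule in Python 3, assuming the server hands over the raw header value as bytes. The helper name decode_http_header is hypothetical, and joining the decoded chunks with no separator is a simplification: whitespace between adjacent encoded words may need more care in a real server.

```python
from email.header import decode_header


def decode_http_header(raw: bytes) -> str:
    """Decode a raw header value as latin-1, then expand RFC 2047 words.

    Hypothetical helper for illustration; not part of WSGI or mod_wsgi.
    """
    # latin-1 maps every byte to a code point, so this step cannot fail.
    text = raw.decode("latin-1")
    parts = []
    for chunk, charset in decode_header(text):
        # decode_header returns str when no encoded word is present and
        # bytes (plus an optional charset name) when one is.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "latin-1")
        parts.append(chunk)
    return "".join(parts)


plain = decode_http_header(b"gzip, deflate")
latin1 = decode_http_header(b"caf\xe9")
encoded_word = decode_http_header(b"=?utf-8?q?caf=C3=A9?=")
```

An encoded word naming a charset that Python doesn't know would raise LookupError here, so a server would have to decide how to fall back.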
> An interesting one here to note is HTTP_HOST. The issue with this one
> is what would happen for a unicode host name. For Apache an IDNA
> (RFC3490) encoded host name has to be used to identify a site with
> unicode host name. That is, one uses the IDNA name for ServerName or
> ServerAlias directives.
> When one gets a request one would actually see the IDNA name for
> HTTP_HOST and that only uses latin-1 characters. For example:
> HTTP_HOST: 'xn--wgbe9chb01aytce.com'
> These resolve in DNS okay:
> $ nslookup xn--wgbe9chb01aytce.com
> Server: 192.168.1.254
> Address: 192.168.1.254#53
> Non-authoritative answer:
> Name: xn--wgbe9chb01aytce.com
> Address: 220.127.116.11
> Using HTTP live headers on Firefox can also confirm that that is what
> would be sent:
> Host: xn--wgbe9chb01aytce.com
> My understanding is that if an actual unicode string is given to a
> browser, it should translate it to the IDNA name before use.
That is what the RFCs require. Besides, un-encoded unicode can't be
written onto a socket.
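To illustrate the translation, here is a small sketch using Python's built-in "idna" codec (which implements RFC 3490). The host name is a made-up example and the function name to_wire_host is hypothetical.

```python
def to_wire_host(host: str) -> bytes:
    """Return the ASCII (IDNA) form of a host name, as sent in Host:."""
    return host.encode("idna")


unicode_host = "b\u00fccher.example"   # "bücher.example", illustrative only
wire_host = to_wire_host(unicode_host)

# The IDNA form is plain ASCII, so decoding it as latin-1 is always safe.
environ_value = wire_host.decode("latin-1")

# Going the other way recovers the unicode name for display purposes.
display_host = wire_host.decode("idna")
```

So whatever the user typed, the value that ends up in HTTP_HOST only ever contains ASCII characters.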
> Next HTTP header to worry about is HTTP_REFERER.
> There would be two parts to this, there would be the host name
> component and then the path component.
> We already know from above that for unicode host name it should be the
> IDNA name.
> For the path component, if the client follows the rules properly and
> the path uses a non latin-1 encoding, it should be using RFC 2047 to
> indicate this, so we shouldn't have to do anything different and can
> use the same rule as for other HTTP headers. For this header we are
> actually in a better situation than for the URL in the actual HTTP
> request line, which isn't so specific about encodings.
> GATEWAY_INTERFACE: 'CGI/1.1'
> SERVER_PROTOCOL: 'HTTP/1.1'
> Standard stuff which is always going to be latin-1, so encode as that.
I think you mean 'decode' here? Unicode strings are encoded to get
bytes; bytes are decoded to get unicode strings.
Also, I don't know of any reason why those values could be anything but ASCII.
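In Python 3 terms the direction looks like this. It is only a sketch; the variable names are illustrative and the byte string stands in for whatever the server's C layer provides.

```python
# Bytes as they would arrive from the server's C layer.
raw_protocol = b"HTTP/1.1"

# bytes -> str is decoding.
protocol = raw_protocol.decode("latin-1")

# For pure-ASCII input, ASCII and latin-1 decoding give the same result.
same_as_ascii = protocol == raw_protocol.decode("ascii")

# str -> bytes is encoding; it reverses the step above.
round_tripped = protocol.encode("latin-1")

# latin-1 maps every byte 0-255 to a code point, so decoding never fails
# and the round trip is lossless.
all_bytes = bytes(range(256))
lossless = all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```

For values that are guaranteed ASCII, decoding as latin-1 is therefore harmless; it just never gets exercised beyond the first 128 code points.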
> REMOTE_ADDR: '192.168.1.5'
> REMOTE_PORT: '51378'
> SERVER_PORT: '80'
> SERVER_ADDR: '192.168.1.5'
> Again, latin-1 is okay.
Likewise, these can't be anything but ASCII.
> SERVER_SOFTWARE: 'Apache/2.2.9 (Unix) mod_ssl/2.2.9 OpenSSL/0.9.7l
> DAV/2 mod_wsgi/3.0-TRUNK Python/2.5.1'
> Again, latin-1 is okay as Apache modules internally can only supply
> normal C strings to add stuff to this.
> SERVER_NAME: 'home.dscpl.com.au'
> Same as HTTP_HOST and if a unicode host name would be IDNA encoded, so
> can use latin-1 okay.
> SERVER_ADMIN: 'you at example.com'
> This is set by ServerAdmin directive. Because in Apache configuration
> is effectively latin-1, probably can't even define a non latin-1 email
> address. For host part, probably IDNA encoded anyway, so restriction
> on latin-1 only perhaps pertinent to user part of email address. So,
> latin-1 should be okay.
> SERVER_SIGNATURE: ''
> Depending on Apache configuration can be server name and version
> information or server admin email address. All latin-1.
> DOCUMENT_ROOT: '/Library/WebServer/Documents'
> SCRIPT_FILENAME: '/Users/grahamd/Sites/echo.wsgi'
> These are file system paths, and since the Apache Runtime Library used
> for Apache 2.X has a define for whether file system supports unicode,
> can say:
> #if APR_HAS_UNICODE_FS
> charset = "UTF-8";
> #else
> charset = "ISO-8859-1";
> #endif
I'm not sure that works for arbitrary filesystem configurations: some
parts of the tree may be mounted from locations with different
encodings. See David Wheeler's analysis for more.
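A quick sketch of why a single compile-time charset is fragile. The paths are invented for illustration; the point is only that the same name stored under two encodings gives different bytes, and a wrong guess either fails or silently produces garbage.

```python
# The same file name as stored by two differently configured mounts.
utf8_name = "/srv/www/caf\u00e9".encode("utf-8")      # b'/srv/www/caf\xc3\xa9'
latin1_name = "/srv/www/caf\u00e9".encode("latin-1")  # b'/srv/www/caf\xe9'

# Guessing latin-1 for UTF-8 bytes never raises, but yields mojibake.
mojibake = utf8_name.decode("latin-1")


def decodes_as_utf8(raw: bytes) -> bool:
    """Report whether raw is valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Guessing UTF-8 for latin-1 bytes fails outright.
utf8_ok = decodes_as_utf8(utf8_name)
latin1_ok = decodes_as_utf8(latin1_name)
```

Either way, the server cannot tell from the bytes alone which mount a given path came from.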
> For Apache 1.3, which doesn't have that define AFAIK, might just have
> to assume latin-1, but possibly another way of doing it, or Apache 1.3
> might have its own define for it.
> PATH: '/usr/bin:/bin:/usr/sbin:/sbin'
> Presume I can use APR_HAS_UNICODE_FS check again even though it is a
> combination of paths.
> REQUEST_METHOD: 'GET'
> Presume they will always use latin-1 for these.
RFC 2616, section 5.1.1 defines only ASCII methods; extension methods
are 'tokens', which must also be printable ASCII w/o separators.
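As a rough check of that grammar, here is a sketch of the RFC 2616 section 2.2 'token' rule in Python. The function name is_token is hypothetical; the separator set is copied from the RFC.

```python
# Separators from RFC 2616, section 2.2 (including SP and HT).
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(value: str) -> bool:
    """True if value is a non-empty run of printable ASCII non-separators."""
    return bool(value) and all(
        # 33..126 excludes control characters, space, DEL and non-ASCII.
        32 < ord(ch) < 127 and ch not in SEPARATORS
        for ch in value
    )


standard = is_token("GET")
extension = is_token("M-SEARCH")
with_space = is_token("GE T")
non_ascii = is_token("G\u00c9T")
empty = is_token("")
```

Any method that passes this check is pure ASCII, so decoding REQUEST_METHOD as latin-1 can't go wrong.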
> All that is now left is the following, which we have already been discussing.
> REQUEST_URI: '/~grahamd/echo.wsgi'
> SCRIPT_NAME: '/~grahamd/echo.wsgi'
> PATH_INFO: ''
> QUERY_STRING: ''
> At least I am happy that, except for these four, there shouldn't
> be any issues.
> I'll keep watching what others come up with in respect of these and
> see what consensus develops. :-)
Tres Seaver +1 540-429-0999 tseaver at palladion.com
Palladion Software "Excellence by Design" http://palladion.com