[Web-SIG] Python 3.0 and WSGI 1.0.

Tres Seaver tseaver at palladion.com
Thu Apr 2 19:36:53 CEST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Graham Dumpleton wrote:
> 2009/4/2 Graham Dumpleton <graham.dumpleton at gmail.com>:
>> Is there going to be any simple answer to all of this? :-(
> 
> I am slowly working through what I think I at least need to do for
> Apache/mod_wsgi. I'll give a summary of what I have worked out so far
> based on the discussions and my own research.
> 
> Just so I have a list of things to check off, I include an example
> WSGI environment from a request and make comments about each category
> of things from it.
> 
> First off is CGI HTTP variables.
> 
> HTTP_ACCEPT: 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'
> HTTP_ACCEPT_ENCODING: 'gzip, deflate'
> HTTP_ACCEPT_LANGUAGE: 'en-us'
> HTTP_CONNECTION: 'keep-alive'
> HTTP_HOST: 'home.dscpl.com.au'
> HTTP_USER_AGENT: 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6;
> en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1
> Safari/525.27.1'
> 
> The rule here from WSGI 1.0 amendments page in relation to Python 3.0 is:
> 
> """When running under Python 3, servers MUST provide CGI HTTP
> variables as strings, decoded from the headers using HTTP standard
> encodings (i.e. latin-1 + RFC 2047)"""
> 
> Which is fair enough and basically what the RFCs say. At the moment I
> don't apply RFC 2047 rules in Python 3.0 support in mod_wsgi, so just
> need to do that.
> 
> An interesting one here to note is HTTP_HOST. The issue with this one
> is what would happen for a unicode host name. For Apache an IDNA
> (RFC3490) encoded host name has to be used to identify a site with
> unicode host name. That is, one uses the IDNA name for ServerName or
> ServerAlias directives.
> 
> When one gets a request one would actually see the IDNA name for
> HTTP_HOST and that only uses latin-1 characters. For example:
> 
>   HTTP_HOST: 'xn--wgbe9chb01aytce.com'
> 
> These resolve in DNS okay:
> 
>   $ nslookup xn--wgbe9chb01aytce.com
>   Server:		192.168.1.254
>   Address:	192.168.1.254#53
> 
>   Non-authoritative answer:
>   Name:	xn--wgbe9chb01aytce.com
>   Address: 208.78.242.184
> 
> Using HTTP live headers on Firefox can also confirm that that is what
> would be sent:
> 
>   Host: xn--wgbe9chb01aytce.com
> 
> My understanding is that if an actual unicode string is given to a
> browser, it should translate it to the IDNA name before use.

That is what the RFCs require;  besides, un-encoded unicode can't be
written onto a socket.
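
As a rough sketch of that translation (assuming Python 3's built-in
'idna' codec, which implements RFC 3490;  the host name is made up
for illustration and is not the one from your example):

```python
# Hypothetical unicode host name ("bücher.example"), for illustration only.
unicode_host = "b\u00fccher.example"

# What a client has to put on the wire: the ASCII-compatible IDNA form.
wire_host = unicode_host.encode("idna")
print(wire_host)

# The IDNA form is pure ASCII, so a latin-1 decode of the Host header
# loses nothing when the server builds HTTP_HOST.
environ_host = wire_host.decode("latin-1")
print(environ_host)

# Getting the unicode name back is a separate, explicit step.
display_host = wire_host.decode("idna")
print(display_host)
```

The server never needs the last step to route the request;  it only
matters to an application that wants to display the name.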

> Next HTTP header to worry about is HTTP_REFERER.
> 
> There would be two parts to this, there would be the host name
> component and then the path component.
> 
> We already know from above that for unicode host name it should be the
> IDNA name.
> 
> For the path component, if the client follows the rules properly, then
> if the path uses a non latin-1 encoding, then it should be using RFC
> 2047 to indicate this so shouldn't have to do anything different and
> use same rule as other HTTP headers. For this header we are actually
> in a better situation than for the URL in the actual HTTP request line,
> which isn't so specific about encodings.
> 
> GATEWAY_INTERFACE: 'CGI/1.1'
> SERVER_PROTOCOL: 'HTTP/1.1'
> 
> Standard stuff which is always going to be latin-1, so encode as that.

I think you mean 'decode' here?  Unicode strings are encoded to get
bytes;  bytes are decoded to get unicode strings.

Also, I don't know of any reason why those values would be anything
but ASCII.
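
To make the direction concrete, here is a small sketch in Python 3
(the variable names are mine, purely for illustration):

```python
raw = b"HTTP/1.1"                 # bytes, as the server holds them
text = raw.decode("latin-1")      # bytes -> str is *decoding*
again = text.encode("latin-1")    # str -> bytes is *encoding*
print(text, again == raw)

# For pure-ASCII values the choice of codec makes no difference:
# all three decodes produce the same string.
decoded = {raw.decode(codec) for codec in ("ascii", "latin-1", "utf-8")}
print(decoded)
```

So for values like these, "latin-1" is safe simply because the bytes
never leave the ASCII range.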

> REMOTE_ADDR: '192.168.1.5'
> REMOTE_PORT: '51378'
> SERVER_PORT: '80'
> SERVER_ADDR: '192.168.1.5'
> 
> Again, latin-1 is okay.

Likewise, these can't be anything but ASCII.

> SERVER_SOFTWARE: 'Apache/2.2.9 (Unix) mod_ssl/2.2.9 OpenSSL/0.9.7l
> DAV/2 mod_wsgi/3.0-TRUNK Python/2.5.1'
> 
> Again, latin-1 is okay as Apache modules internally can only supply
> normal C strings to add stuff to this.
> 
> SERVER_NAME: 'home.dscpl.com.au'
> 
> Same as HTTP_HOST and if a unicode host name would be IDNA encoded, so
> can use latin-1 okay.
> 
> SERVER_ADMIN: 'you at example.com'
> 
> This is set by ServerAdmin directive. Because in Apache configuration
> is effectively latin-1, probably can't even define a non latin-1 email
> address. For host part, probably IDNA encoded anyway, so restriction
> on latin-1 only perhaps pertinent to user part of email address. So,
> latin-1 should be okay.
> 
> SERVER_SIGNATURE: ''
> 
> Depending on Apache configuration can be server name and version
> information or server admin email address. All latin-1.
> 
> DOCUMENT_ROOT: '/Library/WebServer/Documents'
> SCRIPT_FILENAME: '/Users/grahamd/Sites/echo.wsgi'
> 
> These are file system paths, and since the Apache Runtime Library used
> for Apache 2.X has a define for whether file system supports unicode,
> can say:
> 
>   #if APR_HAS_UNICODE_FS
>         charset = "UTF-8";
>   #else
>         charset = "ISO-8859-1";
>   #endif

I'm not sure that works for arbitrary filesystem configurations:  some
parts of the tree may be mounted from locations with different
encodings.  See David Wheeler's analysis for more:

 http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
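
A minimal sketch of the failure mode, and of one possible way to cope
with it.  This assumes a Python 3 that has the 'surrogateescape' error
handler from PEP 383 (Python 3.1 and later);  the path is hypothetical,
and this is not something mod_wsgi is known to do:

```python
# A latin-1 encoded file name living on a nominally "UTF-8" system.
raw_path = b"/srv/www/caf\xe9.txt"

# A strict decode with the single assumed charset fails outright.
try:
    raw_path.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)

# surrogateescape smuggles each undecodable byte through as a lone
# surrogate code point, so the original bytes can be recovered exactly.
text_path = raw_path.decode("utf-8", "surrogateescape")
round_tripped = text_path.encode("utf-8", "surrogateescape")
print(round_tripped == raw_path)
```

The resulting string is not printable as-is, but it does round-trip,
which a compile-time charset choice cannot guarantee.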

> For Apache 1.3, which doesn't have that define AFAIK, might just have
> to assume latin-1, but possibly another way of doing it, or Apache 1.3
> might have its own define for it.
> 
> PATH: '/usr/bin:/bin:/usr/sbin:/sbin'
> 
> Presume I can use APR_HAS_UNICODE_FS check again even though it is a
> combination of paths.
> 
> REQUEST_METHOD: 'GET'
> 
> Presume they will always use latin-1 for these.

RFC 2616, section 5.1.1 defines only ASCII methods;  extension methods
are 'tokens', which must also be printable ASCII w/o separators
(section 2.2).
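
A sketch of that 'token' rule as a check, following my reading of the
section 2.2 grammar (the function name 'is_token' is hypothetical):

```python
# The separators listed in RFC 2616, section 2.2, including SP and HT.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(value):
    """Return True if value is a non-empty RFC 2616 'token'."""
    if not value:
        return False
    for ch in value:
        code = ord(ch)
        # Reject control characters (0-31 and 127) and anything non-ASCII.
        if code <= 31 or code >= 127:
            return False
        if ch in SEPARATORS:
            return False
    return True


for method in ("GET", "M-SEARCH", "GE T", "G\u00c9T", ""):
    print(repr(method), is_token(method))
```

Any method name that passes this check decodes identically as ASCII
or latin-1.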

> All that is now left is the following, which we have already been discussing.
> 
> REQUEST_URI: '/~grahamd/echo.wsgi'
> SCRIPT_NAME: '/~grahamd/echo.wsgi'
> PATH_INFO: ''
> QUERY_STRING: ''
> 
> At least I am happy that except for these four, that there shouldn't
> be any issues.
> 
> I'll keep watching what others come up with in respect of these and
> see what consensus develops. :-)


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFJ1Pe1+gerLs4ltQ4RArt6AJ9GMmvjQd6LfH4MSC1yzNUTO6r51ACg3Ocl
3bOgMrQUlFy+ZSehv8gsSLM=
=r4vt
-----END PGP SIGNATURE-----


