[Web-SIG] WSGI Amendments thoughts: the horror of charsets
and-py at doxdesk.com
Mon Nov 17 18:54:24 CET 2008
Mark Hammond wrote:
> I don't think Python explicitly converts it - the CRT's ANSI version
> of environ is used
Yes, it would be the CRT on Python 2.x. (Python 3.0 on non-NT does a
conversion always using UTF-8, if I'm reading convertenviron right.)
> so the resulting strings should be encoded using the 'mbcs' encoding.
> What mangling do you see?
Correct, it's characters unencodable in mbcs that are lost*. mbcs is
never equivalent to UTF-8 (which would allow us to recover characters on
IIS) or ISO-8859 (which would allow us to recover characters on
Apache-for-Windows) so there's always heavy lossage.
(* - replaced with ? or Windows's attempt to substitute something that
looks vaguely like the original character.)
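
To illustrate the kind of loss described above, here is a minimal sketch. It assumes cp1252 as a stand-in for the system ANSI code page (the real 'mbcs' codec only exists on Windows), and the sample path is made up. Python's codec only does the '?' replacement; it does not reproduce Windows's best-fit substitution.

```python
# Hypothetical request path containing characters outside cp1252.
path = u"/caf\u00e9/\u65e5\u672c"

# cp1252 stands in for the Windows ANSI code page ("mbcs") here.
# Characters it cannot represent are replaced with '?' and are gone for good.
lossy = path.encode("cp1252", "replace")

# The e-acute survives (it exists in cp1252); the two CJK characters do not.
roundtrip = lossy.decode("cp1252")
```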
> win32api and ctypes would both let you call the Windows API.
Ah! I had considered the win32 extensions but it's a bit of a
dependency... I'd forgotten that we get ctypes for free in 2.5.
So we'd be looking at:
when CPython 2.5+/NT is detected, right? That increases the number of
situations in which we can feasibly recover URIs that are valid UTF-8
sequences (modulo the slash anyway). Doing the actual recovery still
requires some server-sniffing though.
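
A rough sketch of what reading the environment through the wide-character Windows API with ctypes might look like. This is an assumption about the general approach, not the snippet from the original message; the function name get_environ_unicode is hypothetical, and on anything other than NT it simply returns None.

```python
import ctypes
import os


def get_environ_unicode(name):
    """Return an environment variable as Unicode text via the Win32 API.

    Returns None when not running on Windows NT or when the variable
    is unset (an empty value is also treated as unset in this sketch).
    """
    if os.name != "nt":
        return None  # the wide-character API is only available on Windows

    kernel32 = ctypes.windll.kernel32

    # First call with no buffer: returns the required size in characters,
    # including the terminating NUL, or 0 if the variable is not found.
    size = kernel32.GetEnvironmentVariableW(name, None, 0)
    if size == 0:
        return None

    # Second call fills the buffer with the UTF-16 value, bypassing the
    # CRT's ANSI (mbcs) copy of the environment.
    buf = ctypes.create_unicode_buffer(size)
    kernel32.GetEnvironmentVariableW(name, buf, size)
    return buf.value
```

Something like get_environ_unicode(u"PATH_INFO") would then be tried first, falling back to os.environ when it returns None.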
> What is IIS doing wrong here?
It's not wrong as such. There are three reasonable choices for decoding
header values before putting them in a Unicode environment, and the CGI
spec, as it knows nothing about Unicode environment variables, fails to
specify which one to use:
1. ISO-8859-1 (which ensures bytes can be recovered)
2. UTF-8 (since most URIs are effectively UTF-8 today)
3. Configured system codepage (mbcs)
Apache [with mod_cgi or mod_wsgi] decides on (1). IIS tries for (2),
falling back to (3) on invalid sequences. The text concerning Python 3.0
in the WSGI Amendments page could be read as blessing Apache's behaviour.
However wsgiref.simple_server currently also goes for (2), although that
probably can't be considered canonical. I'd be interested to know what
other WSGI servers do.
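
As a small illustration of why choice (1) keeps the bytes recoverable: every byte maps to exactly one ISO-8859-1 character, so an application can re-encode and try UTF-8 itself. This is only a sketch; the helper name recover_utf8 is hypothetical, and it assumes the server really did decode with ISO-8859-1.

```python
def recover_utf8(value):
    """Undo an ISO-8859-1 decode and reinterpret the raw bytes as UTF-8.

    Falls back to the value as given if it is not Latin-1 encodable or
    the underlying bytes are not a valid UTF-8 sequence.
    """
    try:
        raw = value.encode("iso-8859-1")  # lossless: recovers original bytes
        return raw.decode("utf-8")
    except UnicodeError:
        return value


# Simulate a server that received UTF-8 bytes and decoded them as choice (1).
original = u"/caf\u00e9"
as_seen_by_app = original.encode("utf-8").decode("iso-8859-1")
recovered = recover_utf8(as_seen_by_app)
```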
mailto:and at doxdesk.com