[Web-SIG] CGI WSGI and Unicode
and-py at doxdesk.com
Tue Dec 8 16:27:41 CET 2009
Manlio Perillo wrote:
> In a CGI application, HTTP headers are Unicode strings, and are decoded
> using system default encoding.
> In a future WSGI application, HTTP headers are Unicode strings, and are
> decoded using latin-1 encoding.
Yes. As proposed, WSGI 1.1 would require a CGI-to-WSGI handler to undo the
decoding step that happens when environ is read using the system default
encoding. At least that is now reliably possible thanks to surrogateescape.
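As a minimal sketch (not from the original post) of what that undo step could look like in Python 3: the hypothetical helper below assumes the environ value was decoded with the filesystem encoding and the surrogateescape error handler, re-encodes it to recover the original bytes, and re-decodes as Latin-1 as WSGI 1.1 proposes.

```python
import sys

def fix_environ_value(value):
    # Recover the raw bytes that the OS-level decode produced; the
    # surrogateescape handler guarantees this round-trip is lossless.
    raw = value.encode(sys.getfilesystemencoding(), 'surrogateescape')
    # Re-decode as Latin-1, the encoding WSGI 1.1 proposes for environ.
    return raw.decode('latin-1')
```

A real CGI adapter would apply something like this to each relevant environ key (PATH_INFO, SCRIPT_NAME, and so on) before handing environ to the application.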
PATH_INFO is the only HTTP-related environment variable where Unicode really
matters. Potentially SCRIPT_NAME could also be significant in relation to
PATH_INFO. The HTTP headers matter much less, because in practice they almost
never contain non-ASCII characters.
Previously the job of undoing an unwanted decode step was left to whatever
code read PATH_INFO, usually a routing component, which had to guess the
original encoding, typically with poor results. The CGI adapter is in a much
better position to do it, being closer to the server.
> The problem is that not all browsers use latin-1.
Not WSGI's problem. WSGI will deliver the raw bytes decoded into Unicode
strings via Latin-1, not ready-to-use Unicode strings. It is up to the
application to decide how to handle those bytes: maybe it wants Latin-1 and
need do nothing, maybe it wants to recode to UTF-8, maybe something else
entirely. No single solution satisfies every application, so there is always
going to be a recode step somewhere.
An application that doesn't want to think about this will use a
framework that does it for them.
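For an application (or framework) that expects UTF-8, that recode step is a sketch like the following, assuming the bytes-in-Latin-1 convention above. The fallback behaviour here is an illustrative choice, not something the post prescribes.

```python
def recode_path(path):
    # 'path' is a Latin-1-decoded environ string carrying raw bytes.
    raw = path.encode('latin-1')  # recover the original bytes losslessly
    try:
        # This application wants UTF-8, so try that first.
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the Latin-1 reading as-is.
        return path
```

A framework would typically hide exactly this dance behind its request object, which is why an application that doesn't want to think about it can simply rely on the framework.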
> What about HTTP_COOKIE?
For what it's worth, the choice of Latin-1 here results in the 'right'
Unicode string for more browsers than any other potential encoding.
In any case, as previously discussed, non-ASCII cookies are already
thoroughly broken everywhere and hence used by no one.