[Python-Dev] PEP 3333: wsgi_string() function

Fri Jan 7 12:51:01 CET 2011

On Thursday, 6 January 2011 at 23:50 +0000, And Clover wrote:
> On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:
> > What is this horrible encoding "bytes-as-unicode"?
> 
> It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1
> is the encoding specified by the HTTP RFC, as well as having the happy
> property of preserving every input byte. PEP 3333 requires it.

ISO-8859-1 for all fields: SERVER_NAME, PATH_INFO, the URL, form
data, ...?

> > os.environ is supposed to be correctly decoded and contain valid
> > unicode characters.
> 
> It is not possible to ‘correctly’ decode to unicode for os.environ
> because that decoding happens long before the web application (the
> only party that knows what encoding should be in use) gets a look in.

Agreed.

> Maybe the web application is using UTF-8, maybe it's using cp1252,
> but if we let the server/gateway decide and do that decoding (...)
> It's an absolutely necessary idea. The locale encoding is nothing 
> to do with the web application's encoding. (...)

OK, so you must pass byte strings to the server/gateway. If you pass
unicode, how does the server/gateway know that it has to redecode a value?
Should it redecode all values? Anyway, it is stupid to use a temporary,
useless pseudo-encoding (bytes-in-unicode).

> The recoding dances present in wsgiref's CGIHandler for 3.2 are
> distasteful but completely necessary to normalise differences in
> encodings used by various servers and platforms to generate their CGI
> environment.

I don't understand why read_environ() gives unicode values: as you
explained, the server/gateway will have to encode the values again, and
then finally to decode them from the correct encoding.

On POSIX, the current code looks like this:

 a) the OS passes a bytes environ to the program
 b) Python decodes environ from the locale encoding
 c) wsgi.read_environ() encodes environ to the locale encoding to get
back the original bytes environ: this step can be skipped if os.environb
is available
 d) wsgi.read_environ() decodes environ from ISO-8859-1
 e) the server/gateway encodes environ to ISO-8859-1
 f) the server/gateway decodes environ from the right encoding
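
To make the redundancy concrete, here is a minimal sketch of steps
(a)-(f) applied to a single value. It is only an illustration: the sample
path is made up, and UTF-8 is assumed for both the locale encoding and
the application's "right" encoding.

```python
# Sketch of steps (a)-(f) for one environ value.
# Assumptions: locale encoding and application encoding are both UTF-8,
# and the sample path is invented for the example.
raw = "/café".encode("utf-8")                           # (a) bytes from the OS
decoded = raw.decode("utf-8", "surrogateescape")        # (b) Python decodes from the locale encoding
original = decoded.encode("utf-8", "surrogateescape")   # (c) re-encode to get the bytes back
pseudo = original.decode("iso-8859-1")                  # (d) bytes-in-unicode
again = pseudo.encode("iso-8859-1")                     # (e) server/gateway undoes (d)
final = again.decode("utf-8")                           # (f) decode from the right encoding

# Steps (c) and (e) only recover what was already available at (a).
print(original == raw, again == raw, final)
```

The intermediate value at step (d) is "/cafÃ©", which is not text anyone
wants to see; it only exists to be encoded again at step (e).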

Hey! Don't you think that there are useless encode/decode steps here?
Steps (d)-(e) especially are useless and introduce confusion: the environ
contains other keys that don't come from os.environ and are already
correctly decoded, so how does the server/gateway know that they are
already correctly decoded?

I propose simply (for Python 3.2):

 a) the OS passes a bytes environ to the program: wsgi.read_environ() uses
it
 b) the server/gateway decodes environ from the right encoding

and...

> (a) os.environb doesn't exist in previous Python 3.1, making it
> impossible to implement WSGI before 3.2;

For Python 3.1, add a step between (a) and (b): encode environ to the
locale encoding (with surrogateescape) to get back the original bytes
environ.
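
That extra step could look something like the sketch below.
environ_to_bytes() is a hypothetical helper, not an existing function,
and it assumes that sys.getfilesystemencoding() matches the encoding
Python used to decode the environment.

```python
import sys

def environ_to_bytes(environ, encoding=None):
    """Hypothetical helper: rebuild a bytes environ from a str environ.

    Assumes the values were decoded with `encoding` and the
    surrogateescape error handler, so undecodable bytes round-trip.
    """
    if encoding is None:
        # Assumption: this is the encoding used to decode the environment.
        encoding = sys.getfilesystemencoding()
    return {
        key.encode(encoding, "surrogateescape"):
            value.encode(encoding, "surrogateescape")
        for key, value in environ.items()
    }

# A byte that is invalid in UTF-8 still survives the round trip.
raw = b"/caf\xe9"
decoded = raw.decode("utf-8", "surrogateescape")
print(environ_to_bytes({"PATH_INFO": decoded}, "utf-8"))
```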

> (b) a byte environment on Windows would have to be encoded
> from the Unicode environment, with a server-specific encoding,
> and then what encoding are you going to choose for the variables
> that contain non-HTTP-sourced native Unicode strings (such as,
> very commonly, Windows pathnames)?

The variables coming from the HTTP server should be encoded again to the
server-specific encoding. Other variables should be kept unchanged.

The server/gateway can simply test the type of the variable: if it's
unicode, there is nothing to do; if it's bytes, decode it from the correct
encoding.
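
A rough sketch of that type test follows. decode_environ() is a
hypothetical name, the sample keys and values are made up, and UTF-8
stands in for whatever encoding the application actually wants.

```python
def decode_environ(environ, encoding="utf-8"):
    """Hypothetical helper: decode only the values that are still bytes.

    `encoding` is an assumption; the web application would pick it.
    """
    result = {}
    for key, value in environ.items():
        if isinstance(value, bytes):
            # Came from the HTTP server as raw bytes: decode it now.
            value = value.decode(encoding)
        # str values (e.g. native Windows pathnames) are kept unchanged.
        result[key] = value
    return result

mixed = {
    "PATH_INFO": b"/caf\xc3\xa9",            # bytes from the HTTP server
    "SCRIPT_FILENAME": "C:\\www\\app.py",    # already unicode
}
print(decode_environ(mixed))
```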

> The bytes-or-bytes-in-Unicode argument is something that has been
> bounced around Web-SIG for literally *years*; (...) WSGI and wsgiref
> in Python 3.0-3.1 simply does not work.

I don't understand why you are attached to this horrible hack
(bytes-in-unicode). It introduces more work and more confusion than
using raw bytes unchanged.

It doesn't work and so something has to be changed.

Victor