[Python-Dev] PEP 3333: wsgi_string() function
Victor Stinner
victor.stinner at haypocalc.com
Fri Jan 7 12:51:01 CET 2011
On Thursday, 06 January 2011 at 23:50 +0000, And Clover wrote:
> On Tue, 2011-01-04 at 03:44 +0100, Victor Stinner wrote:
> > What is this horrible encoding "bytes-as-unicode"?
>
> It is a unicode string decoded from bytes using ISO-8859-1. ISO-8859-1
> is the encoding specified by the HTTP RFC, as well as having the happy
> property of preserving every input byte. PEP 3333 requires it.
ISO-8859-1 for all fields: SERVER_NAME, PATH_INFO, the URL, form
data, ...?
> > os.environ is supposed to be correctly decoded and contain valid
> > unicode characters.
>
> It is not possible to ‘correctly’ decode to unicode for os.environ
> because that decoding happens long before the web application (the
> only party that knows what encoding should be in use) gets a look in.
Agreed.
> Maybe the web application is using UTF-8, maybe it's using cp1252,
> but if we let the server/gateway decide and do that decoding (...)
> It's an absolutely necessary idea. The locale encoding is nothing
> to do with the web application's encoding. (...)
Ok, so you must pass byte strings to the server/gateway. If you pass
unicode, how does the server/gateway know that it has to redecode a value?
Should it redecode all values? Anyway, it is stupid to use a temporary,
useless pseudo-encoding (bytes-in-unicode).
> The recoding dances present in wsgiref's CGIHandler for 3.2 are
> distasteful but completely necessary to normalise differences in
> encodings used by various servers and platforms to generate their CGI
> environment.
I don't understand why read_environ() gives unicode values: as you
explained, the server/gateway will have to encode the values again, and
then finally to decode them from the correct encoding.
On POSIX, the current code looks like this:
a) the OS passes a bytes environ to the program
b) Python decodes environ from the locale encoding
c) wsgi.read_environ() encodes environ to the locale encoding to get
back the original bytes environ: this step can be skipped if os.environb
is available
d) wsgi.read_environ() decodes environ from ISO-8859-1
e) the server/gateway encodes environ to ISO-8859-1
f) the server/gateway decodes environ from the right encoding
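The six steps above can be sketched roughly as follows. This is only an
illustration, not the wsgiref code: it assumes a UTF-8 locale and a
made-up PATH_INFO value, and all variable names are hypothetical.

```python
# (a) The OS hands the process raw bytes. Hypothetical value: the UTF-8
# bytes of the path "/caf\u00e9".
raw = "/caf\u00e9".encode("utf-8")

# (b) Python decodes the environment from the locale encoding
# (assumed to be UTF-8 here).
locale_encoding = "utf-8"
decoded = raw.decode(locale_encoding, "surrogateescape")

# (c) read_environ() encodes back to the locale encoding to recover the
# original bytes (can be skipped when os.environb is available).
recovered = decoded.encode(locale_encoding, "surrogateescape")

# (d) read_environ() decodes those bytes from ISO-8859-1: this is the
# "bytes-in-unicode" form.
bytes_in_unicode = recovered.decode("iso-8859-1")

# (e) The consumer encodes back to ISO-8859-1 to get the bytes again.
again_bytes = bytes_in_unicode.encode("iso-8859-1")

# (f) The consumer finally decodes from the encoding it actually wants.
path = again_bytes.decode("utf-8")
```

Note that `recovered` and `again_bytes` are both identical to `raw`:
steps (b)-(c) and (d)-(e) each cancel out.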
Hey! Don't you think that there are useless encode/decode steps here?
Especially (d)-(e), which is useless and introduces confusion: the environ
contains other keys that don't come from os.environ and are already
correctly decoded, so how does the server/gateway know that they are
already correctly decoded?
I propose simply (for Python 3.2):
a) the OS passes a bytes environ to the program: wsgi.read_environ() uses
it
b) the server/gateway decodes environ from the right encoding
and...
> (a) os.environb doesn't exist in previous Python 3.1, making it
> impossible to implement WSGI before 3.2;
For Python 3.1, add a step between (a) and (b): encode environ to the
locale encoding (with surrogateescape) to get back the original bytes
environ.
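A minimal sketch of that extra step, assuming that the encoding reported
by sys.getfilesystemencoding() is the one Python used to decode the
environment. The helper name environ_as_bytes is hypothetical.

```python
import os
import sys


def environ_as_bytes(environ=None, encoding=None):
    """Return a bytes copy of a str environment (hypothetical helper).

    surrogateescape turns the lone surrogates produced for undecodable
    bytes back into the original bytes.
    """
    if environ is None:
        environ = os.environ
    if encoding is None:
        # Assumption: this matches the encoding used to decode os.environ.
        encoding = sys.getfilesystemencoding()
    return {
        key.encode(encoding, "surrogateescape"):
            value.encode(encoding, "surrogateescape")
        for key, value in environ.items()
    }
```

For example, a value decoded from the invalid UTF-8 bytes b"/caf\xe9" is
stored as "/caf\udce9" and comes back as b"/caf\xe9".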
> (b) a byte environment on Windows would have to be encoded
> from the Unicode environment, with a server-specific encoding,
> and then what encoding are you going to choose for the variables
> that contain non-HTTP-sourced native Unicode strings (such as,
> very commonly, Windows pathnames)?
The variables coming from the HTTP server should be encoded again to the
server-specific encoding. Other variables should be kept unchanged.
The server/gateway can simply test the type of the variable: if it's
unicode, there is nothing to do; if it's bytes, decode it from the correct
encoding.
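The type test could look something like this. It is only a sketch: the
helper name decode_environ and the UTF-8 default are assumptions, and the
real encoding would be whatever the application chooses.

```python
def decode_environ(environ, encoding="utf-8"):
    """Decode bytes values, keep str values as-is (hypothetical helper)."""
    decoded = {}
    for key, value in environ.items():
        if isinstance(value, bytes):
            # Came from the HTTP server as raw bytes: decode it now.
            value = value.decode(encoding)
        # str values (e.g. native Windows pathnames) are left unchanged.
        decoded[key] = value
    return decoded
```

With a mixed environ such as {"PATH_INFO": b"/caf\xc3\xa9",
"SCRIPT_FILENAME": "C:\\www\\app.py"}, only PATH_INFO is decoded.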
> The bytes-or-bytes-in-Unicode argument is something that has been
> bounced around Web-SIG for literally *years*; (...) WSGI and wsgiref
> in Python 3.0-3.1 simply does not work.
I don't understand why you are attached to this horrible hack
(bytes-in-unicode). It introduces more work and more confusion than
using raw bytes unchanged.
It doesn't work and so something has to be changed.
Victor