[Web-SIG] String Types in WSGI [Graham's WSGI for py3]

Fri Sep 18 09:56:23 CEST 2009

2009/9/18 Armin Ronacher <armin.ronacher at active-4.com>:
> Hi,
>
> Graham currently proposes[1] the following behaviors for Strings in WSGI
> (Python version independent).  However this mail only covers the Python
> 3 part which I assume becomes a separate section in the PEP or even WSGI
> version.
>
> Terminology:
>
>  byte string == contains bytes
>  unicode string == contains unicode code points*
>  native string == what the python version uses as a string
>                   (bytes in python 2, unicode in python 3)
>
>  * ucs2 / ucs4 is ignored here.  You might still have problems
>    with surrogate pairs in ucs2 python builds and jython.
>
>> 2. For the WSGI variable 'wsgi.url_scheme' contained in the WSGI
>> environment, the value of the variable should be a native string.
>
> URLs in general are a tricky topic.  For this particular field it does
> not matter if we decide on bytes or unicode because it will always only
> contain ASCII characters.  This should be picked consistently with the
> type of PATH_INFO and SCRIPT_NAME.

I believe it does matter, and the fact that it only contains ASCII
doesn't make it any simpler. The reason is that the URL reconstruction
recipe from the WSGI PEP has to work, i.e.:

from urllib import quote
url = environ['wsgi.url_scheme']+'://'

if environ.get('HTTP_HOST'):
    url += environ['HTTP_HOST']
else:
    url += environ['SERVER_NAME']

    if environ['wsgi.url_scheme'] == 'https':
        if environ['SERVER_PORT'] != '443':
            url += ':' + environ['SERVER_PORT']
    else:
        if environ['SERVER_PORT'] != '80':
            url += ':' + environ['SERVER_PORT']

url += quote(environ.get('SCRIPT_NAME',''))
url += quote(environ.get('PATH_INFO',''))
if environ.get('QUERY_STRING'):
    url += '?' + environ['QUERY_STRING']

In Python 2.X you can concatenate byte strings and unicode strings:

>>> 'http' + u'://'
u'http://'

In Python 3.X you cannot concatenate byte strings and unicode strings:

>>> b'http'+'://'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
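
In Python 3.X the mix only works once one side has been converted
explicitly. A minimal sketch, assuming the scheme arrives as bytes and
is pure ASCII (the variable names are only for illustration):

```python
scheme = b'http'  # assumed: a server handing over the scheme as bytes

try:
    # Python 3.X refuses to mix bytes and str implicitly.
    url = scheme + '://'
except TypeError:
    # Every caller has to remember an explicit decode step instead.
    url = scheme.decode('ascii') + '://'

print(url)
```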

On the basis that SCRIPT_NAME, PATH_INFO and QUERY_STRING, when used by
a user in Python 3.X, were likely to be held as unicode strings, I saw
wsgi.url_scheme as needing to be of the same type. It is specified as a
native string, though, so it remains a byte string in Python 2.X, as we
are accustomed to now.

This is also why all the other CGI variables are similarly made to be
unicode strings. That is, so they are all the same type and stuff like
URL reconstruction will work.
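
As a rough illustration, here is a Python 3.X port of the recipe that
works unchanged when every environ value is a native (unicode) string.
The function name reconstruct_url is made up for this sketch; the only
real change is that quote lives in urllib.parse in 3.X:

```python
from urllib.parse import quote


def reconstruct_url(environ):
    """Rebuild the request URL, assuming all environ values are str."""
    url = environ['wsgi.url_scheme'] + '://'

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']

        # Only add the port when it is not the default for the scheme.
        if environ['wsgi.url_scheme'] == 'https':
            if environ['SERVER_PORT'] != '443':
                url += ':' + environ['SERVER_PORT']
        else:
            if environ['SERVER_PORT'] != '80':
                url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME', ''))
    url += quote(environ.get('PATH_INFO', ''))
    if environ.get('QUERY_STRING'):
        url += '?' + environ['QUERY_STRING']
    return url
```

If any one of these values were bytes instead, the first += that mixes
the two types would raise TypeError.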

If bytes is used, you could potentially end up with messy situations
where you have to perform URL reconstruction as bytes, but then
convert it to unicode strings to stuff it in as a parameter into some
templating system where the template text is unicode.

If SCRIPT_NAME, PATH_INFO and QUERY_STRING are in bytes form and they
need different encodings, how do you easily convert your byte strings
to the unicode string needed to stuff in the template? I can't see how
you could; they really need to be in unicode if everything else in the
system is going to be unicode. Or are templating systems now going to
be expected to drop down and use bytes all the time as well?
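
A small sketch of that situation, with made-up values: SCRIPT_NAME is
assumed to be Latin-1 bytes and PATH_INFO UTF-8 bytes. Joined as bytes
they cannot be decoded in one step, so each piece has to be decoded
with its own encoding before it can go into a unicode template:

```python
# Hypothetical values; the encodings are assumptions for this sketch.
script_name = '/caf\u00e9'.encode('latin-1')  # b'/caf\xe9'
path_info = '/na\u00efve'.encode('utf-8')     # b'/na\xc3\xafve'

joined = script_name + path_info  # fine as bytes ...

try:
    joined.decode('utf-8')        # ... but not decodable as one encoding
    decodable = True
except UnicodeDecodeError:
    decodable = False

# The application must know and apply each encoding separately.
text = script_name.decode('latin-1') + path_info.decode('utf-8')
```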

> However Graham moves further away from that in the rest of the blog post
> because he wants to point out that people use WSGI directly and that
> explicit bytestrings in Python 3 confuse people.  The latest iteration
> in the blog post is not to use bytestrings in a single location except
> for headers and the input stream.

Plus the response content would need to be bytes, albeit allowing an
ISO-8859-1 fallback if it is unicode, as for other response items. The
use of unicode exclusively is only really a big factor in WSGI
environment variables.

> I thought a lot about this in the past and I welcome the step to make
> WSGI harder to use!  This might sound absurd, but once encodings are
> really explicit, people will think about it.  I think we should
> discourage *applications* written in WSGI and link to implementations in
> the PEP.

As a way of deterring a lot of users, making it harder to use, or at
least making it more obvious that thought is required, would be quite
effective.

This would also be good in pushing people to use existing
frameworks/toolkits which deal with all this stuff internally and hide
it and instead present unicode strings at a higher level after doing
everything correctly.

So, it may well curtail the NIH issue that is becoming a problem, but I
am not sure that doing that, and making it harder for users who want to
work at that level, is a good idea.

As others have pointed out, the likes of rack and jack (I am not sure
about the new Perl variant) don't seem to have an issue with using
unicode.

> The big problems are always PATH_INFO and SCRIPT_NAME.  Those are the
> only values that are in the dict URL-decoded and might contain non-ASCII
> characters. (except for headers, but that's a different story because
> the only real-world problem there are cookie headers and those are
> troubling for more reasons than just character sets)
>
> My latest change to the WSGI sandbox hg repo [2] was that I added a
> notice that later PEP revisions might document a RAW_SCRIPT_NAME or
> something that contains the URL quoted values.  It however turns out
> that this value is not available from within a webserver context (We're
> talking about Apache and IIS here) so that the problem of unquoted
> values will not go away.

I am still waiting for a good explanation of why access to the raw
URL quoted values is so important. Can you please explain what the
requirement is?

The only example I recall was related to web servers eliminating
repeated slashes, thereby effectively making it impossible to have
URLs in query strings without a custom encoding string. Since there
are alternatives, I don't find that alone a compelling argument.

> It also introduces the concept of URI encodings.  I'm especially unhappy
> with this part.  It would mean that implementations would have to follow
> the WSGI URI encoding if set.

No it doesn't. The whole point of providing wsgi.uri_encoding was so
that a WSGI application would know the encoding, so as to be able to
reverse it to bytes and convert it to something else. Given that you
accept below that most of the time latin1 or UTF-8 would be used, the
typical case would be handled automatically and so transcoding
wouldn't be required.
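
A minimal sketch of that reversal, assuming the server stores the name
of the encoding it applied under the proposed 'wsgi.uri_encoding' key.
The helper name path_info_as is invented here and is not part of any
spec:

```python
def path_info_as(environ, wanted):
    """Re-decode PATH_INFO with a different encoding than the server used."""
    used = environ['wsgi.uri_encoding']  # encoding the server applied
    # Reverse the server's decoding to recover the original bytes ...
    raw = environ.get('PATH_INFO', '').encode(used)
    # ... then decode them the way the application actually wants.
    return raw.decode(wanted)


# Example: the client sent UTF-8, but the server decoded it as latin-1.
sent = '/caf\u00e9'.encode('utf-8')
environ = {
    'wsgi.uri_encoding': 'latin-1',
    'PATH_INFO': sent.decode('latin-1'),
}
print(path_info_as(environ, 'utf-8'))
```

When the server's choice already matches what the application wants,
none of this is needed and PATH_INFO can be used as is.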

> Most of the applications are using either
> latin1 or UTF-8 URLs, I would leave that including the decoding of *all*
> incoming data to the user.
>
> So yes, I'm all for definition #1 in the blog post where Graham says:
>
>> The first is that although WSGI 1.0 on Python 3.X should strictly be
>> bytes everywhere as per Definition #1, it is probably too late to
>> enforce this now.
> I don't think so.  Reasoning: Python 3.0 does not work and is considered
> outdated, Python 3.1 might ship with a wsgiref that's against a
> revisioned spec, but cgi.FieldStorage is still broken there, making it
> impossible to use for anything but small applications.

I'll summarise where people are falling in respect of which definition
they want in a later post, after more of the key figures have indicated
their choices.

Graham