From and-py at doxdesk.com  Tue Dec  1 01:44:48 2009
From: and-py at doxdesk.com (And Clover)
Date: Tue, 01 Dec 2009 01:44:48 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <88e286470911282028o5849c853od3b8239cc59f8d00@mail.gmail.com>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>	<88e286470911271327p24dc978at5ee46e3ad1c99220@mail.gmail.com>	<88e286470911281944s1a926ccaq600682e8aa573912@mail.gmail.com>
	<88e286470911282028o5849c853od3b8239cc59f8d00@mail.gmail.com>
Message-ID: <4B146700.3010608@doxdesk.com>

Graham Dumpleton wrote:

> Answering my own question, it is actually obvious that it has to be
> called (1, 0). This is because wsgiref in Python 3.X already calls it
> (1, 0) and don't have much choice to be in agreement with that.

wsgiref.simple_server in Python 3 to date is not something that anyone 
should worry about being compatible with. It is a 2to3 hack that cannot 
meaningfully claim to represent wsgi version anything.

Careless use of urllib.parse.unquote causes 3.0's simple_server not to 
work at all, and 3.1's to mangle the path by treating it as UTF-8 
instead of ISO-8859-1, as 'WSGI 1.1' proposed and mod_wsgi (and even 
mod_cgi via wsgiref.CGIHandler) delivered.
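
A minimal sketch of the difference described above, using the present-day
urllib.parse.unquote (which defaults to UTF-8 with errors='replace'). The
sample path is made up for illustration, and this is not the code of
wsgiref.simple_server itself:

```python
from urllib.parse import unquote

# Hypothetical request path with a percent-encoded ISO-8859-1 byte
# (0xE9, e-acute in Latin-1). On its own this byte is not valid UTF-8.
raw_path = '/caf%E9'

# Default behaviour: decode as UTF-8 and replace undecodable bytes with
# U+FFFD, so the original byte can no longer be recovered.
as_utf8 = unquote(raw_path)

# Decoding as ISO-8859-1 maps every byte to exactly one code point...
as_latin1 = unquote(raw_path, encoding='iso-8859-1')

# ...so the application can get the real bytes back by re-encoding.
recovered = as_latin1.encode('iso-8859-1')
```

The point is only that the ISO-8859-1 form is reversible while the UTF-8
form with replacement is not.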

Yes, I'm always going on about Unicode paths. I'm fed up of shipping 
apps with a page-long deployment note about fixing them. It pains me 
that in so many years both this and "What do we do about Python 3?" 
still haven't been addressed.

mod_wsgi 3.0 already has more traction than wsgiref 3.1 and I would 
prefer not to see more farcical reverse-progress at this point.

For what it's worth, here are my responses on the issues of this thread.
But at this point I really just want a BDFL to come and do it, whatever
it is. A new WSGI, whatever the version number, is massively overdue.

 >> 1. The 'readline()' function of 'wsgi.input' may optionally take a
 >> size hint.

Yes. Obviously. Bad practice but unavoidable now. Should have been a 1.0 
amendment a long time ago.

 >> 2. The 'wsgi.input' must provide an empty string as end of input
 >> stream marker.
 >> 3. The size argument to 'read()' function of 'wsgi.input' would be
 >> optional and if not supplied the function would return all available
 >> request content.
 >> 4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must honour
 >> the Content-Length response header and must only return from the file
 >> that amount of content.

+0. Seems reasonable but don't massively care. Presumably an application 
must refuse to run on 1.0 if it requires these behaviours?

 >> 5. Any WSGI application or middleware should not return more data
 >> than specified by the Content-Length response header if defined.
 >> 6. The WSGI adapter must not pass on to the server any data above
 >> what the Content-Length response header defines if supplied.

Yes.
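
A rough sketch of what items 5 and 6 amount to on the adapter side. The
function name cap_to_content_length is hypothetical, and this is not how
mod_wsgi or any other adapter actually implements it:

```python
def cap_to_content_length(chunks, content_length):
    """Yield byte chunks from `chunks`, never exceeding content_length bytes in total."""
    remaining = content_length
    for chunk in chunks:
        if remaining <= 0:
            # The declared length has been reached; drop everything else.
            break
        if len(chunk) > remaining:
            # Truncate the chunk that would cross the declared length.
            chunk = chunk[:remaining]
        remaining -= len(chunk)
        yield chunk


# Example: the application declared Content-Length: 8 but produced more.
body = b''.join(cap_to_content_length([b'hello ', b'world', b'!!!'], 8))
```

A real adapter would also need to call close() on the application's
iterable; that is left out here.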

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From foom at fuhm.net  Tue Dec  1 02:41:39 2009
From: foom at fuhm.net (James Y Knight)
Date: Mon, 30 Nov 2009 20:41:39 -0500
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <01CC6582-B272-4CC8-B87A-95B683965FB2@fuhm.net>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<88e286470911271327p24dc978at5ee46e3ad1c99220@mail.gmail.com>
	<88e286470911281944s1a926ccaq600682e8aa573912@mail.gmail.com>
	<01CC6582-B272-4CC8-B87A-95B683965FB2@fuhm.net>
Message-ID: <D0732212-B1E6-4760-9252-1351EE8564C2@fuhm.net>

On Nov 29, 2009, at 12:40 AM, James Y Knight wrote:
> The next step here is clearly for someone to redraft the changes as a diff against PEP 333. If you do not have any interest in being that person, please make that clear, so someone else can step up to do so.

Okay, not sensing any other volunteers here...I guess it's all me.

The intention of this spec update is to be compatible with existing middleware/applications when running on Python 2.X. Apps/middleware running on python 3.X require changes in any case, and this specification will tell them exactly what to expect. That Python 3.X middleware and WSGI adapters will have to deal with both bytestrings and unicode strings in many parts of the API (output status code, output headers, output response iterable/write callback) will add some complexity, but that's life.

Any WSGI implementations on Python 3.X claiming compliance to WSGI 1.0 are most likely broken, and their behavior cannot be relied upon. Too bad about wsgiref.

As self-appointed author, I am going to take a stand and say that both the python3-related string-type specifications, and the additional requirements except #3 (read() with no-args) and #4 (file_wrapper looking at Content-Length), will be included.

And it will be called WSGI 1.1.

Back to the list of "extra requirements":

#1: (readline with an arg) must be included, despite the potential for breakage. That ship has already sailed, the breakage has already occurred, it's already required. Disagreement here really is of no consequence.

#2: (wsgi.input() must return EOF at EOF): I do not believe this will break any middleware. It will require some changes in some WSGI adapter implementations, but that's acceptable. If you have a real-life example of middleware that would break here, show it. So this will be included.

#3 is not actually required for anything; at best it's an extra convenience; repeatedly reading until EOF will work just as well. Furthermore, the API change has the potential to break some middleware in Python 2.X, so I'll take the safe road and not make the change.

The purpose behind #4 is essentially included in #6, and so is not needed as a separate requirement.

#5 and #6 are uncontroversial and of no impact to an already-correct implementation. They will be included.
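
To illustrate the remark on #3, that repeatedly reading until EOF works just as well as a no-argument read(): a minimal sketch, assuming the input stream honours #2 and returns an empty string at end of input. The helper name read_all and the chunk size are made up for illustration, and io.BytesIO stands in for a real wsgi.input.

```python
import io


def read_all(stream, chunk_size=8192):
    """Read a stream to the end using only read(size)."""
    parts = []
    while True:
        data = stream.read(chunk_size)
        if not data:
            # An empty result is the end-of-input marker (requirement #2).
            break
        parts.append(data)
    return b''.join(parts)


# io.BytesIO is only a stand-in for environ['wsgi.input'].
body = read_all(io.BytesIO(b'a=1&b=2'), chunk_size=3)
```

This relies on nothing beyond read() with a size argument, which WSGI 1.0 already requires.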

I'll send a diff of the actual wording changes once I've written it.

James

From manlio_perillo at libero.it  Thu Dec  3 11:55:51 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 11:55:51 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
Message-ID: <4B179937.8070305@libero.it>

James Y Knight wrote:
> I move to bless mod_wsgi's definition of WSGI 1.1 [1]
> [...]
> 
> [1] http://code.google.com/p/modwsgi/wiki/SupportForPython3X

Hi.

Just a few questions.

It is true that HTTP headers can be encoded assuming latin-1; and they
can be encoded using PEP 383.

However, what about URIs (that is, PATH_INFO and the like)?
For URIs (if I remember correctly) the suggested encoding is UTF-8, so
URLs should be decoded using

  url.decode('utf-8', 'surrogateescape')

Is this correct?


Now another question.
Let's consider the `wsgiref.util.application_uri` function

def application_uri(environ):
    url = environ['wsgi.url_scheme']+'://'
    from urllib.parse import quote

    if environ.get('HTTP_HOST'):
        url += environ['HTTP_HOST']
    else:
        url += environ['SERVER_NAME']

        if environ['wsgi.url_scheme'] == 'https':
            if environ['SERVER_PORT'] != '443':
                url += ':' + environ['SERVER_PORT']
        else:
            if environ['SERVER_PORT'] != '80':
                url += ':' + environ['SERVER_PORT']

    url += quote(environ.get('SCRIPT_NAME') or '/')
    return url


There is a potential problem, here, with the quote function.
This function does the following:

def quote(string, safe='/', encoding=None, errors=None):
    if isinstance(string, str):
        if encoding is None:
            encoding = 'utf-8'
        if errors is None:
            errors = 'strict'
        string = string.encode(encoding, errors)

This means that if we use surrogateescape, the information about the
original bytes is lost here.

This can be easily fixed by changing the application_uri function, but
this also means that a WSGI application will not work with Python 3.1.x.
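
A small sketch of the problem with quote(), run against a current
Python 3 rather than 3.1 (an assumption; the behaviour of 3.1 is not
checked here). The sample bytes are made up. With the default strict
error handler, encoding a surrogate-escaped string fails outright;
passing errors='surrogateescape' restores the original byte:

```python
from urllib.parse import quote

# Hypothetical SCRIPT_NAME bytes containing a non-UTF-8 byte (0xE9).
raw = b'/caf\xe9'

# Decoded the way the question above proposes: the bad byte becomes
# the lone surrogate U+DCE9.
script_name = raw.decode('utf-8', 'surrogateescape')

# quote() encodes with UTF-8/strict by default, which rejects surrogates.
try:
    quote(script_name)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Asking quote() to use surrogateescape gets the original byte back.
quoted = quote(script_name, errors='surrogateescape')
```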


Finally, a question about cookies.
Cookie data SHOULD be transparent to the server/gateway; however WSGI is
going to assume that data is encoded in latin-1.

I don't know what the HTTP/Cookie spec says about this.
However, from a WSGI application point of view, the cookie data can, as
an example, contain some text encoded in UTF-8; this means that the
application must first encode the data:

  cookie_bytes = cookie.encode('latin-1', 'surrogateescape')

and then decode it using UTF-8:

  my_cookie_data = cookie_bytes.decode('utf-8')


This is a bit unreasonable, but I don't know if this is a common
practice (I do this, just to make an example).
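
The two steps above, put together as a runnable sketch. The cookie value
is made up, and the first two lines only simulate what a gateway would
do under the latin-1 proposal. Note that with a plain latin-1 decode
every code point is below U+0100, so the surrogateescape handler is not
strictly needed on the way back:

```python
# Bytes the browser sent: UTF-8 text chosen by the application
# ('caff' followed by e-grave).
wire_bytes = 'caff\u00e8'.encode('utf-8')

# What the application would see in environ if the gateway decodes
# headers as latin-1 (simulated here).
cookie = wire_bytes.decode('latin-1')

# Step 1: get the raw bytes back.
cookie_bytes = cookie.encode('latin-1')

# Step 2: decode them with the encoding the application really used.
my_cookie_data = cookie_bytes.decode('utf-8')
```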


Manlio Perillo

From manlio_perillo at libero.it  Thu Dec  3 15:49:08 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 15:49:08 +0100
Subject: [Web-SIG] HTTP headers encoding
Message-ID: <4B17CFE4.3020504@libero.it>

Hi.

I'm doing some tests to try to understand how HTTP headers are encoded
by browsers.

I have written a simple WSGI application that asks for authentication
credentials, prints them on the terminal, and returns the data in the
response as raw bytes:
http://paste.pocoo.org/show/154633/

Then I used some browsers to try to send a username with non-ASCII
characters.


When I try with simple characters in the iso-8859-1 charset, things
work well; the data is encoded using this charset.

However, when I try to use some extraneous character, like the Euro
sign, there are problems.

Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a
'\xac'

I don't know where \xac comes from, but it is the last byte of the
utf-8 encoded Euro: '\xe2\x82\xac'


Internet Explorer 6.0 sends me a
'\x80'
and this is the Euro character encoded using cp1252 (and I suspect
that it always uses this encoding, instead of iso-8859-1).

Unfortunately I can not test with IE 7 and 8.


With a browser working on a terminal, like lynx, things get worse.
If I enter as user name the string "àè", lynx sends me
'\xc3\xa0\xc3\xa8'

This happens in a GNOME terminal, with an it_IT.utf8 locale.

wget and curl do the same.


Can someone else reproduce this?


Thanks   Manlio

From manlio_perillo at libero.it  Thu Dec  3 17:09:31 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 17:09:31 +0100
Subject: [Web-SIG] HTTP headers encoding
In-Reply-To: <4B17CFE4.3020504@libero.it>
References: <4B17CFE4.3020504@libero.it>
Message-ID: <4B17E2BB.9040806@libero.it>

Manlio Perillo wrote:
> Hi.
> 
> I'm doing some tests to try to understand how HTTP headers are encoded
> by browsers.
> 
> I have written a simple WSGI application that asks authentication
> credentials and then print them on the terminal and return the data as
> response, as raw bytes
> http://paste.pocoo.org/show/154633/
> 

I'm now testing using HTTP Digest Authentication.
The application is here:
http://paste.pocoo.org/show/154667/

It uses my wsgix framework
http://hg.mperillo.ath.cx/wsgix/
since I don't want to rewrite the entire Digest Authentication handling.


As the user name I use the string "???".
The results are:

- Firefox does not send any request, and instead it shows me the
  returned response body "Authentication required".

  This is quite strange.

- Internet Explorer 6 encodes the username using cp1252, as always.

- Opera (10.01) encodes the username using utf-8.

I can not test with Konqueror, since the wsgiref server has problems
with it.


All these implementations violate the HTTP spec.
The username is a quoted-string, so it SHOULD be encoded using the
default latin-1, or using another charset, in which case it should be
formatted as specified by MIME (unfortunately there are no examples in
the HTTP spec).


This is really a mess.
How is authorization username handled in common WSGI frameworks?


Thanks  Manlio

From and-py at doxdesk.com  Thu Dec  3 19:35:14 2009
From: and-py at doxdesk.com (And Clover)
Date: Thu, 03 Dec 2009 19:35:14 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B179937.8070305@libero.it>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it>
Message-ID: <4B1804E2.9070807@doxdesk.com>

Manlio Perillo wrote:

> However what about URI (that is, for PATH_INFO and the like)?
> For URI (if I remember correctly) the suggested encoding is UTF-8, so
> URLS should be decoded using

>   url.decode('utf-8', 'surrogateescape')

> Is this correct?

The currently-discussed proposal is ISO-8859-1, allowing the real bytes 
to be trivially extracted. This is consistent with the other headers and 
would be my preferred approach.

Python 3.1's wsgiref.simple_server, on the other hand, blindly uses 
urllib.parse.unquote, which defaults to UTF-8 without surrogateescape, 
mangling any non-UTF-8 input.

I don't really care whether UTF-8+surrogateescape or ISO-8859-1 encoding 
is blessed. But *something* needs to be blessed. An encoding, an 
alternative undecoded path_info, both, something else... just *something*.

> Let's consider the `wsgiref.util.application_uri` function
> There is a potential problem, here, with the quote function.

Yes. wsgiref is broken in Python 3.1. Not quite as broken as it was in 
3.0, but still broken. Until we can come to a Pronouncement on what WSGI 
*is* in Python 3, it is meaningless anyway.

> Cookie data SHOULD be transparent to the server/gateway; however WSGI is
> going to assume that data is encoded in latin-1.

Yeah. This is no big deal because non-ASCII characters in cookies are 
already broken everywhere(*). Given this and other limitations on what 
characters can go in cookies, they are habitually encoded using ad-hoc 
mechanisms handled by the application (typically a round of URL-encoding).

*: in particular:

- Opera and Chrome send non-ASCII cookie characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
   mangling any characters that don't fit in the codepage through the
   traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
   gets through but everything else is mangled)
- Safari refuses to send any cookie containing non-ASCII characters.

> I don't know what the HTTP/Cookie spec says about this.

The traditional interpretation of RFC2616 is that headers are ISO-8859-1.

You will notice that no browser correctly follows this.

...sigh.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/




From manlio_perillo at libero.it  Thu Dec  3 19:52:14 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 19:52:14 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1804E2.9070807@doxdesk.com>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>	<4B179937.8070305@libero.it>
	<4B1804E2.9070807@doxdesk.com>
Message-ID: <4B1808DE.5080705@libero.it>

And Clover wrote:
> [...]
>> Cookie data SHOULD be transparent to the server/gateway; however WSGI is
>> going to assume that data is encoded in latin-1.
> 
> Yeah. This is no big deal because non-ASCII characters in cookies are
> already broken everywhere(*). Given this and other limitations on what
> characters can go in cookies, they are habitually encoded using ad-hoc
> mechanisms handled by the application (typically a round of URL-encoding).
> 
> *: in particular:
> 
> - Opera and Chrome send non-ASCII cookie characters in UTF-8.
> - IE encodes using the system codepage (which can never be UTF-8),
>   mangling any characters that don't fit in the codepage through the
>   traditional Windows 'similar replacement character' scheme.
> - Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
>   gets through but everything else is mangled)
> - Safari refuses to send any cookie containing non-ASCII characters.
> 

Thanks for this summary.
I think it should go in a wiki or in a separate document (like
rationale) to the WSGI spec.

However, this should never happen with cookies, since cookie data is
opaque to the browser, and it MUST send it "as is".

What you describe happens with other headers containing TEXT.
And now I understand that strange behaviour of Firefox with non latin-1
strings in the username, in HTTP Basic Authentication.

> [...]

Regards   Manlio

From foom at fuhm.net  Thu Dec  3 20:00:27 2009
From: foom at fuhm.net (James Y Knight)
Date: Thu, 3 Dec 2009 14:00:27 -0500
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1804E2.9070807@doxdesk.com>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
Message-ID: <1D42E723-CBD1-46B3-A1D0-53CA126CC6E2@fuhm.net>

On Dec 3, 2009, at 1:35 PM, And Clover wrote:
> Manlio Perillo wrote:
> 
>> However what about URI (that is, for PATH_INFO and the like)?
>> For URI (if I remember correctly) the suggested encoding is UTF-8, so
>> URLS should be decoded using
> 
>>  url.decode('utf-8', 'surrogateescape')
> 
>> Is this correct?
> 
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes to be trivially extracted. This is consistent with the other headers and would be my preferred approach.

Right, for WSGI 1.1 on Python 3.x, 8859-1 strings is the plan. Other, more ideologically pure options can be discussed for an incompatible revision of WSGI (e.g. the hypothetical 2.0).

BTW: I hope to have a first draft of the changes by Monday. (But don't beat up on me if it's delayed; I am working on it.)

James

From and-py at doxdesk.com  Thu Dec  3 20:11:54 2009
From: and-py at doxdesk.com (And Clover)
Date: Thu, 03 Dec 2009 20:11:54 +0100
Subject: [Web-SIG] HTTP headers encoding
In-Reply-To: <4B17CFE4.3020504@libero.it>
References: <4B17CFE4.3020504@libero.it>
Message-ID: <4B180D7A.9000708@doxdesk.com>

Manlio Perillo wrote:

> I have written a simple WSGI application that asks authentication
> credentials

Ho ho! This is another area that is Completely Broken Everywhere. It's 
actually a similar situation to the cookies:

- Opera and Chrome send non-ASCII credential characters in UTF-8.
- IE encodes using the system codepage (which can never be UTF-8),
   mangling any characters that don't fit in the codepage through the
   traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
   gets through but everything else is mangled)
- Safari uses ISO-8859-1, and refuses to send any credentials containing
   characters outside the 8859-1 repertoire.
- Konqueror uses ISO-8859-1, and replaces any non-8859-1 character
   with a question mark.

The HTTP standard has nothing to say about the encoding in use *inside* 
the base64-encoded Authorization byte-string token. It's anyone's guess, 
and every browser has guessed differently. (Safari here is at least 
slightly better than its behaviour with the cookies.)

 > (and I suspect that [IE] always use this encoding, instead of
 > iso-8859-1).

It will certainly never send ISO-8859-1, but what it does send is locale 
dependent. Type an e-acute in your username on a Western machine and 
it'll send one byte sequence; type the same thing on an Eastern European 
Windows install and you'll get something quite different.

> Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a '\xac'

> I don't know where \xac come from

It's the low byte of UCS-2 codepoint U+20AC (EURO SIGN). Firefox simply 
discards the top 8 bits of each codepoint.
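
The three byte sequences reported in this thread for the Euro sign can
be reproduced with a few lines. This is only an illustration of the
arithmetic, not of any browser's actual code:

```python
euro = '\u20ac'  # EURO SIGN

# What a UTF-8 sender puts on the wire.
utf8_bytes = euro.encode('utf-8')

# What a Windows-1252 sender puts on the wire.
cp1252_bytes = euro.encode('cp1252')

# Keeping only the low 8 bits of each code point (0x20AC & 0xFF == 0xAC).
low_bytes = bytes(ord(ch) & 0xFF for ch in euro)
```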

> Unfortunately I can not test with IE 7 and 8.

The behaviour has not changed.

 > This is really a mess.

Isn't it.

 > How is authorization username handled in common WSGI frameworks?

No-one supports non-ASCII characters in Authentication. Most web authors 
simply move to cookies instead.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/


From henry at precheur.org  Thu Dec  3 20:25:25 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 3 Dec 2009 11:25:25 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1804E2.9070807@doxdesk.com>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
Message-ID: <20091203192525.GA3792@banane.novuscom.net>

On Thu, Dec 03, 2009 at 07:35:14PM +0100, And Clover wrote:
> >I don't know what the HTTP/Cookie spec says about this.
> 
> The traditional interpretation of RFC2616 is that headers are ISO-8859-1.
> 
> You will notice that no browser correctly follows this.

RFC 2109 and RFC 2965 say that a cookie's value can be anything:

> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.

Theoretically you could put something like: 'foo\n\0bar' in a cookie.

Also a cookie can include comments which have to be encoded using ...
UTF-8:

> Comment=value
>   OPTIONAL.  Because cookies can be used to derive or store
>   private information about a user, the value of the Comment
>   attribute allows an origin server to document how it intends to
>   use the cookie.  The user can inspect the information to decide
>   whether to initiate or continue a session with this cookie.
>   Characters in value MUST be in UTF-8 encoding.

-- 
  Henry Precheur

From henry at precheur.org  Thu Dec  3 20:26:28 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 3 Dec 2009 11:26:28 -0800
Subject: [Web-SIG] HTTP headers encoding
In-Reply-To: <4B17E2BB.9040806@libero.it>
References: <4B17CFE4.3020504@libero.it> <4B17E2BB.9040806@libero.it>
Message-ID: <20091203192628.GA18929@banane.novuscom.net>

On Thu, Dec 03, 2009 at 05:09:31PM +0100, Manlio Perillo wrote:
> This is really a mess.

RFC 2617 doesn't specify any encoding for its headers, so it should be
latin-1 everywhere. But on the web nobody respects standards.

> How is authorization username handled in common WSGI frameworks?

As far as I know, they don't handle this. They just return the string
without dealing with the encoding issues.

I think there is no correct way of handling this, because 99% of
username/password contain only ascii characters. A possible 'workaround'
would be to limit yourself to the ascii charset. If you get a non-ascii
character raise an Exception.
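
A minimal sketch of that workaround for Basic authentication. The
function name parse_basic_auth and the error messages are made up, and
this is not taken from any existing framework:

```python
import base64


def parse_basic_auth(header_value):
    """Return (username, password) from a Basic Authorization header value.

    Raises ValueError for anything that is not plain ASCII.
    """
    scheme, _, token = header_value.partition(' ')
    if scheme.lower() != 'basic':
        raise ValueError('not a Basic Authorization header')
    # The token is the base64 form of the raw "user:password" bytes.
    raw = base64.b64decode(token.strip())
    try:
        text = raw.decode('ascii')
    except UnicodeDecodeError:
        # Browsers disagree on the encoding, so refuse instead of guessing.
        raise ValueError('non-ASCII credentials are not supported') from None
    username, sep, password = text.partition(':')
    if not sep:
        raise ValueError('malformed credentials')
    return username, password
```

An application would typically call it with
environ.get('HTTP_AUTHORIZATION', '') and turn a ValueError into a
401 or 400 response.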

-- 
  Henry Precheur

From manlio_perillo at libero.it  Thu Dec  3 20:33:19 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 20:33:19 +0100
Subject: [Web-SIG] HTTP headers encoding
In-Reply-To: <20091203192628.GA18929@banane.novuscom.net>
References: <4B17CFE4.3020504@libero.it> <4B17E2BB.9040806@libero.it>
	<20091203192628.GA18929@banane.novuscom.net>
Message-ID: <4B18127F.8070606@libero.it>

Henry Precheur wrote:
> [...]
>> How is authorization username handled in common WSGI frameworks?
> 
> As far as I know, they don't handle this. They just return the string
> without dealing with the encoding issues.
> 
> I think there is no correct way of handling this, because 99% of
> username/password contain only ascii characters. A possible 'workaround'
> would be to limit yourself to the ascii charset. If you get a non-ascii
> character raise an Exception.
> 

Right now I'm doing a: username.decode('us-ascii', 'replace')


Regards  Manlio

From henry at precheur.org  Thu Dec  3 20:43:32 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 3 Dec 2009 11:43:32 -0800
Subject: [Web-SIG] HTTP headers encoding
In-Reply-To: <4B18127F.8070606@libero.it>
References: <4B17CFE4.3020504@libero.it> <4B17E2BB.9040806@libero.it>
	<20091203192628.GA18929@banane.novuscom.net>
	<4B18127F.8070606@libero.it>
Message-ID: <20091203194332.GA4875@banane.novuscom.net>

On Thu, Dec 03, 2009 at 08:33:19PM +0100, Manlio Perillo wrote:
> Right now I'm doing a: username.decode('us-ascii', 'replace')

Or like most frameworks you could let the application author deal with
the problem, just pass the raw strings to the application.

-- 
  Henry Precheur

From manlio_perillo at libero.it  Thu Dec  3 21:15:06 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 21:15:06 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1804E2.9070807@doxdesk.com>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>	<4B179937.8070305@libero.it>
	<4B1804E2.9070807@doxdesk.com>
Message-ID: <4B181C4A.7010307@libero.it>

And Clover wrote:
> Manlio Perillo wrote:
> 
>> However what about URI (that is, for PATH_INFO and the like)?
>> For URI (if I remember correctly) the suggested encoding is UTF-8, so
>> URLS should be decoded using
> 
>>   url.decode('utf-8', 'surrogateescape')
> 
>> Is this correct?
> 
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes
> to be trivially extracted. This is consistent with the other headers and
> would be my preferred approach.
> 

There is something that I don't understand.

Some HTTP headers, like Accept-Language, contain data described as
`token`, where:

token          = 1*<any CHAR except CTLs or separators>

So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
In Python 3.x it SHOULD be a byte string.

Text content is described as `TEXT`, where:

The TEXT rule is only used for descriptive field contents and values
that are not intended to be interpreted by the message parser. Words
of *TEXT MAY contain characters from character sets other than ISO-
8859-1 [22] only when encoded according to the rules of RFC 2047
[14].

    TEXT           = <any OCTET except CTLs,
                     but including LWS>


The only type of data where TEXT can be used is `quoted-string`.

A `quoted-string` only appears in well-specified portions of a header.
So, IMHO, it is *not* correct for a WSGI middleware to return all HTTP
headers as Unicode strings.

This is up to the application/framework, which must parse each header,
split it into components and handle each of them as appropriate (as a
byte string, Unicode string or instance of some other data type).


> [...]


Regards   Manlio

From henry at precheur.org  Thu Dec  3 23:02:26 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 3 Dec 2009 14:02:26 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B181C4A.7010307@libero.it>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it>
Message-ID: <20091203220226.GA15382@banane.novuscom.net>

On Thu, Dec 03, 2009 at 09:15:06PM +0100, Manlio Perillo wrote:
> There is something that I don't understand.
> 
> Some HTTP headers, like Accept-Language, contains data described as
> `token`, where:
> 
> token          = 1*<any CHAR except CTLs or separators>
> 
> So a token, IMHO, is an opaque string, and it SHOULD not decoded.
> In Python 3.x it SHOULD be a byte string.

I think this is more an issue that frameworks should deal with. By
decoding every headers value to latin-1:

* It keeps WSGI simple. Simple is good.

* WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
  says. WSGI is about HTTP, but that doesn't necessarily includes all
  other standards extending HTTP.

* It's possible to convert latin-1 strings to bytes without losing data.
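
The last point is easy to check: every one of the 256 byte values
survives a round trip through latin-1. A short sketch:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding as latin-1 maps byte N to code point U+00NN, one to one.
as_text = all_bytes.decode('latin-1')

# Encoding back gives exactly the original bytes.
round_tripped = as_text.encode('latin-1')
```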

-- 
  Henry Precheur

From and-py at doxdesk.com  Fri Dec  4 00:50:27 2009
From: and-py at doxdesk.com (And Clover)
Date: Fri, 04 Dec 2009 00:50:27 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B181C4A.7010307@libero.it>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>	<4B179937.8070305@libero.it>
	<4B1804E2.9070807@doxdesk.com> <4B181C4A.7010307@libero.it>
Message-ID: <4B184EC3.9070804@doxdesk.com>

Manlio Perillo wrote:

> Words of *TEXT MAY contain characters from character sets other than
> ISO-8859-1 [22] only when encoded according to the rules of RFC 2047

Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to 
RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself 
specifically denies that an encoded-word can go in a quoted-string.

RFC2047 encoded-words are not on-topic in an HTTP header(*); this has 
been confirmed by newer development work on HTTPbis by Reschke et al. 
(http://tools.ietf.org/wg/httpbis/).

The "correct" way of escaping header parameters in an RFC*822-family 
protocol would be RFC2231's complex encoding scheme, but HTTP is 
explicitly not an 822-family protocol despite sharing many of the same 
constructs. See 
http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a 
strategy for how 2231 should interact with HTTP, but note that for now 
RFC2231-in-HTTP simply does not exist in any deployed tools.

So for now there is basically nothing useful WSGI can do other than 
provide direct, byte-oriented (even if wrapped in 8859-1 unicode 
strings) access to headers.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/


From manlio_perillo at libero.it  Fri Dec  4 10:17:09 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Fri, 04 Dec 2009 10:17:09 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <20091203220226.GA15382@banane.novuscom.net>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it>
	<20091203220226.GA15382@banane.novuscom.net>
Message-ID: <4B18D395.4080801@libero.it>

Henry Precheur wrote:
> On Thu, Dec 03, 2009 at 09:15:06PM +0100, Manlio Perillo wrote:
>> There is something that I don't understand.
>>
>> Some HTTP headers, like Accept-Language, contains data described as
>> `token`, where:
>>
>> token          = 1*<any CHAR except CTLs or separators>
>>
>> So a token, IMHO, is an opaque string, and it SHOULD not decoded.
>> In Python 3.x it SHOULD be a byte string.
> 
> I think this is more an issue that frameworks should deal with. By
> decoding every headers value to latin-1:
> 
> * It keeps WSGI simple. Simple is good.
> 

It is no simpler than using byte strings, IMHO.
It is not about simplicity; it is convenient because of (if I
understand correctly) how code is converted by 2to3.

> * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
>   says. WSGI is about HTTP, but that doesn't necessarily includes all
>   other standards extending HTTP.
> 

HTTP never says to consider whole headers as latin-1 text, IMHO.

> * It's possible to convert latin-1 strings to bytes without losing data.
> 

Yes, but it is quite stupid to first convert to Unicode and then convert
back to a byte string.

It is true, however, that this does not happen often, but only for:

- WSGI applications that implement an HTTP proxy
- WSGI applications that need to support HTTP Digest Authentication
- WSGI applications that store encoded data in cookies


Regards  Manlio

From manlio_perillo at libero.it  Fri Dec  4 10:46:16 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Fri, 04 Dec 2009 10:46:16 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B184EC3.9070804@doxdesk.com>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>	<4B179937.8070305@libero.it>	<4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it> <4B184EC3.9070804@doxdesk.com>
Message-ID: <4B18DA68.7070702@libero.it>

And Clover wrote:
> Manlio Perillo wrote:
> 
>> Words of *TEXT MAY contain characters from character sets other than
>> ISO-8859-1 [22] only when encoded according to the rules of RFC 2047
> 
> Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to
> RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself
> specifically denies that an encoded-word can go in a quoted-string.
> 
> RFC2047 encoded-words are not on-topic in an HTTP header(*); this has
> been confirmed by newer development work on HTTPbis by Reschke et al.
> (http://tools.ietf.org/wg/httpbis/).
> 

Thanks.
HTTPbis seems to fix all these problems:

"Historically, HTTP has allowed field content with text in the ISO-
8859-1 [ISO-8859-1] character encoding and supported other character
sets only through use of [RFC2047] encoding.  In practice, most HTTP
header field values use only a subset of the US-ASCII character
encoding [USASCII].  Newly defined header fields SHOULD limit their
field values to US-ASCII characters.  Recipients SHOULD treat other
(obs-text) octets in field content as opaque data."


This is the new rule for `quoted-string`:

quoted-string  = DQUOTE *( qdtext / quoted-pair ) DQUOTE
qdtext         = OWS / %x21 / %x23-5B / %x5D-7E / obs-text
               ; OWS / <VCHAR except DQUOTE and "\"> / obs-text
obs-text       = %x80-FF

quoted-pair    = "\" ( WSP / VCHAR / obs-text )
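
As a rough illustration, the grammar above can be expressed as a byte-level
regular expression. This is only a sketch: it assumes OWS can be simplified to
a single SP or HTAB (no obs-fold), and the names QDTEXT, QUOTED_PAIR and
is_quoted_string are made up for this example.

```python
import re

# qdtext: SP / HTAB / %x21 / %x23-5B / %x5D-7E / obs-text (%x80-FF).
# Assumption: OWS is reduced to one SP or HTAB per match.
QDTEXT = rb"[\t \x21\x23-\x5b\x5d-\x7e\x80-\xff]"

# quoted-pair: backslash followed by WSP / VCHAR / obs-text.
QUOTED_PAIR = rb"\\[\t \x21-\x7e\x80-\xff]"

QUOTED_STRING = re.compile(rb'"(?:' + QDTEXT + rb"|" + QUOTED_PAIR + rb')*"')


def is_quoted_string(raw: bytes) -> bool:
    """Return True if raw is exactly one quoted-string (hypothetical helper)."""
    return QUOTED_STRING.fullmatch(raw) is not None
```

Note that the pattern works on bytes, so obs-text octets are accepted without
being interpreted in any character encoding.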


> The "correct" way of escaping header parameters in an RFC*822-family
> protocol would be RFC2231's complex encoding scheme, but HTTP is
> explicitly not an 822-family protocol despite sharing many of the same
> constructs. See
> http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a
> strategy for how 2231 should interact with HTTP, but note that for now
> RFC2231-in-HTTP simply does not exist in any deployed tools.
> 

It seems reasonable.

> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.
> 

Yes, this is what I think.
I have some doubts about wrapping the headers in 8859-1 unicode strings,
but luckily there is surrogateescape.


Regards  Manlio

From henry at precheur.org  Fri Dec  4 19:28:16 2009
From: henry at precheur.org (Henry Precheur)
Date: Fri, 4 Dec 2009 10:28:16 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B18D395.4080801@libero.it>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it>
	<20091203220226.GA15382@banane.novuscom.net>
	<4B18D395.4080801@libero.it>
Message-ID: <20091204182816.GA2311@banane.novuscom.net>

On Fri, Dec 04, 2009 at 10:17:09AM +0100, Manlio Perillo wrote:
> It is just as simple as using byte strings, IMHO.

No, it's not. There were lots of discussions regarding this on the
mailing list. One of the main issues is that the standard library
supports bytes poorly. urllib for example expects strings not bytes.

> > * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
> >   says. WSGI is about HTTP, but that doesn't necessarily include all
> >   other standards extending HTTP.
> > 
> 
> HTTP never says to consider whole headers as latin-1 text, IMHO.

It does:

  When no explicit charset parameter is provided by the sender, media
  subtypes of the "text" type are defined to have a default charset value
  of "ISO-8859-1" when received via HTTP.

  http://tools.ietf.org/html/rfc2616#section-3.7.1

> Yes, but it is quite stupid to first convert to Unicode and then convert
> again to byte string.

99% of the time latin-1 will work. And converting from Unicode to bytes
is not costly.

6 months ago I was a big fan of bytes, but bytes create more problems
than they solve.

-- 
  Henry Prêcheur

From manlio_perillo at libero.it  Fri Dec  4 19:40:55 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Fri, 04 Dec 2009 19:40:55 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <20091204182816.GA2311@banane.novuscom.net>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it>
	<20091203220226.GA15382@banane.novuscom.net>
	<4B18D395.4080801@libero.it>
	<20091204182816.GA2311@banane.novuscom.net>
Message-ID: <4B1957B7.6040800@libero.it>

Henry Precheur ha scritto:
> On Fri, Dec 04, 2009 at 10:17:09AM +0100, Manlio Perillo wrote:
>> It is just as simple as using byte strings, IMHO.
> 
> No, it's not. There were lots of discussions regarding this on the
> mailing list. One of the main issues is that the standard library
> supports bytes poorly. urllib for example expects strings not bytes.
> 

I read last month's discussions 3 days ago!
The quote function supports byte strings, as an example.

What are the functions that do not work with byte strings?

>>> * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
>>>   says. WSGI is about HTTP, but that doesn't necessarily include all
>>>   other standards extending HTTP.
>>>
>> HTTP never says to consider whole headers as latin-1 text, IMHO.
> 
> It does:
> 
>   When no explicit charset parameter is provided by the sender, media
>   subtypes of the "text" type are defined to have a default charset value
>   of "ISO-8859-1" when received via HTTP.
> 
>   http://tools.ietf.org/html/rfc2616#section-3.7.1
> 

This is not correct.

First of all, HTTP never says that whole headers are of type TEXT.
Only specific components are of type TEXT.

Moreover, HTTPbis has finally clarified this; TEXT is no longer used,
and non-ASCII characters are instead to be considered opaque.

Do you really want to define the new WSGI specification to be "against"
the new (possible) HTTP spec?

Of course it will work; but since some code in the standard library
needs to be fixed (the wsgiref.util.application_uri, as an example),
maybe it is better to fix it to work with byte strings.

Just my two cents.

> [...]


Regards  Manlio

From henry at precheur.org  Fri Dec  4 20:50:09 2009
From: henry at precheur.org (Henry Precheur)
Date: Fri, 4 Dec 2009 11:50:09 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1957B7.6040800@libero.it>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it>
	<20091203220226.GA15382@banane.novuscom.net>
	<4B18D395.4080801@libero.it>
	<20091204182816.GA2311@banane.novuscom.net>
	<4B1957B7.6040800@libero.it>
Message-ID: <20091204195009.GA5845@banane.novuscom.net>

On Fri, Dec 04, 2009 at 07:40:55PM +0100, Manlio Perillo wrote:
> What are the functions that do not work with byte strings?

Just to make things clear, I was talking about Python 3.

All the functions I tried not ending with _from_bytes raise an exception
with bytes. This includes urllib.parse.parse_qs & urllib.parse.urlparse
which are rather critical ...
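
For comparison, a small sketch of what happens on later interpreters. It
assumes Python 3.2 or newer, where the urllib.parse functions were changed to
accept ASCII-only bytes and return bytes; on 3.0 and 3.1 these calls fail as
described above. The URL and query string are made-up sample values.

```python
from urllib.parse import parse_qs, urlparse

# Assumption: Python 3.2+, where bytes in means bytes out.
query = parse_qs(b"a=1&b=2&b=3")
parts = urlparse(b"http://example.com/p?x=1")

print(query)        # keys and values are bytes objects
print(parts.path)   # the path component is bytes as well
```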

> First of all, HTTP never says that whole headers are of type TEXT.
> Only specific components are of type TEXT.

If parts of a header contain latin-1 characters, that means its
encoding is latin-1 (at least partially).

> Moreover, HTTPbis has finally clarified this; TEXT is no longer used,
> and non-ASCII characters are instead to be considered opaque.

Yes, but the HTTPbis draft also says:

   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 character encoding.

And WSGI is not about HTTP in a distant future, it's about HTTP right
now.

> Do you really want to define the new WSGI specification to be "against"
> the new (possible) HTTP spec?

I don't know why it would be "against" it. WSGI aims to handle HTTP in
the real world. Releasing the HTTPbis spec won't take all the garbage
out of the web. There will still be latin-1 strings in headers passed
around for the next 10 years.

-- 
  Henry Prêcheur

From manlio_perillo at libero.it  Fri Dec  4 21:09:35 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Fri, 04 Dec 2009 21:09:35 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <20091204195009.GA5845@banane.novuscom.net>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>
	<4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it>
	<20091203220226.GA15382@banane.novuscom.net>
	<4B18D395.4080801@libero.it>
	<20091204182816.GA2311@banane.novuscom.net>
	<4B1957B7.6040800@libero.it>
	<20091204195009.GA5845@banane.novuscom.net>
Message-ID: <4B196C7F.6080902@libero.it>

Henry Precheur ha scritto:
> On Fri, Dec 04, 2009 at 07:40:55PM +0100, Manlio Perillo wrote:
>> What are the functions that do not work with byte strings?
> 
> Just to make things clear, I was talking about Python 3.
> 

I know.

Unfortunately I don't have Python 3 installed; I'm just reading the code.

> All the functions I tried not ending with _from_bytes raise an exception
> with bytes. This includes urllib.parse.parse_qs & urllib.parse.urlparse
> which are rather critical ...
> 

Ah, ok.
Can you show me the traceback of parse_qs? Thanks.


>> First of all, HTTP never says that whole headers are of type TEXT.
>> Only specific components are of type TEXT.
> 
> If parts of a header contain latin-1 characters, that means its
> encoding is latin-1 (at least partially).
> 

This is not completely true.

> [...]

> And WSGI is not about HTTP in a distant future, it's about HTTP right
> now.
> 
>> Do you really want to define the new WSGI specification to be "against"
>> the new (possible) HTTP spec?
> 
> I don't know why it would be "against" it.

Well, I have quoted it for this reason.
What I mean is that, IMHO:

- Using Unicode strings in WSGI is an abuse of Unicode strings
- This abuse is not justified by the HTTP spec


> [...]


Regards  Manlio

From manlio_perillo at libero.it  Sun Dec  6 14:43:43 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Sun, 06 Dec 2009 14:43:43 +0100
Subject: [Web-SIG] CGI WSGI and Unicode
Message-ID: <4B1BB50F.1040801@libero.it>

Hi.

I'm playing with Python 3.x, current revision.

I have noted that the data in os.environ are now Unicode strings.

In a CGI application, HTTP headers are Unicode strings, and are decoded
using system default encoding.
In a future WSGI application, HTTP headers are Unicode strings, and are
decoded using latin-1 encoding.

In both cases, 'surrogateescape' is used.

Can this cause troubles and incompatibility problems?
I'm interested in special header handling, like cookies, that contain
opaque data.


Thanks  Manlio

From manlio_perillo at libero.it  Mon Dec  7 11:51:31 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Mon, 07 Dec 2009 11:51:31 +0100
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <88e286470912061736o7c1ab6b2v13aad4bc935bfb3d@mail.gmail.com>
References: <4B1BB50F.1040801@libero.it>
	<88e286470912061736o7c1ab6b2v13aad4bc935bfb3d@mail.gmail.com>
Message-ID: <4B1CDE33.3040805@libero.it>

Graham Dumpleton ha scritto:

Note: I'm sending the entire message to the mailing list.

> 2009/12/7 Manlio Perillo <manlio_perillo at libero.it>:
>> Hi.
>>
>> I'm playing with Python 3.x, current revision.
>>
>> I have noted that the data in os.environ are now Unicode strings.
>>
>> In a CGI application, HTTP headers are Unicode strings, and are decoded
>> using system default encoding.
>> In a future WSGI application, HTTP headers are Unicode strings, and are
>> decoded using latin-1 encoding.
>>
>> In both cases, 'surrogateescape' is used.
> 
> No, 'surrogateescape' is not necessary when using latin-1, or at least
> for variables which use latin-1.
> 

The problem is that not all browsers use latin-1.
HTTP Digest authentication is one example.

> Use of 'surrogateescape' is only relevant in the context of some web
> servers and only relevant for specific variables, some of which aren't
> even part of set of variables which are required by WSGI.
> 
> For example, in Apache/mod_wsgi, 'surrogateescape' is used on
> DOCUMENT_ROOT and SCRIPT_FILENAME. 

What about HTTP_COOKIE?

> [...] 
>> Can this cause troubles and incompatibility problems?
>> I'm interested in special header handling, like cookies, that contain
>> opaque data.
> 
> The issues with the CGI/WSGI bridge in Python 3.X have been discussed
> previously on the list. 

It seems I missed it.

> It is acknowledged that there are problems to
> be solved there, at least to the extent that the CGI/WSGI bridge
> implementation has to correct the encoding, and also that that may
> only be solvable in Python 3.1 onwards due to not having access to
> what encoding was used for environment variables in Python 3.0. Not
> many people care about CGI these days and so no one has bothered to
> come up with a working CGI/WSGI bridge for Python 3.X.
> 

CGI is very important; there are some kinds of web applications that have
problems when executing in a long running process.

As an example, I prefer to run Trac and Mercurial instances as CGI.

> Graham


Regards  Manlio

From mborch at gmail.com  Mon Dec  7 12:19:04 2009
From: mborch at gmail.com (Malthe Borch)
Date: Mon, 07 Dec 2009 12:19:04 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B184EC3.9070804@doxdesk.com>
References: <EE57FCC1-02B2-4200-BAD7-9736C4A1343D@fuhm.net>	<4B179937.8070305@libero.it>	<4B1804E2.9070807@doxdesk.com>
	<4B181C4A.7010307@libero.it> <4B184EC3.9070804@doxdesk.com>
Message-ID: <4B1CE4A8.9040104@gmail.com>

On 12/4/09 12:50 AM, And Clover wrote:
> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.

You could argue that this is perhaps a good reason to replace 
``environ`` with something that interprets the headers according to how 
HTTP is actually used in the real world.

It may be that WSGI should use bytes everywhere and the recommended 
usage would be via a decorator (which could cache computations on the 
environ dictionary):

e.g. the raw application handler versus one decorated with an imaginary 
``webob`` function.

   def app(environ, start_response):
       ...

   @webob
   def app(request):
       ...

It is often said that WSGI should be practical, but in actual usage, I 
think most developers use a request/response abstraction layer.

Middlewares are usually shrink-wrapped library code that could handle a 
bytes-based environ dict (they'd have to explicitly decode the headers 
of interest).
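
A minimal sketch of how such a decorator could look. Everything here is
hypothetical: the all-bytes environ, the Request class, the request_app name
(used instead of ``webob`` to avoid confusion with the real WebOb library),
and the bytes status and header values are illustration only, not part of any
WSGI draft.

```python
class Request:
    """Hypothetical wrapper that decodes headers from a bytes environ on demand."""

    def __init__(self, environ):
        self.environ = environ
        self._cache = {}

    def header(self, name, encoding="latin-1"):
        """Decode one header explicitly and cache the result."""
        key = (name, encoding)
        if key not in self._cache:
            env_key = b"HTTP_" + name.upper().replace("-", "_").encode("ascii")
            self._cache[key] = self.environ.get(env_key, b"").decode(encoding)
        return self._cache[key]


def request_app(func):
    """Adapt func(request) -> (status, headers, body) to the raw calling style."""
    def wsgi_app(environ, start_response):
        status, headers, body = func(Request(environ))
        start_response(status, headers)
        return [body]
    return wsgi_app


@request_app
def app(request):
    agent = request.header("User-Agent")
    return b"200 OK", [(b"Content-Type", b"text/plain")], agent.encode("utf-8")
```

The application only sees decoded text for the headers it asks for, while
middleware could keep working on the raw bytes in ``environ``.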

\malthe


From graham.dumpleton at gmail.com  Mon Dec  7 12:19:42 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Mon, 7 Dec 2009 22:19:42 +1100
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <4B1CDE33.3040805@libero.it>
References: <4B1BB50F.1040801@libero.it>
	<88e286470912061736o7c1ab6b2v13aad4bc935bfb3d@mail.gmail.com>
	<4B1CDE33.3040805@libero.it>
Message-ID: <88e286470912070319t4a9a5a4p4d765667eef312fe@mail.gmail.com>

2009/12/7 Manlio Perillo <manlio_perillo at libero.it>:
> Graham Dumpleton ha scritto:
>
> Note: I'm sending the entire message to the mailing list.
>
>> 2009/12/7 Manlio Perillo <manlio_perillo at libero.it>:
>>> Hi.
>>>
>>> I'm playing with Python 3.x, current revision.
>>>
>>> I have noted that the data in os.environ are now Unicode strings.
>>>
>>> In a CGI application, HTTP headers are Unicode strings, and are decoded
>>> using system default encoding.
>>> In a future WSGI application, HTTP headers are Unicode strings, and are
>>> decoded using latin-1 encoding.
>>>
>>> In both cases, 'surrogateescape' is used.
>>
>> No, 'surrogateescape' is not necessary when using latin-1, or at least
>> for variables which use latin-1.
>>
>
> The problem is that not all browsers use latin-1.
> HTTP Digest authentication is one example.

You seem to miss one important point. When converting bytes to unicode
as latin-1, the surrogate escape mechanism never comes into play. This
is because all byte values can be represented in latin-1, due to it being
a single byte encoding which preserves the original bytes intact.
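
This can be checked directly. The sketch below decodes every possible byte
value as latin-1 and encodes it back, then shows for contrast what
surrogateescape does when UTF-8 cannot decode a byte. The variable names and
sample bytes are chosen for this example only.

```python
# Every byte value maps to exactly one code point in latin-1,
# so decoding can never fail and no error handler is ever invoked.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip = as_text.encode("latin-1")

# For contrast: a lone 0xE9 byte is not valid UTF-8, so surrogateescape
# smuggles it through as the lone surrogate U+DCE9.
escaped = b"caf\xe9".decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```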

>> Use of 'surrogateescape' is only relevant in the context of some web
>> servers and only relevant for specific variables, some of which aren't
>> even part of set of variables which are required by WSGI.
>>
>> For example, in Apache/mod_wsgi, 'surrogateescape' is used on
>> DOCUMENT_ROOT and SCRIPT_FILENAME.
>
> What about HTTP_COOKIE?

You trimmed part of my response which is very important. For
DOCUMENT_ROOT and SCRIPT_FILENAME they must be dealt with per the
filesystem encoding and not latin-1. If you don't, the result may not
be compatible with input to file system routines in Python 3.1 which
sort of expect file system encoding plus surrogate escape.

As I say though, those variables aren't relevant to most WSGI hosting
mechanisms, and even for those where the web server provides them,
nearly all WSGI applications will not care about them. Apache/mod_wsgi
worries about them because it provides features which allow one to
define Apache style handlers based on file type, where the handler for
the arbitrary file type is implemented as a WSGI application. In that
case the file the URL mapped to, i.e., SCRIPT_FILENAME, is an arbitrary
file and not a WSGI script file.

In the case of HTTP_COOKIE, as far as the WSGI adapter goes it just
converts it to unicode as per latin-1. So, it is washing its hands of
what to do with it, because it cannot know and only the WSGI application
can. Because it is latin-1, no surrogate escape is involved. In the WSGI
application, where it is known what encoding may be used, the WSGI
application can convert back to bytes and then decode with a different
encoding, using surrogate escape if it wants to ensure no outright error
when the bytes are not valid in that alternate encoding.
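
A minimal sketch of that application-side step, assuming the application
expects UTF-8 cookie values. The function name recode and the sample cookie
are hypothetical.

```python
def recode(value, encoding="utf-8"):
    """Reinterpret a latin-1-decoded WSGI string in another encoding."""
    raw = value.encode("latin-1")  # recover the bytes sent by the client
    # surrogateescape avoids an outright error for bytes invalid in `encoding`
    return raw.decode(encoding, "surrogateescape")


# What a WSGI adapter would hand over for the UTF-8 bytes of "name=café".
environ_value = b"name=caf\xc3\xa9".decode("latin-1")
cookie = recode(environ_value)
```

Bytes that are not valid in the target encoding come back as lone surrogates
rather than raising UnicodeDecodeError.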

>> [...]
>>> Can this cause troubles and incompatibility problems?
>>> I'm interested in special header handling, like cookies, that contain
>>> opaque data.
>>
>> The issues with the CGI/WSGI bridge in Python 3.X have been discussed
>> previously on the list.
>
> It seems I missed it.
>
>> It is acknowledged that there are problems to
>> be solved there, at least to the extent that the CGI/WSGI bridge
>> implementation has to correct the encoding, and also that that may
>> only be solvable in Python 3.1 onwards due to not having access to
>> what encoding was used for environment variables in Python 3.0. Not
>> many people care about CGI these days and so no one has bothered to
>> come up with a working CGI/WSGI bridge for Python 3.X.
>>
>
> CGI is very important; there are some kinds of web applications that have
> problems when executing in a long running process.
>
> As an example, I prefer to run Trac and Mercurial instances as CGI.

Yes I agree that there are some valid uses of CGI/WSGI bridge although
those two aren't the ones I would have in mind.

For the record, CGI/WSGI adapters should also protect the original
stdin/stdout so WSGI application doesn't cause problems by using
'print' or do other odd stuff with input. I haven't seen a single
CGI/WSGI adapter which does it in a way that I would say is correct,
or at least robust against users doing stupid things, so encoding
issues aren't the only thing where CGI/WSGI adapters need work.
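
A rough sketch of the kind of protection meant here, under the assumption that
swapping sys.stdin and sys.stdout is enough. It does not guard the underlying
file descriptors 0 and 1, so it is not claimed to meet the robustness bar
described above. The name guarded_stdio is made up for this example.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def guarded_stdio():
    """Keep the real CGI streams private while application code runs."""
    real_in, real_out = sys.stdin, sys.stdout
    sys.stdin = io.StringIO()   # stray reads see an empty stream
    sys.stdout = sys.stderr     # stray print() calls go to the error log
    try:
        # Only the adapter gets the real request/response streams.
        yield real_in, real_out
    finally:
        sys.stdin, sys.stdout = real_in, real_out
```

An adapter would run the WSGI application inside the ``with`` block and write
the response only through the stream it was handed.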

Graham

From arw1961 at yahoo.com  Mon Dec  7 21:23:18 2009
From: arw1961 at yahoo.com (Aaron Watters)
Date: Mon, 7 Dec 2009 12:23:18 -0800 (PST)
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <88e286470912070319t4a9a5a4p4d765667eef312fe@mail.gmail.com>
Message-ID: <106806.11256.qm@web32008.mail.mud.yahoo.com>


--- On Mon, 12/7/09, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:

> For the record, CGI/WSGI adapters should also protect the
> original
> stdin/stdout so WSGI application doesn't cause problems by
> using
> 'print' or do other odd stuff with input. I haven't seen a
> single
> CGI/WSGI adapter which does it in a way that I would say is
> correct,
> or at least robust against users doing stupid things...

"There is no fool proof software: fools are too clever"
"Doctor, it hurts when I do this."  "Don't do that."

Some words of wisdom from folklore... (or if anyone knows
the correct attribution, please inform).
   -- Aaron Watters
      http://listtree.appspot.com
      http://whiffdoc.appspot.com

===
an apple every 8 hours
will keep 3 doctors away.  -- kliban


From and-py at doxdesk.com  Tue Dec  8 16:27:41 2009
From: and-py at doxdesk.com (And Clover)
Date: Tue, 08 Dec 2009 16:27:41 +0100
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <4B1BB50F.1040801@libero.it>
References: <4B1BB50F.1040801@libero.it>
Message-ID: <4B1E706D.3000502@doxdesk.com>

Manlio Perillo wrote:

> In a CGI application, HTTP headers are Unicode strings, and are decoded
> using system default encoding.

> In a future WSGI application, HTTP headers are Unicode strings, and are
> decoded using latin-1 encoding.

Yes. As proposed, WSGI 1.1 would require a CGI-to-WSGI handler to undo the 
decode stage caused by reading environ using the default encoding. At 
least this is now reliably possible thanks to surrogateescape.
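
A minimal sketch of that undo step. It assumes the interpreter decoded the
environment with the filesystem encoding plus surrogateescape (as on POSIX
from Python 3.1 onwards); the function name cgi_to_wsgi is hypothetical.

```python
import sys


def cgi_to_wsgi(value, encoding=None):
    """Re-decode a string read from os.environ as latin-1."""
    if encoding is None:
        # Assumption: this is the encoding Python used for os.environ.
        encoding = sys.getfilesystemencoding()
    raw = value.encode(encoding, "surrogateescape")  # the original bytes
    return raw.decode("latin-1")


# A path sent as the latin-1 bytes b"/caf\xe9", as a UTF-8 locale would
# have decoded it into os.environ:
seen_by_python = b"/caf\xe9".decode("utf-8", "surrogateescape")
path_info = cgi_to_wsgi(seen_by_python, "utf-8")
```

Because surrogateescape round-trips undecodable bytes, the original byte
sequence is recovered exactly before the latin-1 decode.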

PATH_INFO is the only really important HTTP-related environment variable 
for Unicode. Potentially SCRIPT_NAME could also be significant in 
relation to PATH_INFO. The HTTP headers don't massively matter because 
there are almost never any non-ASCII characters in them.

Previously the job of undoing an unwanted decode step was dumped on 
whatever read the PATH_INFO; usually a routing component, which would 
have to make guesses with typically poor results. The CGI adapter is in 
a much better place to do it, being closer to the server.

 > The problem is that not all browsers use latin-1.

Not WSGI's problem. WSGI will deliver bytes encoded into Unicode 
strings, not ready-to-use Unicode strings. It is up to the application 
to decide how they want to handle those bytes; maybe they want Latin-1 
and can do nothing, maybe they want to recode to UTF-8, maybe something 
else completely. No solution satisfies every app so there is always 
going to have to be a recode step somewhere.

An application that doesn't want to think about this will use a 
framework that does it for them.

 > What about HTTP_COOKIE?

For what it's worth, the choice of Latin-1 here results in the 'right' 
Unicode string for more browsers than any other potential encoding.

In any case as previously discussed, non-ASCII cookies are already 
totally broken everywhere and hence used by no-one.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/


From lavendula6654 at gmail.com  Fri Dec 11 17:54:08 2009
From: lavendula6654 at gmail.com (Elaine Haight)
Date: Fri, 11 Dec 2009 08:54:08 -0800
Subject: [Web-SIG] Software Development Courses
Message-ID: <3652e3600912110854w4b010078l53a493b31ce22d2@mail.gmail.com>

Foothill College is offering two courses of interest to web application
software developers: Ajax and Python. These 11-week courses are held from
January through March. The Ajax class is entirely online, and the Python
class meets Thursday evenings at the Middlefield campus in Palo Alto.

"Application Software Development with Ajax" is a course designed for
students who are already familiar with some type of programming, and have
introductory knowledge of JavaScript and HTML. For more information, go to:
http://www.foothill.edu/schedule/schedule.php
and choose Department: "COIN", quarter: "Winter 2010", and course number
"71".

"Introduction to Python Programming" meets Thursday evenings and is also
designed for students who are familiar with some type of programming. The
instructor is Marilyn Davis. For more information or to register, go to:

http://www.foothill.edu/schedule/schedule.php
and choose Department: "CIS", quarter: "Winter 2010", and course number
"68K".

If you would like to sign up for a class, please register beforehand by
going to:
http://www.foothill.fhda.edu/reg/index.php
If you do not register ahead of time, the class you want may be cancelled!

If you have questions, you can contact:
h a i g h t E l a i n e AT f o o t h i l l . e d u

From orsenthil at gmail.com  Sun Dec 20 19:08:19 2009
From: orsenthil at gmail.com (Senthil Kumaran)
Date: Sun, 20 Dec 2009 23:38:19 +0530
Subject: [Web-SIG] [RFC] urllib2 requests history + HEAD support
Message-ID: <20091220180819.GB4385@ubuntu.ubuntu-domain>

I need your opinion on this request. 
<http://bugs.python.org/issue1673007>

Python Standard Library module urllib2 has support for GET and POST.
There was a feature request to add support for HEAD requests.

While that is a valid feature request, there was a suggestion to include a
history of the requests in the module.  I don't find any references in
the RFCs for any such requirement to maintain a history of requests.

Do you have any opinion on whether it is a good idea to have a history
of requests in the urllib2 module? I personally feel that a history of
requests can be more easily tracked by the clients.

-- 
Senthil


On Sun, Dec 20, 2009 at 05:59:48PM +0000, Senthil Kumaran wrote:
> 
> Senthil Kumaran <orsenthil at gmail.com> added the comment:
> 
> Having a HEAD request for urllib2 might be a good idea. I shall use this
> patch to add the functionality.
> 
> But, having a history support in the urllib2 module is not a good idea
> IMO. It is best left to the clients which might use urllib2.
> 
> ----------
> 
> _______________________________________
> Python tracker <report at bugs.python.org>
> <http://bugs.python.org/issue1673007>
> _______________________________________

-- 
Senthil
Shannon's Observation:
	Nothing is so frustrating as a bad situation that is beginning to
	improve.

From henry at precheur.org  Mon Dec 21 19:24:38 2009
From: henry at precheur.org (Henry Precheur)
Date: Mon, 21 Dec 2009 18:24:38 +0000
Subject: [Web-SIG] [RFC] urllib2 requests history + HEAD support
In-Reply-To: <20091220180819.GB4385@ubuntu.ubuntu-domain>
References: <20091220180819.GB4385@ubuntu.ubuntu-domain>
Message-ID: <20091221182438.GA27899@li60-23.members.linode.com>

On Sun, Dec 20, 2009 at 11:38:19PM +0530, Senthil Kumaran wrote:
> I need your opinion on this request. 
> <http://bugs.python.org/issue1673007>
> 
> Python Standard Library module urllib2 has support for GET and POST.
> There was a feature request to add support for HEAD requests.

It would be nice to have other methods too, like PUT & DELETE:

  http://tools.ietf.org/html/rfc2616#page-52

> While that is a valid feature request, there was a suggestion to include a
> history of the requests in the module.  I don't find any references in
> the RFCs for any such requirement to maintain a history of requests.
> 
> Do you have any opinion on whether it is a good idea to have a history
> of requests in the urllib2 module? I personally feel that a history of
> requests can be more easily tracked by the clients.

This should be done by the client.

-- 
  Henry Prêcheur

From orsenthil at gmail.com  Tue Dec 22 01:43:32 2009
From: orsenthil at gmail.com (Senthil Kumaran)
Date: Tue, 22 Dec 2009 06:13:32 +0530
Subject: [Web-SIG] [RFC] urllib2 requests history + HEAD support
In-Reply-To: <20091221182438.GA27899@li60-23.members.linode.com>
References: <20091220180819.GB4385@ubuntu.ubuntu-domain>
	<20091221182438.GA27899@li60-23.members.linode.com>
Message-ID: <20091222004331.GA5669@ubuntu.ubuntu-domain>

On Mon, Dec 21, 2009 at 06:24:38PM +0000, Henry Precheur wrote:
> On Sun, Dec 20, 2009 at 11:38:19PM +0530, Senthil Kumaran wrote:
> > There was a feature request to add support for HEAD requests.
> 
> It would be nice to have other methods too, like PUT & DELETE:
> 
>   http://tools.ietf.org/html/rfc2616#page-52

Yes, I agree. Methods like PUT & DELETE also make sense in urllib2.
Folks currently wrap those around httplib. 
HEAD can be implemented in a straightforward way using httplib, but if
Request has a method parameter, which takes HEAD, PUT or DELETE and
behaves accordingly, that would make it complete.
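
For illustration, this is roughly what such a parameter looks like in use.
The sketch assumes Python 3.3 or later, where urllib.request (the Python 3
successor of urllib2) accepts a method argument on Request. The URLs are
placeholders and no network request is made.

```python
from urllib.request import Request

# Assumption: Python 3.3+, where Request takes an explicit method.
head = Request("http://example.com/", method="HEAD")
put = Request("http://example.com/item", data=b"payload", method="PUT")

# Without method, the verb is still inferred from the presence of data.
default_get = Request("http://example.com/")
default_post = Request("http://example.com/", data=b"payload")

for req in (head, put, default_get, default_post):
    print(req.get_method(), req.full_url)
```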


And as expected, many voted down -1 on history support.  ( I guess
web-sig defaults the reply-to To: rather than List:)

Thanks,

-- 
Senthil
"I have a bone to pick, and a few to break."
		-- Anonymous

From tseaver at palladion.com  Sun Dec 27 15:26:14 2009
From: tseaver at palladion.com (Tres Seaver)
Date: Sun, 27 Dec 2009 09:26:14 -0500
Subject: [Web-SIG] Future of WSGI
In-Reply-To: <c66fb64e0911241408y66b51dbdof2e9e77482d060b9@mail.gmail.com>
References: <4B0BA030.5010201@gmail.com>	<b654cd2e0911240944vd976902g67ddf51dd309dca3@mail.gmail.com>	<4B0C4FFF.5070305@gmail.com>	<b654cd2e0911241335p601063e5v917d05779e9d8ad5@mail.gmail.com>	<c66fb64e0911241340o5306a60aw9f49a6a035b109e0@mail.gmail.com>	<b654cd2e0911241343m286af8bbi3704bd0d0da10cf8@mail.gmail.com>	<c66fb64e0911241350r1459b6aekca59d63377ce80e7@mail.gmail.com>	<b654cd2e0911241351h587423aw900d60c4e5e882ee@mail.gmail.com>
	<c66fb64e0911241408y66b51dbdof2e9e77482d060b9@mail.gmail.com>
Message-ID: <hh7qq4$2nu$1@ger.gmane.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Malthe Borch wrote:
> 
> 2009/11/24 Ian Bicking <ianb at colorstudy.com>:
>> Why does this matter?
> 
> It's all convention, but the CGI interpretation was to read the HTTP
> request line by line until a blank line came and that was the
> environment. Everything after that is the body.

"Headers", not environment:  the CGI environment is literally the
os.environ set up by the CGI parent process before forking and execing
the script.


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAks3boYACgkQ+gerLs4ltQ5coACg0ijXgG1wy1BdNnPzN2Jm2FLG
1R0Anj0/o6zwjtatFERoQ2HS3BOgyVEA
=RhAH
-----END PGP SIGNATURE-----