From and-py at doxdesk.com Tue Dec 1 01:44:48 2009 From: and-py at doxdesk.com (And Clover) Date: Tue, 01 Dec 2009 01:44:48 +0100 Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec In-Reply-To: <88e286470911282028o5849c853od3b8239cc59f8d00@mail.gmail.com> References: <88e286470911271327p24dc978at5ee46e3ad1c99220@mail.gmail.com> <88e286470911281944s1a926ccaq600682e8aa573912@mail.gmail.com> <88e286470911282028o5849c853od3b8239cc59f8d00@mail.gmail.com> Message-ID: <4B146700.3010608@doxdesk.com> Graham Dumpleton wrote: > Answering my own question, it is actually obvious that it has to be > called (1, 0). This is because wsgiref in Python 3.X already calls it > (1, 0) and don't have much choice to be in agreement with that. wsgiref.simple_server in Python 3 to date is not something that anyone should worry about being compatible with. It is a 2to3 hack that cannot meaningfully claim to represent wsgi version anything. Careless use of urllib.parse.unquote causes 3.0's simple_server not to work at all, and 3.1's to mangle the path by treating it as UTF-8 instead of ISO-8859-1, as 'WSGI 1.1' proposed and mod_wsgi (and even mod_cgi via wsgiref.CGIHandler) delivered. Yes, I'm always going on about Unicode paths. I'm fed up of shipping apps with a page-long deployment note about fixing them. It pains me that in so many years both this and "What do we do about Python 3?" still haven't been addressed. mod_wsgi 3.0 already has more traction than wsgiref 3.1 and I would prefer not to see more farcical reverse-progress at this point. For what it's worth my responses on the issues of this thread. But at this point I really just want a BDFL to just come and do it, whatever it is. A new WSGI, whatever the version number, is massively overdue. >> 1. The 'readline()' function of 'wsgi.input' may optionally take a size hint. Yes. Obviously. Bad practice but unavoidable now. Should have been a 1.0 amendment a long time ago. >> 2. 
The 'wsgi.input' must provide an empty string as end of input stream marker. >> 3. The size argument to 'read()' function of 'wsgi.input' would be optional and if not supplied the function would return all available request content. >> 4. The 'wsgi.file_wrapper' supplied by the WSGI adapter must honour the Content-Length response header and must only return from the file that amount of content. +0. Seems reasonable but don't massively care. Presumably an application must refuse to run on 1.0 if it requires these behaviours? >> 5. Any WSGI application or middleware should not return more data than specified by the Content-Length response header if defined. >> 6. The WSGI adapter must not pass on to the server any data above what the Content-Length response header defines if supplied. Yes. -- And Clover mailto:and at doxdesk.com http://www.doxdesk.com/ From foom at fuhm.net Tue Dec 1 02:41:39 2009 From: foom at fuhm.net (James Y Knight) Date: Mon, 30 Nov 2009 20:41:39 -0500 Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec In-Reply-To: <01CC6582-B272-4CC8-B87A-95B683965FB2@fuhm.net> References: <88e286470911271327p24dc978at5ee46e3ad1c99220@mail.gmail.com> <88e286470911281944s1a926ccaq600682e8aa573912@mail.gmail.com> <01CC6582-B272-4CC8-B87A-95B683965FB2@fuhm.net> Message-ID: On Nov 29, 2009, at 12:40 AM, James Y Knight wrote: > The next step here is clearly for someone to redraft the changes as a diff against PEP 333. If you do not have any interest in being that person, please make that clear, so someone else can step up to do so. Okay, not sensing any other volunteers here...I guess it's all me. The intention of this spec update is to be compatible with existing middleware/applications when running on Python 2.X. Apps/middleware running on python 3.X require changes in any case, and this specification will tell them exactly what to expect. 
That Python 3.X middleware and WSGI adapters will have to deal with both bytestrings and unicode strings in many parts of the API (output status code, output headers, output response iterable/write callback) will add some complexity, but that's life. Any WSGI implementations on Python 3.X claiming compliance to WSGI 1.0 are most likely broken, and its behavior cannot be relied upon. Too bad about wsgiref. As self-appointed author, I am going to take a stand and say that both the python3-related string-type specifications, and the additional requirements except #3 (read() with no-args) and #4 (file_wrapper looking at Content-Length), will be included. And it will be called WSGI 1.1. Back to the list of "extra requirements": #1: (readline with an arg) must be included, despite the potential for breakage. That ship has already sailed, the breakage has already occurred, it's already required. Disagreement here really is of no consequence. #2: (wsgi.input() must return EOF at EOF): I do not believe will break any middleware. It will require some changes in some WSGI adapter implementations, but that's acceptable. If you have a real-life example of middleware that would break here, show it. So this will be included. #3 is not actually required for anything; at best it's an extra convenience; repeatedly reading until EOF will work just as well. Furthermore, the API change has the potential to break some middleware in Python 2.X, so I'll take the safe road and not make the change. The purpose behind #4 is essentially included in #6, and so is not needed as a separate requirement. #5 and #6 are uncontroversial and of no impact to an already-correct implementation. They will be included. I'll send a diff of the actual wording changes once I've written it. 
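The claim under #3, that repeatedly reading until EOF works just as well as a no-argument read(), can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `read_all` and the chunk sizes are made up, and io.BytesIO stands in for a real wsgi.input that honours requirement #2 (an empty bytestring at end of input).

```python
import io

def read_all(stream, chunk_size=8192):
    """Drain a wsgi.input-like stream by calling read(size) until it returns b''."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytestring is the end-of-input marker (requirement #2)
            break
        chunks.append(chunk)
    return b"".join(chunks)

# io.BytesIO is only a stand-in for environ['wsgi.input'] in this sketch.
body = read_all(io.BytesIO(b"name=value&other=thing"), chunk_size=4)
empty = read_all(io.BytesIO(b""))
```

On a server that does not signal EOF this way, the loop would instead have to be bounded by CONTENT_LENGTH.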
James

From manlio_perillo at libero.it Thu Dec 3 11:55:51 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 11:55:51 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To:
References:
Message-ID: <4B179937.8070305@libero.it>

James Y Knight wrote:
> I move to bless mod_wsgi's definition of WSGI 1.1 [1]
> [...]
>
> [1] http://code.google.com/p/modwsgi/wiki/SupportForPython3X

Hi.

Just a few questions.

It is true that HTTP headers can be encoded assuming latin-1; and they
can be encoded using PEP 383.

However what about URI (that is, for PATH_INFO and the like)?
For URI (if I remember correctly) the suggested encoding is UTF-8, so
URLs should be decoded using

    url.decode('utf-8', 'surrogateescape')

Is this correct?

Now another question.
Let's consider the `wsgiref.util.application_uri` function

    def application_uri(environ):
        url = environ['wsgi.url_scheme']+'://'
        from urllib.parse import quote

        if environ.get('HTTP_HOST'):
            url += environ['HTTP_HOST']
        else:
            url += environ['SERVER_NAME']

            if environ['wsgi.url_scheme'] == 'https':
                if environ['SERVER_PORT'] != '443':
                    url += ':' + environ['SERVER_PORT']
            else:
                if environ['SERVER_PORT'] != '80':
                    url += ':' + environ['SERVER_PORT']

        url += quote(environ.get('SCRIPT_NAME') or '/')
        return url

There is a potential problem, here, with the quote function.

This function does the following:

    def quote(string, safe='/', encoding=None, errors=None):
        if isinstance(string, str):
            if encoding is None:
                encoding = 'utf-8'
            if errors is None:
                errors = 'strict'
            string = string.encode(encoding, errors)

This means that if we use surrogateescape, the information about the
original bytes is lost here.

This can be easily fixed by changing the application_uri function, but
this also means that a WSGI application will not work with Python 3.1.x.

Finally, a question about cookies.
Cookie data SHOULD be transparent to the server/gateway; however WSGI is
going to assume that data is encoded in latin-1.

I don't know what the HTTP/Cookie spec says about this.
However, from a WSGI application point of view, the cookie data can, as
an example, contain some text encoded in UTF-8; this means that the
application must first encode the data:

    cookie_bytes = cookie.encode('latin-1', 'surrogateescape')

and then decode it using UTF-8:

    my_cookie_data = cookie_bytes.decode('utf-8')

This is a bit unreasonable, but I don't know if this is a common
practice (I do this myself, to give one example).


Manlio Perillo

From manlio_perillo at libero.it Thu Dec 3 15:49:08 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 15:49:08 +0100
Subject: [Web-SIG] HTTP headers encoding
Message-ID: <4B17CFE4.3020504@libero.it>

Hi.

I'm doing some tests to try to understand how HTTP headers are encoded
by browsers.

I have written a simple WSGI application that asks for authentication
credentials, then prints them on the terminal and returns the data as
the response, as raw bytes
http://paste.pocoo.org/show/154633/

Then I used some browsers to try to send a username with non-ASCII
characters.

When I try with simple characters in the iso-8859-1 charset, things
work well; the data is encoded using this charset.

However when I try to use some extraneous character, like Euro, there
are problems.

Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a '\xac'.
I don't know where \xac comes from, but it is the last byte in the
utf-8 encoded Euro: '\xe2\x82\xac'

Internet Explorer 6.0 sends me a '\x80', and this is the Euro character
encoded using cp1252 (and I suspect that it always uses this encoding,
instead of iso-8859-1).

Unfortunately I can not test with IE 7 and 8.

With a browser working on a terminal, like lynx, things get worse.
If I enter as user name the string "àè", lynx sends me
'\xc3\xa0\xc3\xa8'

This happens in a GNOME terminal, with an it_IT.utf8 locale.
wget and curl do the same.

Can someone else reproduce this?


Thanks Manlio

From manlio_perillo at libero.it Thu Dec 3 17:09:31 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 17:09:31 +0100
Subject: [Web-SIG] HTTP headers encoding
In-Reply-To: <4B17CFE4.3020504@libero.it>
References: <4B17CFE4.3020504@libero.it>
Message-ID: <4B17E2BB.9040806@libero.it>

Manlio Perillo wrote:
> Hi.
>
> I'm doing some tests to try to understand how HTTP headers are encoded
> by browsers.
>
> I have written a simple WSGI application that asks authentication
> credentials and then print them on the terminal and return the data as
> response, as raw bytes
> http://paste.pocoo.org/show/154633/
>

I'm now testing using HTTP Digest Authentication.

The application is here:
http://paste.pocoo.org/show/154667/

It uses my wsgix framework
http://hg.mperillo.ath.cx/wsgix/
since I don't want to rewrite the entire Digest Authentication handling.

As user name I use the string "???".

The results are:
- Firefox does not send any request, and instead it shows me the
  returned response body "Authentication required".
  This is quite strange.
- Internet Explorer 6 encodes the username using cp1252, as always.
- Opera (10.01) encodes the username using utf-8

I can not test with Konqueror, since the wsgiref server has problems
with it.

All these implementations are against the HTTP spec.
username is a quoted string, and so it SHOULD be encoded using the
default latin-1, or another charset, and in that case it should be
formatted as specified by MIME (unfortunately there are no examples in
the HTTP spec).

This is really a mess.
How is authorization username handled in common WSGI frameworks?
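For reference, the byte sequences reported in these tests can be reproduced in Python 3 with a short sketch. The variable names are illustrative only, and the "low byte" line is just one possible explanation of where Firefox's '\xac' might come from, not a statement about its implementation.

```python
euro = "\u20ac"  # EURO SIGN

utf8_bytes = euro.encode("utf-8")      # the three-byte UTF-8 form
cp1252_bytes = euro.encode("cp1252")   # the single byte IE 6 was seen to send

# Keep only the low 8 bits of each code point (a guess at the Firefox behaviour).
low_bytes = bytes(ord(ch) & 0xFF for ch in euro)

# ISO-8859-1 has no Euro sign at all, so a strict encode fails.
try:
    euro.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# The bytes lynx sent are simply the UTF-8 encoding of two accented letters.
lynx_text = b"\xc3\xa0\xc3\xa8".decode("utf-8")
```

Here utf8_bytes ends in the same byte that low_bytes contains, which is why the two are easy to confuse when only '\xac' arrives.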
Thanks Manlio From and-py at doxdesk.com Thu Dec 3 19:35:14 2009 From: and-py at doxdesk.com (And Clover) Date: Thu, 03 Dec 2009 19:35:14 +0100 Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec In-Reply-To: <4B179937.8070305@libero.it> References: <4B179937.8070305@libero.it> Message-ID: <4B1804E2.9070807@doxdesk.com> Manlio Perillo wrote: > However what about URI (that is, for PATH_INFO and the like)? > For URI (if I remember correctly) the suggested encoding is UTF-8, so > URLS should be decoded using > url.decode('utf-8', 'surrogateescape') > Is this correct? The currently-discussed proposal is ISO-8859-1, allowing the real bytes to be trivially extracted. This is consistent with the other headers and would be my preferred approach. Python 3.1's wsgiref.simple_server, on the other hand, blindly uses urllib.unquote, which defaults to UTF-8 without surrogateescape, mangling any non-UTF-8 input. I don't really care whether UTF-8+surrogateescape or ISO-8859-1 encoding is blessed. But *something* needs to be blessed. An encoding, an alternative undecoded path_info, both, something else... just *something*. > Let's consider the `wsgiref.util.application_uri` function > There is a potential problem, here, with the quote function. Yes. wsgiref is broken in Python 3.1. Not quite as broken as it was in 3.0, but still broken. Until we can come to a Pronouncement on what WSGI *is* in Python 3, it is meaningless anyway. > Cookie data SHOULD be transparent to the server/gateway; however WSGI is > going to assume that data is encoded in latin-1. Yeah. This is no big deal because non-ASCII characters in cookies are already broken everywhere(*). Given this and other limitations on what characters can go in cookies, they are habitually encoded using ad-hoc mechanisms handled by the application (typically a round of URL-encoding). *: in particular: - Opera and Chrome send non-ASCII cookie characters in UTF-8. 
- IE encodes using the system codepage (which can never be UTF-8),
mangling any characters that don't fit in the codepage through the
traditional Windows 'similar replacement character' scheme.
- Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1
gets through but everything else is mangled)
- Safari refuses to send any cookie containing non-ASCII characters.

> I don't know what the HTTP/Cookie spec says about this.

The traditional interpretation of RFC2616 is that headers are ISO-8859-1.

You will notice that no browser correctly follows this.

...sigh.

--
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From manlio_perillo at libero.it Thu Dec 3 19:52:14 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 19:52:14 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1804E2.9070807@doxdesk.com>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
Message-ID: <4B1808DE.5080705@libero.it>

And Clover wrote:
> [...]
>> Cookie data SHOULD be transparent to the server/gateway; however WSGI is
>> going to assume that data is encoded in latin-1.
>
> Yeah. This is no big deal because non-ASCII characters in cookies are
> already broken everywhere(*). Given this and other limitations on what
> characters can go in cookies, they are habitually encoded using ad-hoc
> mechanisms handled by the application (typically a round of URL-encoding).
>
> *: in particular:
>
> - Opera and Chrome send non-ASCII cookie characters in UTF-8.
> - IE encodes using the system codepage (which can never be UTF-8),
> mangling any characters that don't fit in the codepage through the
> traditional Windows 'similar replacement character' scheme.
> - Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1 > gets through but everything else is mangled) > - Safari refuses to send any cookie containing non-ASCII characters. > Thanks for this summary. I think it should go in a wiki or in a separate document (like rationale) to the WSGI spec. However this should never happen with cookie, since cookie data is opaque to browser, and it MUST send it "as is". What you describe happen with other headers containing TEXT. And now I understand that strange behaviour of Firefox with non latin-1 strings in username, in HTTP Basic Authentication. > [...] Regards Manlio From foom at fuhm.net Thu Dec 3 20:00:27 2009 From: foom at fuhm.net (James Y Knight) Date: Thu, 3 Dec 2009 14:00:27 -0500 Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec In-Reply-To: <4B1804E2.9070807@doxdesk.com> References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com> Message-ID: <1D42E723-CBD1-46B3-A1D0-53CA126CC6E2@fuhm.net> On Dec 3, 2009, at 1:35 PM, And Clover wrote: > Manlio Perillo wrote: > >> However what about URI (that is, for PATH_INFO and the like)? >> For URI (if I remember correctly) the suggested encoding is UTF-8, so >> URLS should be decoded using > >> url.decode('utf-8', 'surrogateescape') > >> Is this correct? > > The currently-discussed proposal is ISO-8859-1, allowing the real bytes to be trivially extracted. This is consistent with the other headers and would be my preferred approach. Right, for WSGI 1.1 on Python 3.x, 8859-1 strings is the plan. Other, more ideologically pure options can be discussed for an incompatible revision of WSGI (e.g. the hypothetical 2.0). BTW: I hope to have a first draft of the changes by Monday. (But don't beat up on me if it's delayed; I am working on it.) 
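For concreteness, the 8859-1 plan for PATH_INFO discussed above might look roughly like this in Python 3. It is only a sketch: the sample paths and variable names are invented, and urllib.parse.unquote_to_bytes is used here merely as a convenient way to get at the raw bytes, not as a description of what any particular server does.

```python
from urllib.parse import unquote, unquote_to_bytes

raw_path = "/caf%C3%A9"  # hypothetical request path; %C3%A9 is UTF-8 for e-acute

# Server side: percent-decode to bytes, then wrap the bytes in an 8859-1 string.
path_bytes = unquote_to_bytes(raw_path)
path_info = path_bytes.decode("iso-8859-1")

# Application side: the real bytes are trivially recoverable, and the
# application picks the encoding it actually expects.
recovered = path_info.encode("iso-8859-1")
text = recovered.decode("utf-8")

# A non-UTF-8 path survives the same treatment...
legacy = unquote_to_bytes("/caf%E9").decode("iso-8859-1")

# ...whereas unquote() with its default UTF-8 decoding replaces the byte.
mangled = unquote("/caf%E9")
```

The last line shows the kind of loss complained about earlier in the thread: once the undecodable byte has been replaced, the application cannot get it back.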
James From and-py at doxdesk.com Thu Dec 3 20:11:54 2009 From: and-py at doxdesk.com (And Clover) Date: Thu, 03 Dec 2009 20:11:54 +0100 Subject: [Web-SIG] HTTP headers encoding In-Reply-To: <4B17CFE4.3020504@libero.it> References: <4B17CFE4.3020504@libero.it> Message-ID: <4B180D7A.9000708@doxdesk.com> Manlio Perillo wrote: > I have written a simple WSGI application that asks authentication > credentials Ho ho! This is another area that is Completely Broken Everywhere. It's actually a similar situation to the cookies: - Opera and Chrome send non-ASCII cookie characters in UTF-8. - IE encodes using the system codepage (which can never be UTF-8), mangling any characters that don't fit in the codepage through the traditional Windows 'similar replacement character' scheme. - Mozilla uses the low byte of each UTF-16 code point (so ISO-8859-1 gets through but everything else is mangled) - Safari uses ISO-8859-1, and refuses to send any cookie containing characters outside the 8859-1 repertoire. - Konqueror uses ISO-8859-1, and replaces any non-8859-1 character with a question mark. The HTTP standard has nothing to say about the encoding in use *inside* the base64-encoded Authorization byte-string token. It's anyone's guess, and every browser has guessed differently. (Safari here is at least slightly better than its behaviour with the cookies.) > (and I suspect that [IE] always use this encoding, instead of > iso-8859-1). It will certainly never send ISO-8859-1, but what it does send is locale dependent. Type an e-acute in your username on a Western machine and it'll send one byte sequence; type the same thing on an Eastern European Windows install and you'll get something quite different. > Firefox (Iceweasel 3.0.14, Linux Debian Squeeze) sends me a '\xac' > I don't know where \xac come from It's the low byte of UCS-2 codepoint U+20AC (EURO SIGN). Firefox simply discards the top 8 bits of each codepoint. > Unfortunately I can not test with IE 7 and 8. 
The behaviour has not changed.

> This is really a mess.

Isn't it.

> How is authorization username handled in common WSGI frameworks?

No-one supports non-ASCII characters in Authentication. Most web authors
simply move to cookies instead.

--
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From henry at precheur.org Thu Dec 3 20:25:25 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 3 Dec 2009 11:25:25 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1804E2.9070807@doxdesk.com>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
Message-ID: <20091203192525.GA3792@banane.novuscom.net>

On Thu, Dec 03, 2009 at 07:35:14PM +0100, And Clover wrote:
> >I don't know what the HTTP/Cookie spec says about this.
>
> The traditional interpretation of RFC2616 is that headers are ISO-8859-1.
>
> You will notice that no browser correctly follows this.

RFCs 2109 and 2965 say that a cookie's value can be anything:

> The VALUE is opaque to the user agent and may be anything the origin
> server chooses to send, possibly in a server-selected printable ASCII
> encoding.

Theoretically you could put something like 'foo\n\0bar' in a cookie.

Also a cookie can include comments which have to be encoded using ...
UTF-8:

> Comment=value
> OPTIONAL. Because cookies can be used to derive or store
> private information about a user, the value of the Comment
> attribute allows an origin server to document how it intends to
> use the cookie. The user can inspect the information to decide
> whether to initiate or continue a session with this cookie.
> Characters in value MUST be in UTF-8 encoding.
-- Henry Pr?cheur From henry at precheur.org Thu Dec 3 20:26:28 2009 From: henry at precheur.org (Henry Precheur) Date: Thu, 3 Dec 2009 11:26:28 -0800 Subject: [Web-SIG] HTTP headers encoding In-Reply-To: <4B17E2BB.9040806@libero.it> References: <4B17CFE4.3020504@libero.it> <4B17E2BB.9040806@libero.it> Message-ID: <20091203192628.GA18929@banane.novuscom.net> On Thu, Dec 03, 2009 at 05:09:31PM +0100, Manlio Perillo wrote: > This is really a mess. RFC 2617 doesn't specify any encoding for its headers, so it should be latin-1 everywhere. But on the web nobody respect standards. > How is authorization username handled in common WSGI frameworks? As far as I know, they don't handle this. They just return the string without dealing with the encoding issues. I think there is no correct way of handling this, because 99% of username/password contain only ascii characters. A possible 'workaround' would be to limit yourself to the ascii charset. If you get a non-ascii character raise an Exception. -- Henry Pr?cheur From manlio_perillo at libero.it Thu Dec 3 20:33:19 2009 From: manlio_perillo at libero.it (Manlio Perillo) Date: Thu, 03 Dec 2009 20:33:19 +0100 Subject: [Web-SIG] HTTP headers encoding In-Reply-To: <20091203192628.GA18929@banane.novuscom.net> References: <4B17CFE4.3020504@libero.it> <4B17E2BB.9040806@libero.it> <20091203192628.GA18929@banane.novuscom.net> Message-ID: <4B18127F.8070606@libero.it> Henry Precheur ha scritto: > [...] >> How is authorization username handled in common WSGI frameworks? > > As far as I know, they don't handle this. They just return the string > without dealing with the encoding issues. > > I think there is no correct way of handling this, because 99% of > username/password contain only ascii characters. A possible 'workaround' > would be to limit yourself to the ascii charset. If you get a non-ascii > character raise an Exception. 
>

Right now I'm doing a:

    username.decode('us-ascii', 'replace')


Regards Manlio

From henry at precheur.org Thu Dec 3 20:43:32 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 3 Dec 2009 11:43:32 -0800
Subject: [Web-SIG] HTTP headers encoding
In-Reply-To: <4B18127F.8070606@libero.it>
References: <4B17CFE4.3020504@libero.it> <4B17E2BB.9040806@libero.it> <20091203192628.GA18929@banane.novuscom.net> <4B18127F.8070606@libero.it>
Message-ID: <20091203194332.GA4875@banane.novuscom.net>

On Thu, Dec 03, 2009 at 08:33:19PM +0100, Manlio Perillo wrote:
> Right now I'm doing a: username.decode('us-ascii', 'replace')

Or like most frameworks you could let the application author deal with
the problem, just pass the raw strings to the application.

--
Henry Prêcheur

From manlio_perillo at libero.it Thu Dec 3 21:15:06 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Thu, 03 Dec 2009 21:15:06 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1804E2.9070807@doxdesk.com>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
Message-ID: <4B181C4A.7010307@libero.it>

And Clover wrote:
> Manlio Perillo wrote:
>
>> However what about URI (that is, for PATH_INFO and the like)?
>> For URI (if I remember correctly) the suggested encoding is UTF-8, so
>> URLS should be decoded using
>
>> url.decode('utf-8', 'surrogateescape')
>
>> Is this correct?
>
> The currently-discussed proposal is ISO-8859-1, allowing the real bytes
> to be trivially extracted. This is consistent with the other headers and
> would be my preferred approach.
>

There is something that I don't understand.

Some HTTP headers, like Accept-Language, contain data described as
`token`, where:

    token = 1*<any CHAR except CTLs or separators>

So a token, IMHO, is an opaque string, and it SHOULD NOT be decoded.
In Python 3.x it SHOULD be a byte string.

Text content is described as `TEXT`, where:

    The TEXT rule is only used for descriptive field contents and values
    that are not intended to be interpreted by the message parser. Words
    of *TEXT MAY contain characters from character sets other than
    ISO-8859-1 [22] only when encoded according to the rules of
    RFC 2047 [14].

    TEXT = <any OCTET except CTLs, but including LWS>

The only type of data where TEXT can be used is `quoted-string`.
A `quoted-string` only appears in well specified portions of a header.

So, IMHO, it is *not* correct for a WSGI middleware to return all HTTP
headers as Unicode strings.

This is up to the application/framework, which must parse each header,
split it into components and handle them as appropriate (as byte
string, Unicode string or instance of some other data type).

> [...]


Regards Manlio

From henry at precheur.org Thu Dec 3 23:02:26 2009
From: henry at precheur.org (Henry Precheur)
Date: Thu, 3 Dec 2009 14:02:26 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B181C4A.7010307@libero.it>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com> <4B181C4A.7010307@libero.it>
Message-ID: <20091203220226.GA15382@banane.novuscom.net>

On Thu, Dec 03, 2009 at 09:15:06PM +0100, Manlio Perillo wrote:
> There is something that I don't understand.
>
> Some HTTP headers, like Accept-Language, contains data described as
> `token`, where:
>
> token = 1*<any CHAR except CTLs or separators>
>
> So a token, IMHO, is an opaque string, and it SHOULD not decoded.
> In Python 3.x it SHOULD be a byte string.

I think this is more an issue that frameworks should deal with. By
decoding every header value to latin-1:

* It keeps WSGI simple. Simple is good.

* WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
  says. WSGI is about HTTP, but that doesn't necessarily include all
  other standards extending HTTP.

* It's possible to convert latin-1 strings to bytes without losing data.
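
The last point can be checked directly in Python 3: ISO-8859-1 maps each of the 256 byte values to the code point with the same number, so decoding and re-encoding is lossless. A minimal sketch follows; the header value is invented for illustration.

```python
# Every possible byte value survives a latin-1 decode/encode round trip.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip = as_text.encode("latin-1")

# Hypothetical header value carrying UTF-8 text on the wire.
wire_value = "name=caf\u00e9".encode("utf-8")

# What the application would be handed as a str under the latin-1 convention.
header_str = wire_value.decode("latin-1")

# The application gets the original bytes back and decodes them as it sees fit.
original = header_str.encode("latin-1")
decoded = original.decode("utf-8")
```

Note that header_str itself is not the intended text (the e-acute shows up as two characters); it is only a lossless carrier for the bytes.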
-- Henry Pr?cheur From and-py at doxdesk.com Fri Dec 4 00:50:27 2009 From: and-py at doxdesk.com (And Clover) Date: Fri, 04 Dec 2009 00:50:27 +0100 Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec In-Reply-To: <4B181C4A.7010307@libero.it> References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com> <4B181C4A.7010307@libero.it> Message-ID: <4B184EC3.9070804@doxdesk.com> Manlio Perillo wrote: > Words of *TEXT MAY contain characters from character sets other than > ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself specifically denies that an encoded-word can go in a quoted-string. RFC2047 encoded-words are not on-topic in an HTTP header(*); this has been confirmed by newer development work on HTTPbis by Reschke et al. (http://tools.ietf.org/wg/httpbis/). The "correct" way of escaping header parameters in an RFC*822-family protocol would be RFC2231's complex encoding scheme, but HTTP is explicitly not an 822-family protocol despite sharing many of the same constructs. See http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a strategy for how 2231 should interact with HTTP, but note that for now RFC2231-in-HTTP simply does not exist in any deployed tools. So for now there is basically nothing useful WSGI can do other than provide direct, byte-oriented (even if wrapped in 8859-1 unicode strings) access to headers. 
-- And Clover mailto:and at doxdesk.com http://www.doxdesk.com/ From manlio_perillo at libero.it Fri Dec 4 10:17:09 2009 From: manlio_perillo at libero.it (Manlio Perillo) Date: Fri, 04 Dec 2009 10:17:09 +0100 Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec In-Reply-To: <20091203220226.GA15382@banane.novuscom.net> References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com> <4B181C4A.7010307@libero.it> <20091203220226.GA15382@banane.novuscom.net> Message-ID: <4B18D395.4080801@libero.it> Henry Precheur ha scritto: > On Thu, Dec 03, 2009 at 09:15:06PM +0100, Manlio Perillo wrote: >> There is something that I don't understand. >> >> Some HTTP headers, like Accept-Language, contains data described as >> `token`, where: >> >> token = 1* >> >> So a token, IMHO, is an opaque string, and it SHOULD not decoded. >> In Python 3.x it SHOULD be a byte string. > > I think this is more an issue that frameworks should deal with. By > decoding every headers value to latin-1: > > * It keeps WSGI simple. Simple is good. > It is just as simple as using byte strings, IMHO. It is not simple, it is convenient because of (if I understand correctly) how code is converted by 2to3. > * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1) > says. WSGI is about HTTP, but that doesn't necessarily includes all > other standards extending HTTP. > HTTP never says to consided whole headers as latin-1 text, IMHO. > * It's possible to convert latin-1 strings to bytes without losing data. > Yes, but it is quite stupid to first convert to Unicode and then convert again to byte string. 

It is true, however, that this does not happen often; only for:
- WSGI applications that implement an HTTP proxy
- WSGI applications that need to support HTTP Digest Authentication
- WSGI applications that store encoded data in cookies


Regards Manlio

From manlio_perillo at libero.it Fri Dec 4 10:46:16 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Fri, 04 Dec 2009 10:46:16 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B184EC3.9070804@doxdesk.com>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com> <4B181C4A.7010307@libero.it> <4B184EC3.9070804@doxdesk.com>
Message-ID: <4B18DA68.7070702@libero.it>

And Clover wrote:
> Manlio Perillo wrote:
>
>> Words of *TEXT MAY contain characters from character sets other than
>> ISO-8859-1 [22] only when encoded according to the rules of RFC 2047
>
> Yeah, this is, unfortunately, a lie. The rules of RFC 2047 apply only to
> RFC*822-family 'atoms' and not elsewhere; indeed, RFC2047 itself
> specifically denies that an encoded-word can go in a quoted-string.
>
> RFC2047 encoded-words are not on-topic in an HTTP header(*); this has
> been confirmed by newer development work on HTTPbis by Reschke et al.
> (http://tools.ietf.org/wg/httpbis/).
>

Thanks.
HTTPbis seems to fix all these problems:

"Historically, HTTP has allowed field content with text in the
ISO-8859-1 [ISO-8859-1] character encoding and supported other
character sets only through use of [RFC2047] encoding. In practice,
most HTTP header field values use only a subset of the US-ASCII
character encoding [USASCII]. Newly defined header fields SHOULD limit
their field values to US-ASCII characters. Recipients SHOULD treat
other (obs-text) octets in field content as opaque data."

This is the new rule for `quoted-string`:

    quoted-string = DQUOTE *( qdtext / quoted-pair ) DQUOTE
    qdtext        = OWS / %x21 / %x23-5B / %x5D-7E / obs-text
                    ; OWS / <VCHAR except DQUOTE and "\"> / obs-text
    obs-text      = %x80-FF
    quoted-pair   = "\" ( WSP / VCHAR / obs-text )

> The "correct" way of escaping header parameters in an RFC*822-family
> protocol would be RFC2231's complex encoding scheme, but HTTP is
> explicitly not an 822-family protocol despite sharing many of the same
> constructs. See
> http://tools.ietf.org/html/draft-reschke-rfc2231-in-http-06 for a
> strategy for how 2231 should interact with HTTP, but note that for now
> RFC2231-in-HTTP simply does not exist in any deployed tools.
>

It seems reasonable.

> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.
>

Yes, this is what I think.
I have some doubts about wrapping the headers in 8859-1 unicode strings,
but luckily there is surrogateescape.


Regards Manlio

From henry at precheur.org Fri Dec 4 19:28:16 2009
From: henry at precheur.org (Henry Precheur)
Date: Fri, 4 Dec 2009 10:28:16 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B18D395.4080801@libero.it>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com> <4B181C4A.7010307@libero.it> <20091203220226.GA15382@banane.novuscom.net> <4B18D395.4080801@libero.it>
Message-ID: <20091204182816.GA2311@banane.novuscom.net>

On Fri, Dec 04, 2009 at 10:17:09AM +0100, Manlio Perillo wrote:
> It is just as simple as using byte strings, IMHO.

No, it's not. There were lots of discussions regarding this on the
mailing list. One of the main issues is that the standard library
supports bytes poorly. urllib for example expects strings not bytes.

> > * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
> > says. WSGI is about HTTP, but that doesn't necessarily includes all
> > other standards extending HTTP.
> >
>
> HTTP never says to consider whole headers as latin-1 text, IMHO.

It does:

  When no explicit charset parameter is provided by the sender, media
  subtypes of the "text" type are defined to have a default charset value
  of "ISO-8859-1" when received via HTTP.

  http://tools.ietf.org/html/rfc2616#section-3.7.1

> Yes, but it is quite stupid to first convert to Unicode and then convert
> again to byte string.

99% of the time latin-1 will work. And converting from Unicode to bytes
is not costly. 6 months ago I was a big fan of bytes, but bytes create
more problems than they solve.

-- 
Henry Prêcheur

From manlio_perillo at libero.it  Fri Dec  4 19:40:55 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Fri, 04 Dec 2009 19:40:55 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <20091204182816.GA2311@banane.novuscom.net>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
    <4B181C4A.7010307@libero.it>
    <20091203220226.GA15382@banane.novuscom.net>
    <4B18D395.4080801@libero.it>
    <20091204182816.GA2311@banane.novuscom.net>
Message-ID: <4B1957B7.6040800@libero.it>

Henry Precheur wrote:
> On Fri, Dec 04, 2009 at 10:17:09AM +0100, Manlio Perillo wrote:
>> It is just as simple as using byte strings, IMHO.
>
> No, it's not. There were lots of discussions regarding this on the
> mailing list. One of the main issues is that the standard library
> supports bytes poorly. urllib, for example, expects strings, not bytes.
>

I read last month's discussions 3 days ago!

The quote function supports byte strings, as an example.
What are the functions that do not work with byte strings?

>>> * WSGI sticks to what RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1)
>>>   says. WSGI is about HTTP, but that doesn't necessarily include all
>>>   other standards extending HTTP.
>>>
>> HTTP never says to consider whole headers as latin-1 text, IMHO.
>
> It does:
>
>   When no explicit charset parameter is provided by the sender, media
>   subtypes of the "text" type are defined to have a default charset value
>   of "ISO-8859-1" when received via HTTP.
>
>   http://tools.ietf.org/html/rfc2616#section-3.7.1
>

This is not correct.
First of all, HTTP never says that whole headers are of type TEXT.
Only specific components are of type TEXT.

Moreover, HTTPbis has finally clarified this; TEXT is no longer used,
and non-ASCII characters are instead to be considered opaque.

Do you really want to define the new WSGI specification to be "against"
the new (possible) HTTP spec?

Of course it will work; but since some code in the standard library
needs to be fixed (wsgiref.util.application_uri, as an example),
maybe it is better to fix it to work with byte strings.

Just my two cents.

> [...]


Regards  Manlio

From henry at precheur.org  Fri Dec  4 20:50:09 2009
From: henry at precheur.org (Henry Precheur)
Date: Fri, 4 Dec 2009 11:50:09 -0800
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B1957B7.6040800@libero.it>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
    <4B181C4A.7010307@libero.it>
    <20091203220226.GA15382@banane.novuscom.net>
    <4B18D395.4080801@libero.it>
    <20091204182816.GA2311@banane.novuscom.net>
    <4B1957B7.6040800@libero.it>
Message-ID: <20091204195009.GA5845@banane.novuscom.net>

On Fri, Dec 04, 2009 at 07:40:55PM +0100, Manlio Perillo wrote:
> What are the functions that do not work with byte strings?

Just to make things clear, I was talking about Python 3.

All the functions I tried whose names do not end with _from_bytes raise
an exception when given bytes. This includes urllib.parse.parse_qs &
urllib.parse.urlparse, which are rather critical ...

> First of all, HTTP never says that whole headers are of type TEXT.
> Only specific components are of type TEXT.

If parts of a header contain latin-1 characters, that means its
encoding is latin-1 (at least partially).

> Moreover, HTTPbis has finally clarified this; TEXT is no more used,
> instead non ascii characters are to be considered opaque.

Yes, but the HTTPbis draft also says:

  Historically, HTTP has allowed field content with text in the
  ISO-8859-1 character encoding.

And WSGI is not about HTTP in a distant future, it's about HTTP right
now.

> Do you really want to define the new WSGI specification to be "against"
> the new (possible) HTTP spec?

I don't know why it would be "against" it. WSGI aims to handle HTTP in
the real world. Releasing the HTTPbis spec won't take all the garbage
out of the web. There will still be latin-1 strings in headers passed
around for the next 10 years.

-- 
Henry Prêcheur

From manlio_perillo at libero.it  Fri Dec  4 21:09:35 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Fri, 04 Dec 2009 21:09:35 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <20091204195009.GA5845@banane.novuscom.net>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
    <4B181C4A.7010307@libero.it>
    <20091203220226.GA15382@banane.novuscom.net>
    <4B18D395.4080801@libero.it>
    <20091204182816.GA2311@banane.novuscom.net>
    <4B1957B7.6040800@libero.it>
    <20091204195009.GA5845@banane.novuscom.net>
Message-ID: <4B196C7F.6080902@libero.it>

Henry Precheur wrote:
> On Fri, Dec 04, 2009 at 07:40:55PM +0100, Manlio Perillo wrote:
>> What are the functions that do not work with byte strings?
>
> Just to make things clear, I was talking about Python 3.
>

I know.
Unfortunately I don't have Python 3 installed; I'm just reading the
code.

> All the functions I tried not ending with _from_bytes raise an exception
> with bytes. This includes urllib.parse.parse_qs & urllib.parse.urlparse
> which are rather critical ...
>

Ah, ok.
Can you show me the traceback of parse_qs?

Thanks.

>> First of all, HTTP never says that whole headers are of type TEXT.
>> Only specific components are of type TEXT.
>
> If parts of a header contain latin-1 characters, that means its
> encoding is latin-1 (at least partially).
>

This is not completely true.

> [...]
> And WSGI is not about HTTP in a distant future, it's about HTTP right
> now.
>
>> Do you really want to define the new WSGI specification to be "against"
>> the new (possible) HTTP spec?
>
> I don't know why it would be "against" it.

Well, I put it in quotes for this reason.
What I mean is that, IMHO:

- Using Unicode strings in WSGI is an abuse of Unicode strings
- This abuse is not justified by the HTTP spec

> [...]


Regards  Manlio

From manlio_perillo at libero.it  Sun Dec  6 14:43:43 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Sun, 06 Dec 2009 14:43:43 +0100
Subject: [Web-SIG] CGI WSGI and Unicode
Message-ID: <4B1BB50F.1040801@libero.it>

Hi.

I'm playing with Python 3.x, current revision.

I have noted that the data in os.environ are now Unicode strings.

In a CGI application, HTTP headers are Unicode strings, and are decoded
using the system default encoding.
In a future WSGI application, HTTP headers are Unicode strings, and are
decoded using latin-1 encoding.

In both cases, 'surrogateescape' is used.

Can this cause trouble and incompatibility problems?
I'm interested in special header handling, like cookies, that contain
opaque data.


Thanks  Manlio

From manlio_perillo at libero.it  Mon Dec  7 11:51:31 2009
From: manlio_perillo at libero.it (Manlio Perillo)
Date: Mon, 07 Dec 2009 11:51:31 +0100
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <88e286470912061736o7c1ab6b2v13aad4bc935bfb3d@mail.gmail.com>
References: <4B1BB50F.1040801@libero.it>
    <88e286470912061736o7c1ab6b2v13aad4bc935bfb3d@mail.gmail.com>
Message-ID: <4B1CDE33.3040805@libero.it>

Graham Dumpleton wrote:

Note: I'm sending the entire message to the mailing list.

> 2009/12/7 Manlio Perillo:
>> Hi.
>>
>> I'm playing with Python 3.x, current revision.
>>
>> I have noted that the data in the os.environ are now Unicode strings.
>>
>> In a CGI application, HTTP headers are Unicode strings, and are decoded
>> using system default encoding.
>> In a future WSGI application, HTTP headers are Unicode strings, and are
>> decoded using latin-1 encoding.
>>
>> In both cases, 'surrogateescape' is used.
>
> No, 'surrogateescape' is not necessary when using latin-1, or at least
> for variables which use latin-1.
>

The problem is that not all browsers use latin-1.
HTTP Digest authentication is one example.

> Use of 'surrogateescape' is only relevant in the context of some web
> servers and only relevant for specific variables, some of which aren't
> even part of set of variables which are required by WSGI.
>
> For example, in Apache/mod_wsgi, 'surrogateescape' is used on
> DOCUMENT_ROOT and SCRIPT_FILENAME.

What about HTTP_COOKIE?

> [...]
>> Can this cause troubles and incompatibility problems?
>> I'm interested in special header handling, like cookies, that contain
>> opaque data.
>
> The issues which CGI/WSGI bridge in Python 3.X has been discussed
> previously on the list.

It seems I missed it.

> It is acknowledged that there are problems to
> be solved there, at least to extent that CGI/WSGI bridge
> implementation has to correct the encoding, and also that that may
> only be solvable in Python 3.1 onwards due to not having access to
> what encoding was used for environment variables in Python 3.0. Not
> many people care about CGI these days and so no one has been bothered to
> come up with working CGI/WSGI bridge for Python 3.X.
>

CGI is very important; there are some kinds of web applications that
have problems when executing in a long-running process.

As an example, I prefer to run Trac and Mercurial instances as CGI.

> Graham


Regards  Manlio

From mborch at gmail.com  Mon Dec  7 12:19:04 2009
From: mborch at gmail.com (Malthe Borch)
Date: Mon, 07 Dec 2009 12:19:04 +0100
Subject: [Web-SIG] Move to bless Graham's WSGI 1.1 as official spec
In-Reply-To: <4B184EC3.9070804@doxdesk.com>
References: <4B179937.8070305@libero.it> <4B1804E2.9070807@doxdesk.com>
    <4B181C4A.7010307@libero.it> <4B184EC3.9070804@doxdesk.com>
Message-ID: <4B1CE4A8.9040104@gmail.com>

On 12/4/09 12:50 AM, And Clover wrote:
> So for now there is basically nothing useful WSGI can do other than
> provide direct, byte-oriented (even if wrapped in 8859-1 unicode
> strings) access to headers.

You could argue that this is perhaps a good reason to replace
``environ`` with something that interprets the headers according to how
HTTP is actually used in the real world.

It may be that WSGI should use bytes everywhere and the recommended
usage would be via a decorator (which could cache computations on the
environ dictionary): e.g. the raw application handler versus one
decorated with an imaginary ``webob`` function.

   def app(environ, start_response):
       ...

   @webob
   def app(request):
       ...

It is often said that WSGI should be practical, but in actual usage, I
think most developers use a request/response abstraction layer.

Middlewares are usually shrink-wrapped library code that could handle a
bytes-based environ dict (they'd have to explicitly decode the headers
of interest).

\malthe

From graham.dumpleton at gmail.com  Mon Dec  7 12:19:42 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Mon, 7 Dec 2009 22:19:42 +1100
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <4B1CDE33.3040805@libero.it>
References: <4B1BB50F.1040801@libero.it>
    <88e286470912061736o7c1ab6b2v13aad4bc935bfb3d@mail.gmail.com>
    <4B1CDE33.3040805@libero.it>
Message-ID: <88e286470912070319t4a9a5a4p4d765667eef312fe@mail.gmail.com>

2009/12/7 Manlio Perillo:
> Graham Dumpleton wrote:
>
> Note: I'm sending the entire message to the mailing list.
>
>> 2009/12/7 Manlio Perillo:
>>> Hi.
>>>
>>> I'm playing with Python 3.x, current revision.
>>>
>>> I have noted that the data in the os.environ are now Unicode strings.
>>>
>>> In a CGI application, HTTP headers are Unicode strings, and are decoded
>>> using system default encoding.
>>> In a future WSGI application, HTTP headers are Unicode strings, and are
>>> decoded using latin-1 encoding.
>>>
>>> In both cases, 'surrogateescape' is used.
>>
>> No, 'surrogateescape' is not necessary when using latin-1, or at least
>> for variables which use latin-1.
>>
>
> The problem is that not all browsers use latin-1.
> As an example with HTTP Digest authentication.

You seem to miss one important point. When converting bytes to unicode
as latin-1, the surrogate escape mechanism never comes into play. This
is because all byte values can be represented in latin-1, due to it
being a single-byte encoding which preserves the original bytes intact.

>> Use of 'surrogateescape' is only relevant in the context of some web
>> servers and only relevant for specific variables, some of which aren't
>> even part of set of variables which are required by WSGI.
>>
>> For example, in Apache/mod_wsgi, 'surrogateescape' is used on
>> DOCUMENT_ROOT and SCRIPT_FILENAME.
>
> What about HTTP_COOKIE?

You trimmed part of my response which is very important.

DOCUMENT_ROOT and SCRIPT_FILENAME must be dealt with per the file
system encoding and not latin-1. If you don't, the result may not be
compatible with input to file system routines in Python 3.1, which sort
of expect file system encoding plus surrogate escape.

As I say though, those variables aren't relevant to most WSGI hosting
mechanisms, and even for those where the web server provides them,
nearly all WSGI applications will not care about them.
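
The latin-1 point above can be checked directly. What follows is a
minimal sketch in Python 3 (3.1 or later, where 'surrogateescape'
exists); it is not code from the thread, and the path used is a made-up
example:

```python
# Every byte value 0..255 maps to exactly one code point under latin-1,
# so decoding cannot fail and re-encoding restores the original bytes.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")          # no error handler needed
assert as_text.encode("latin-1") == all_bytes  # lossless round trip

# UTF-8 is different: some byte sequences are invalid, which is where
# surrogateescape is needed (for example, for file system paths).
raw_path = b"/srv/caf\xe9"                     # hypothetical path, latin-1 bytes
escaped = raw_path.decode("utf-8", "surrogateescape")
assert escaped.encode("utf-8", "surrogateescape") == raw_path
```

The second half shows why variables handled per the file system
encoding need the error handler while latin-1 ones do not: the stray
0xE9 byte is not valid UTF-8, so it is carried as a lone surrogate and
restored on encoding.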

In Apache/mod_wsgi I worry about them because Apache/mod_wsgi provides
features which allow one to define Apache-style handlers based on file
type, where the handler for the arbitrary file type is implemented as a
WSGI application. In that case the file the URL mapped to, i.e.
SCRIPT_FILENAME, is an arbitrary file and not a WSGI script file.

In the case of HTTP_COOKIE, as far as the WSGI adapter goes, it just
converts it to unicode as per latin-1. So, it is washing its hands of
what to do with it because it cannot know and only the WSGI application
can. Because it is latin-1, no surrogate escape is involved.

In the WSGI application, where it knows what encoding may be used, the
WSGI application can convert back to bytes and then to a different
encoding, using surrogate escape if it wants to, to ensure there is no
outright error if bytes can't be represented in that alternate
encoding.

>> [...]
>>> Can this cause troubles and incompatibility problems?
>>> I'm interested in special header handling, like cookies, that contain
>>> opaque data.
>>
>> The issues which CGI/WSGI bridge in Python 3.X has been discussed
>> previously on the list.
>
> It seems I missed it.
>
>> It is acknowledged that there are problems to
>> be solved there, at least to extent that CGI/WSGI bridge
>> implementation has to correct the encoding, and also that that may
>> only be solvable in Python 3.1 onwards due to not having access to
>> what encoding was used for environment variables in Python 3.0. Not
>> many people care about CGI these days and so no one has been bothered to
>> come up with working CGI/WSGI bridge for Python 3.X.
>>
>
> CGI is very important; there are some kind of web applications that have
> problems when executing in a long running process.
>
> As an example, I prefer to run Trac and Mercurial instances as CGI.

Yes, I agree that there are some valid uses of a CGI/WSGI bridge,
although those two aren't the ones I would have in mind.
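
The application-side recoding described earlier in this message for
HTTP_COOKIE might look roughly like this. It is only a sketch: recode
is a hypothetical helper, not part of WSGI or of any library, and it
assumes the adapter delivered the header as bytes decoded with latin-1:

```python
def recode(value, encoding="utf-8"):
    """Reinterpret a latin-1-wrapped header value in another encoding.

    `value` is a str whose code points are really byte values, as a
    WSGI adapter following the latin-1 convention would deliver it.
    `recode` is a hypothetical helper used for illustration only.
    """
    raw = value.encode("latin-1")                   # recover the original bytes
    return raw.decode(encoding, "surrogateescape")  # bad bytes become surrogates

# A cookie the browser sent as UTF-8, as seen through the latin-1 wrapper.
wrapped = "name=café".encode("utf-8").decode("latin-1")
print(recode(wrapped))
```

Using 'surrogateescape' on the second decode means that a cookie which
turns out not to be valid UTF-8 still yields a string instead of an
exception, and the original bytes remain recoverable.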

For the record, CGI/WSGI adapters should also protect the original
stdin/stdout so a WSGI application doesn't cause problems by using
'print' or doing other odd stuff with input. I haven't seen a single
CGI/WSGI adapter which does it in a way that I would say is correct, or
at least robust against users doing stupid things, so encoding issues
aren't the only thing where CGI/WSGI adapters need work.

Graham

From arw1961 at yahoo.com  Mon Dec  7 21:23:18 2009
From: arw1961 at yahoo.com (Aaron Watters)
Date: Mon, 7 Dec 2009 12:23:18 -0800 (PST)
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <88e286470912070319t4a9a5a4p4d765667eef312fe@mail.gmail.com>
Message-ID: <106806.11256.qm@web32008.mail.mud.yahoo.com>

--- On Mon, 12/7/09, Graham Dumpleton wrote:

> For the record, CGI/WSGI adapters should also protect the original
> stdin/stdout so WSGI application doesn't cause problems by using
> 'print' or do other odd stuff with input. I haven't seen a single
> CGI/WSGI adapter which does it in a way that I would say is correct,
> or at least robust against users doing stupid things...

"There is no fool proof software: fools are too clever"

"Doctor, it hurts when I do this."
"Don't do that."

Some words of wisdom from folklore...
(or if anyone knows the correct attribution, please inform).

   -- Aaron Watters
      http://listtree.appspot.com
      http://whiffdoc.appspot.com

===
an apple every 8 hours
will keep 3 doctors away. -- kliban

From and-py at doxdesk.com  Tue Dec  8 16:27:41 2009
From: and-py at doxdesk.com (And Clover)
Date: Tue, 08 Dec 2009 16:27:41 +0100
Subject: [Web-SIG] CGI WSGI and Unicode
In-Reply-To: <4B1BB50F.1040801@libero.it>
References: <4B1BB50F.1040801@libero.it>
Message-ID: <4B1E706D.3000502@doxdesk.com>

Manlio Perillo wrote:

> In a CGI application, HTTP headers are Unicode strings, and are decoded
> using system default encoding.
> In a future WSGI application, HTTP headers are Unicode strings, and are
> decoded using latin-1 encoding.

Yes.
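
The two decodings quoted above, and the conversion from the CGI form to
the WSGI form, can be sketched as follows. This is an illustration
only: cgi_to_wsgi_str is a hypothetical helper, and it assumes the
environment value was decoded with the given encoding plus
'surrogateescape', as Python 3.1+ does for os.environ on POSIX systems:

```python
import sys

def cgi_to_wsgi_str(value, env_encoding=None):
    """Convert an os.environ-style str to the latin-1 convention.

    Hypothetical helper. Assumes `value` was produced by decoding the
    raw environment bytes with `env_encoding` and 'surrogateescape'.
    """
    if env_encoding is None:
        env_encoding = sys.getfilesystemencoding()
    raw = value.encode(env_encoding, "surrogateescape")  # undo the decode
    return raw.decode("latin-1")                         # bytes-as-str form

# Simulate a PATH_INFO whose raw bytes were UTF-8, read in a UTF-8 locale.
raw_bytes = "/caf\u00e9".encode("utf-8")
from_environ = raw_bytes.decode("utf-8", "surrogateescape")
wsgi_value = cgi_to_wsgi_str(from_environ, "utf-8")
assert wsgi_value.encode("latin-1") == raw_bytes
```

The final assertion is the property that matters: whatever the locale
did to the value, the application can get the original request bytes
back by encoding the WSGI string as latin-1.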
As proposed, WSGI 1.1 would require a CGI-to-WSGI handler to undo the
decode stage caused by reading environ using the default encoding. At
least this is now reliably possible thanks to surrogateescape.

PATH_INFO is the only really important HTTP-related environment
variable for Unicode. Potentially SCRIPT_NAME could also be significant
in relation to PATH_INFO. The HTTP headers don't massively matter
because there are almost never any non-ASCII characters in them.

Previously the job of undoing an unwanted decode step was dumped on
whatever read the PATH_INFO; usually a routing component, which would
have to make guesses with typically poor results. The CGI adapter is in
a much better place to do it, being closer to the server.

> The problem is that not all browsers use latin-1.

Not WSGI's problem. WSGI will deliver bytes encoded into Unicode
strings, not ready-to-use Unicode strings. It is up to the application
to decide how they want to handle those bytes; maybe they want Latin-1
and can do nothing, maybe they want to recode to UTF-8, maybe something
else completely. No solution satisfies every app so there is always
going to have to be a recode step somewhere. An application that
doesn't want to think about this will use a framework that does it for
them.

> What about HTTP_COOKIE?

For what it's worth, the choice of Latin-1 here results in the 'right'
Unicode string for more browsers than any other potential encoding. In
any case as previously discussed, non-ASCII cookies are already totally
broken everywhere and hence used by no-one.

-- 
And Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

From lavendula6654 at gmail.com  Fri Dec 11 17:54:08 2009
From: lavendula6654 at gmail.com (Elaine Haight)
Date: Fri, 11 Dec 2009 08:54:08 -0800
Subject: [Web-SIG] Software Development Courses
Message-ID: <3652e3600912110854w4b010078l53a493b31ce22d2@mail.gmail.com>

Foothill College is offering two courses of interest to web application
software developers: Ajax and Python.
These 11-week courses are held from January through March. The Ajax
class is entirely online, and the Python class meets Thursday evenings
at the Middlefield campus in Palo Alto.

"Application Software Development with Ajax" is a course designed for
students who are already familiar with some type of programming, and
have introductory knowledge of JavaScript and html. For more
information, go to:
http://www.foothill.edu/schedule/schedule.php
and choose Department: "COIN", quarter: "Winter 2010", and course
number "71".

"Introduction to Python Programming" meets Thursday evenings and is
also designed for students who are familiar with some type of
programming. The instructor is Marilyn Davis. For more information or
to register, go to:
http://www.foothill.edu/schedule/schedule.php
and choose Department: "CIS", quarter: "Winter 2010", and course number
"68K".

If you would like to sign up for a class, please register beforehand by
going to:
http://www.foothill.fhda.edu/reg/index.php
If you do not register ahead of time, the class you want may be
cancelled!

If you have questions, you can contact:
h a i g h t E l a i n e AT f o o t h i l l . e d u

From orsenthil at gmail.com  Sun Dec 20 19:08:19 2009
From: orsenthil at gmail.com (Senthil Kumaran)
Date: Sun, 20 Dec 2009 23:38:19 +0530
Subject: [Web-SIG] [RFC] urllib2 requests history + HEAD support
Message-ID: <20091220180819.GB4385@ubuntu.ubuntu-domain>

I need your opinion on this request.

The Python Standard Library module urllib2 supports GET and POST.
There was a feature request to add support for HEAD requests.

While that is a valid feature request, there was a suggestion to
include a history of the requests in the module. I don't find any
references in the RFCs for any such requirement to maintain a history
of requests.

Do you have any opinion on whether it is a good idea to have a history
of requests in the urllib2 module?
I personally feel that a history of requests can be tracked more easily
by the clients.

-- 
Senthil


On Sun, Dec 20, 2009 at 05:59:48PM +0000, Senthil Kumaran wrote:
>
> Senthil Kumaran added the comment:
>
> Having a HEAD request for urllib2 might be a good idea. I shall use this
> patch to add the functionality.
>
> But, having a history support in the urllib2 module is not a good idea
> IMO. It is best left to the clients which might use urllib2.
>
> ----------
>
> _______________________________________
> Python tracker
>
> _______________________________________

-- 
Senthil

Shannon's Observation:
	Nothing is so frustrating as a bad situation that is beginning
	to improve.

From henry at precheur.org  Mon Dec 21 19:24:38 2009
From: henry at precheur.org (Henry Precheur)
Date: Mon, 21 Dec 2009 18:24:38 +0000
Subject: [Web-SIG] [RFC] urllib2 requests history + HEAD support
In-Reply-To: <20091220180819.GB4385@ubuntu.ubuntu-domain>
References: <20091220180819.GB4385@ubuntu.ubuntu-domain>
Message-ID: <20091221182438.GA27899@li60-23.members.linode.com>

On Sun, Dec 20, 2009 at 11:38:19PM +0530, Senthil Kumaran wrote:
> I need your opinion on this request.
>
>
> Python Standard Library module urllib2 has support GET and POST.
> There was a feature request to add support for HEAD requests.

It would be nice to have other methods too, like PUT & DELETE:

http://tools.ietf.org/html/rfc2616#page-52

> While that is valid feature request, there was suggestion to include a
> history of the requests in the module. I don't find any references in
> the RFCS for any such requirement to maintain a history of requests.
>
> Do you have any opinion on whether is it a good idea to have history
> of requests in the urllib2 module? I personally feel that history of
> requests can be easier tracked by the clients.

This should be done by the client.
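
As a note for readers on later Python versions: urllib2 became
urllib.request in Python 3, and from Python 3.3 its Request class
accepts a method argument, which covers HEAD, PUT and DELETE as
requested above. A minimal sketch; the URLs are placeholders and
nothing is actually sent:

```python
import urllib.request

# Build (but do not send) requests with explicit HTTP methods.
head = urllib.request.Request("http://example.com/", method="HEAD")
put = urllib.request.Request("http://example.com/item", data=b"payload",
                             method="PUT")

print(head.get_method())   # HEAD
print(put.get_method())    # PUT

# Sending would look like this (needs network access):
# with urllib.request.urlopen(head) as response:
#     print(response.status, response.headers.get("Content-Type"))
```

Keeping a history of requests stays on the caller's side in this model:
the caller can simply append each Request object to its own list before
opening it.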

-- 
Henry Prêcheur

From orsenthil at gmail.com  Tue Dec 22 01:43:32 2009
From: orsenthil at gmail.com (Senthil Kumaran)
Date: Tue, 22 Dec 2009 06:13:32 +0530
Subject: [Web-SIG] [RFC] urllib2 requests history + HEAD support
In-Reply-To: <20091221182438.GA27899@li60-23.members.linode.com>
References: <20091220180819.GB4385@ubuntu.ubuntu-domain>
    <20091221182438.GA27899@li60-23.members.linode.com>
Message-ID: <20091222004331.GA5669@ubuntu.ubuntu-domain>

On Mon, Dec 21, 2009 at 06:24:38PM +0000, Henry Precheur wrote:
> On Sun, Dec 20, 2009 at 11:38:19PM +0530, Senthil Kumaran wrote:
> > There was a feature request to add support for HEAD requests.
>
> It would be nice to have other methods too, like PUT & DELETE:
>
> http://tools.ietf.org/html/rfc2616#page-52

Yes, I agree. Methods like PUT & DELETE also make sense in urllib2.
Folks currently wrap those around httplib. HEAD can be implemented in a
straightforward way using httplib, but if Request had a method
parameter, which takes HEAD, PUT or DELETE and behaves accordingly,
that would make it complete.

And as expected, many voted -1 on history support.
(I guess web-sig defaults the reply-to to To: rather than List:)

Thanks,

-- 
Senthil

"I have a bone to pick, and a few to break."
		-- Anonymous

From tseaver at palladion.com  Sun Dec 27 15:26:14 2009
From: tseaver at palladion.com (Tres Seaver)
Date: Sun, 27 Dec 2009 09:26:14 -0500
Subject: [Web-SIG] Future of WSGI
In-Reply-To: 
References: <4B0BA030.5010201@gmail.com>
    <4B0C4FFF.5070305@gmail.com>
Message-ID: 

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Malthe Borch wrote:
> 2009/11/24 Ian Bicking:
>> Why does this matter?
>
> It's all convention, but the CGI interpretation was to read the HTTP
> request line by line until a blank line came and that was the
> environment. Everything after that is the body.

"Headers", not environment: the CGI environment is literally the
os.environ set up by the CGI parent process before forking and execing
the script.


Tres.
- -- 
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAks3boYACgkQ+gerLs4ltQ5coACg0ijXgG1wy1BdNnPzN2Jm2FLG
1R0Anj0/o6zwjtatFERoQ2HS3BOgyVEA
=RhAH
-----END PGP SIGNATURE-----
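
Tres's distinction between headers and environment can be made
concrete. Under CGI (RFC 3875) the server copies the request headers
into environment variables before it runs the script. The following is
a minimal sketch of that naming convention only; headers_to_cgi_environ
is a hypothetical helper, not an API of any server or of wsgiref:

```python
def headers_to_cgi_environ(headers):
    """Map HTTP request header names to CGI-style variable names."""
    environ = {}
    for name, value in headers.items():
        key = name.upper().replace("-", "_")
        # Content-Type and Content-Length get no HTTP_ prefix under CGI.
        if key not in ("CONTENT_TYPE", "CONTENT_LENGTH"):
            key = "HTTP_" + key
        environ[key] = value
    return environ

print(headers_to_cgi_environ({
    "Host": "example.com",
    "Content-Type": "text/plain",
    "Cookie": "a=1",
}))
```

A real server sets many more variables (REQUEST_METHOD, PATH_INFO,
QUERY_STRING and so on) that do not come from headers at all, which is
why "environment" and "headers" are not interchangeable terms.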