From armin.ronacher at active-4.com Mon May 4 17:02:48 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Mon, 4 May 2009 15:02:48 +0000 (UTC)
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
Message-ID:

Hello everybody,

I just recently started looking at supporting Python 3 with one of my libraries (Werkzeug), mainly because the MoinMoin project, which uses the library in question, is considering it. Right now Werkzeug treats HTTP as Unicode aware, in the sense that everything that carries text data is encoded and decoded with a known encoding.

This is partially against the specification and not entirely correct, but it works best on modern browsers and is also what Django and Paste are doing.

Basically, the incoming request data is .decode(encoding)d (usually utf-8) before being passed to the user code, and unicode data is encoded back into the same encoding before it's sent to the server.

Now why is the current behavior of Python 3 a problem here? The encode/decode hack from above is obviously a solution for these kinds of applications, albeit not a good one. Interfaces like mod_wsgi already have the data as a bytestring and would decode it from latin1 just so that the application can encode it back and decode it as utf-8. Not only is this slow, it also means that the code does not survive a run through 2to3.

Now you could argue that the libraries were wrong in the first place and should support unicode strings that were encoded from latin1 and decoded, but it seems that very few libraries support that.

Now which strings carry data that could contain non-ascii characters from a source with an unknown encoding? Right now these are the following:

 * PATH_INFO
 * SCRIPT_NAME
 * QUERY_STRING
 * CONTENT_TYPE
 * HTTP_*

Also all headers that carry non-integer values (like HTTP_CONTENT_TYPE and CONTENT_TYPE).
Now it's true that the headers should not contain non-latin1 values, but reality shows that they do. Cookies are transmitted as headers as well, and no browser complains if you put utf-8 encoded stuff into them. It may be the case that for the browser this looks like latin1, but in the end the application decodes it from utf-8 and is happy.

Data sent from the application can continue to work as it does currently. However, Django, Werkzeug, Paste and many others that support unicode output will just check if the output is unicode, and if that's the case, encode to the desired encoding.

Also people abuse middlewares a lot, and they deal with incoming and outgoing data as well. One can expect these middlewares to work on known encodings as well, so those would do the encode/decode dance too.

If anyone knows the encoding of the environ, it is the webserver. Apparently there are issues getting the encoding of the environ, but those won't go away when moving that to the web application.

Because of that I propose that Python 3.1 ship a version of wsgiref that uses bytestrings for the headers in question, and that a section on Python 3 compatibility based on that be added to PEP 333.

I volunteer for writing a new section on Python 3 in PEP 333 :-)

Regards,
Armin

From graham.dumpleton at gmail.com Tue May 5 02:21:14 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 5 May 2009 10:21:14 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
Message-ID: <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>

2009/5/5 Armin Ronacher :
> Hello everybody,
>
> I just recently started looking at supporting Python 3 with one of my libraries
> (Werkzeug), mainly because the MoinMoin projects considers using it which uses
> the library in question.
> Right now what Werkzeug does is consider HTTP being
> Unicode aware in the sense that everything that carries text data is encoded and
> decoded into a known encoding.
>
> This is partially against the specification and not entirely correct, but it
> works the best on modern browsers and is also what Django and Paste are doing.
>
> It's basically that the incoming request data is .decode(encoding)d (usually
> utf-8) before passed to the user code and unicode data is encoded back into the
> same encoding before it's sent to the server.
>
> Now why is the current behavior of Python 3 a problem here? The encode, decode
> hack from above is obviously a solution for these kinds of applications, albeit
> not a good one. Interfaces like mod_wsgi already have the data as bytestring,
> would decode it from latin1 just that the application can encode it back and
> decode as utf-8. Not only is this slow but also does this mean that the code
> does not survive a run through 2to3.
>
> Now you could argue that the libraries where wrong in the first place and should
> support unicode strings that were encoded from latin1 and decoded, but seems
> like very few libraries support that.
>
> Now which strings carry data that could contain non-ascii characters from a
> source with an unknown encoding? Right now these are the following:
>
>  * PATH_INFO
>  * SCRIPT_NAME
>  * QUERY_STRING
>  * CONTENT_TYPE
>  * HTTP_*

Depending on underlying web server that WSGI adapter runs on, there might also be:

  REQUEST_URI
  PATH_TRANSLATED (??)

Yes I know these aren't required for WSGI, except to the extent that WSGI specification says:

  "A server or gateway should attempt to provide as many other CGI variables as are applicable."

Would have to check CGI but there may be more.

The way I thus read this is that keys are always strings, values will be strings, except for specific list of entries where values would be bytes.

Also, presume that wsgi.url_scheme will have string value.
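A minimal sketch of the reading above, for anyone who wants to see it concretely: keys are native strings, values are native strings, and only an explicit list of entries stays as bytes. The names `BYTES_KEYS`, `is_bytes_key` and `build_environ` are hypothetical and not part of mod_wsgi, wsgiref or any WSGI server; the choice of latin-1 for everything else follows the assumption mentioned in this thread.

```python
# Hypothetical helper, not taken from any real WSGI server: builds an environ
# in which only an explicit list of keys carries bytes and the rest is str.
BYTES_KEYS = {
    "PATH_INFO", "SCRIPT_NAME", "QUERY_STRING", "CONTENT_TYPE",
    "REQUEST_URI", "PATH_TRANSLATED",
}


def is_bytes_key(key):
    """Return True for keys whose values stay as raw bytes in this sketch."""
    return key in BYTES_KEYS or key.startswith("HTTP_")


def build_environ(raw):
    """Build an environ dict from a mapping of str keys to bytes values.

    Values on the bytes list pass through untouched; every other value is
    decoded as latin-1, which cannot fail for any byte sequence.
    """
    environ = {}
    for key, value in raw.items():
        environ[key] = value if is_bytes_key(key) else value.decode("latin-1")
    environ["wsgi.url_scheme"] = "http"  # always a native string
    return environ


raw = {
    "PATH_INFO": b"/caf\xc3\xa9",      # UTF-8 bytes, encoding unknown to the server
    "HTTP_COOKIE": b"name=\xc3\xa9",
    "REQUEST_METHOD": b"GET",
    "trac.env_path": b"/some/path",    # e.g. set through an Apache SetEnv directive
}
environ = build_environ(raw)
```

With this split the application decides how to decode `environ["PATH_INFO"]`, for example `environ["PATH_INFO"].decode("utf-8")`, while values such as `REQUEST_METHOD` arrive ready to use as text.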
Where things get difficult for me with Apache is where users can use SetEnv or mod_rewrite to define additional key/values to be added to the WSGI environment. For example:

  SetEnv trac.env_path /some/path

I can't see that I have any choice but to pass such settings through as strings, else more than likely it would cause problems for applications. Problem is it isn't clear what encoding stuff can be in Apache configuration. At the moment latin-1 is assumed.

Things though get more complicated when mod_rewrite is used, as the values could be derived from components of the URL which are being treated as bytes above. For example:

  RewriteCond %{THE_REQUEST} ^\ *([A-Z]+)\ *(.*)\ *(HTTP/.*)$
  RewriteRule . - [E=UNPARSED_URI:%1]

So, this is creating a new UNPARSED_URI value which is the original URL as it appeared in the request line. I can't know that, strictly speaking, this should be bytes.

As such, I think all I can do is always pass through additional values as string, interpreted as latin-1. If some special case handling is required, it would be up to the WSGI application. I am not too keen on special configuration directives to allow encoding and/or whether bytes are used, to be specified for each possible variable being set.

Anyway, this is special case stuff and if being done is likely going to be special to Apache/mod_wsgi. If people want consistency, they should just implement it as a WSGI middleware where they can rather than using mod_rewrite fiddles.

Now, if we are going to start using bytes for request headers, there is the other question of response data.

The original proposal in amendments was that application should provide bytes, but that WSGI adapter must accept either bytes or strings, with strings interpreted as latin-1.

Is there sense in being more strict in this case?

In Python 2.X some WSGI adapters only allow Python 2.X strings (ie., bytes) and reject unicode strings. Others will convert unicode strings, but rather than use latin-1, apply the default Python encoding.
Thus, there is no consistency.

As to wsgi.file_wrapper, the only logical thing seems to be to require the file object to return bytes, i.e. raw mode, and not be in text mode.

Ultimately I am just implementing the WSGI adapter, I'll follow whatever is decided. I am not in a position, since I don't develop stuff that runs on it, to know what is best. So, as long as it is clear what should be passed through as bytes for environment, ie., there is an all inclusive list, and don't somehow have to guess, then am fine either way. I'd just like to see some decision and for that decision not to be some time next year as am holding up mod_wsgi 3.0 until things have been clarified. :-(

Graham

> Also all headers that carry non integer values (like HTTP_CONTENT_TYPE and
> CONTENT_TYPE). Now it's true that the headers should not contain non latin1
> values but reality shows that they do. Cookies are transmitted as headers as
> well and no browser complains if you put utf-8 encoded stuff into it. It may be
> the case that for the browser this looks like latin1, but in the end the
> application decodes it from utf-8 and is happy.
>
> Data sent from the application can continue to work like they do currently.
> However for django, Werkzeug, paste and many others that support unicode output
> will just check if the output is unicode, and if that's the case, encode to the
> desired encoding.
>
> Also people abuse middlewares a lot and they deal with incoming and outgoing
> data as well. One can expect these middlewares to work on known encodings as
> well so those would do the encode / decode dance too.
>
> If one knows the encoding of the environ, then the webserver. Apparently there
> are issues getting the encoding of the environ but those won't go away when
> moving that to the web application.
>
> Because of that I propose that Python 3 would ship a version of wsgiref with
> Python 3.1 that uses bytestrings for the headers in question and add a section
> on Python 3 compatibility based on that to PEP 333.
>
> I volunteer for writing a new section on Python 3 in PEP 333 :-)
>
>
> Regards,
> Armin
>
> _______________________________________________
> Web-SIG mailing list
> Web-SIG at python.org
> Web SIG: http://www.python.org/sigs/web-sig
> Unsubscribe: http://mail.python.org/mailman/options/web-sig/graham.dumpleton%40gmail.com
>

From graham.dumpleton at gmail.com Tue May 5 12:04:02 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 5 May 2009 20:04:02 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <4A000694.9070401@active-4.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <4A000694.9070401@active-4.com>
Message-ID: <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>

2009/5/5 Armin Ronacher :
> Hi,
>
> Graham Dumpleton wrote:
>> I can't see but have choice but to pass such settings through as
>> strings, else more than likely would cause problems for applications.
>> Problem is it isn't clear what encoding stuff can be in Apache
>> configuration. At the moment latin-1 is assumed.
>
> Because those information does not have a specified encoding I can see
> nothing wrong with it passing that information as bytestrings. I would
> have no problem passing *all* values as bytestrings.

At what point does that become an inconvenience though? I guess that is my concern, because if one has to do too many manual conversions in an application, people will start to complain it becomes unwieldy to use. In other words, you make it easier or more logical for frameworks, but do you end up putting more burden on applications for stuff outside those core values.
So, for those core CGI values which the framework is going to modify even before an application sees them, then fine. Is the framework also going to set the rules as to what encoding is used for other values in the WSGI environment and convert them per that encoding when an application requests them, or is the application always going to have to deal with them as bytes?

As I keep saying, you guys who write the frameworks and applications are going to know better than I, I am just challenging the notions as a way of making people think about it so the end result is what is the most logical thing to do. ;-)

>> In Python 2.X some WSGI adapters only allow Python 2.X strings (ie.,
>> bytes) and reject unicode strings. Others will convert unicode
>> strings, but rather than use latin-1, apply the default Python
>> encoding. Thus, there is no consistency.
>
> I think most will assert-reject unicode types and in -O just ignore them
> and fail in some way. I haven't seen any of those doing a
> unicode->string conversion by encoding which btw is disallowed by the
> PEP anyways.

A CGI/WSGI bridge, if no explicit checks are made to disallow stuff other than strings, will usually attempt to write to sys.stdout whatever you give it. Thus unicode strings can be written and presumably default encoding is applied.

  >>> import sys
  >>> sys.stdout.write(u"abcd\n")
  abcd

One can even write buffers.

  >>> sys.stdout.write(buffer("abcd\n"))
  abcd

>> Ultimately I am just implementing the WSGI adapter, I'll follow
>> whatever is decided. I am not in a position, since I don't develop
>> stuff that runs on it, to know what is best. So, as long as it is
>> clear what should be passed through as bytes for environment, ie.,
>> there is an all inclusive list, and don't somehow have to guess, then
>> am fine either way. I'd just like to see some decision and for that
>> decision not to be some time next year as am holding up mod_wsgi 3.0
>> until things have been clarified.
>> :-(
>
> I hope we can find a solution for that before the Python 3.1 release,
> otherwise there is another wsgiref release with the current behavior
> which is just wrong.

We can hope, but I'm not holding my breath. It is going to be rather stupid though if what ends up being the standard is dictated by how wsgiref works in 3.1 as is.

Graham

From fumanchu at aminus.org Tue May 5 16:55:51 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Tue, 5 May 2009 07:55:51 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <4A000694.9070401@active-4.com> <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>
Message-ID:

Graham Dumpleton wrote:
> 2009/5/5 Armin Ronacher :
>> Graham Dumpleton wrote:
>>> I can't see but have choice but to pass such settings through as
>>> strings, else more than likely would cause problems for applications.
>>> Problem is it isn't clear what encoding stuff can be in Apache
>>> configuration. At the moment latin-1 is assumed.
>> Because those information does not have a specified encoding I can see
>> nothing wrong with it passing that information as bytestrings. I would
>> have no problem passing *all* values as bytestrings.
>
> At what point does that become an inconvenience though? I guess that
> is my concern, because if one has to do too many manual conversions in
> an application, people will start to complain it becomes unwieldy to
> use. In other words, you make it easier or more logical for
> frameworks, but do you end up putting more burden on applications for
> stuff outside those core values.
>
> So, for those core CGI values which the framework is going to modify
> even before an application sees them, then fine.
> Is the framework also
> going to set the rules as to what encoding is used for other values in
> the WSGI environment and convert them per that encoding when an
> application requests them, or is the application always going to have
> to deal with them as bytes?
>
> As I keep saying, you guys who write the frameworks and applications
> are going to know better than I, I am just challenging the notions as
> a way of making people think about it so the end result is what is the
> most logical thing to do. ;-)

In short: it's pretty easy for a framework to default to utf-8 for everything, yet give application developers ways to override that. See, for example, the cherrypy.tools.encoding Tool in our python3 branch--it's moved from running "sometime" after the page handler, to wrapping the page handler so all page handlers emit bytes. That makes it possible for everyone to use unicode strings everywhere, yet still allow some to specify exact bytes as necessary.

In shorter: don't worry about that part, we've got it covered. ;)

Robert Brewer
fumanchu at aminus.org

From ianb at colorstudy.com Tue May 5 19:01:07 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 5 May 2009 12:01:07 -0500
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <4A000694.9070401@active-4.com> <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>
Message-ID:

Philip Jenvey brought this to my attention:

  http://www.python.org/dev/peps/pep-0383/

It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such a way that you can decode to get the original bytes object, and thus transcode to another encoding. It's intended for cases exactly like WSGI.
--
Ian Bicking | http://blog.ianbicking.org

From graham.dumpleton at gmail.com Wed May 6 05:14:04 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 6 May 2009 13:14:04 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <4A000694.9070401@active-4.com> <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>
Message-ID: <88e286470905052014h342e58c7m58dc655a5b4be543@mail.gmail.com>

2009/5/6 Ian Bicking :
> Philip Jenvey brought this to my attention:
>
>   http://www.python.org/dev/peps/pep-0383/
>
> It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such
> a way that you can decode to get the original bytes object, and thus
> transcode to another encoding. It's intended for cases exactly like WSGI.

Care to explain then how that would in practice be used while I try and reread it a few times to try and understand it myself? :-)

Graham

From ianb at colorstudy.com Wed May 6 05:27:17 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 5 May 2009 22:27:17 -0500
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905052014h342e58c7m58dc655a5b4be543@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <4A000694.9070401@active-4.com> <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com> <88e286470905052014h342e58c7m58dc655a5b4be543@mail.gmail.com>
Message-ID:

On Tue, May 5, 2009 at 10:14 PM, Graham Dumpleton <graham.dumpleton at gmail.com> wrote:
> 2009/5/6 Ian Bicking :
> > Philip Jenvey brought this to my attention:
> >
> > http://www.python.org/dev/peps/pep-0383/
> >
> > It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such
> > a way that you can decode to get the original bytes object, and thus
> > transcode to another encoding. It's intended for cases exactly like WSGI.
>
> Care to explain then how that would in practice be used while I try
> and reread it a few times to try and understand it myself? :-)
>

I don't particularly know, except I think you'd do things like:

  environ['PATH_INFO'] = urllib.unquote(http_byte_path).decode('utf8', 'python-escape')

Then if the encoding was wrong, you could transcode like:

  environ['PATH_INFO'] = environ['PATH_INFO'].encode('utf8', 'python-escape').decode('latin1', 'python-escape')

Note that you need to know the encoding that was used (utf8 in this case) and that python-escape was used. It has been suggested that the server should put the encoding it used into the environment. When transcoding this should also be updated.

It's not clear what python-escape is going to do, I don't think that's been determined. Probably it'll put \x00 or something in the unicode string to mark raw bytes.

--
Ian Bicking | http://blog.ianbicking.org
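For reference, the round trip described in the message above can be sketched with the error handler that PEP 383 eventually shipped in Python 3.1 under the name `surrogateescape` (the placeholder name `python-escape` used in the thread never landed). The variable names here are illustrative only, and the sketch skips URL unquoting; undecodable bytes are carried as lone surrogates in the range U+DC80 to U+DCFF rather than as `\x00` markers.

```python
# Assumes Python 3.1 or later, where PEP 383's handler is "surrogateescape".
raw_path = b"/caf\xe9"  # latin-1 bytes; not valid UTF-8

# Server side: decode optimistically as UTF-8. The undecodable byte 0xE9 is
# smuggled through as the lone surrogate U+DCE9 instead of raising an error.
path_info = raw_path.decode("utf-8", "surrogateescape")

# Application side: recover the exact original bytes, then decode them with
# the encoding the application actually expects.
recovered = path_info.encode("utf-8", "surrogateescape")
transcoded = recovered.decode("latin-1")
```

As noted in the thread, the application still has to know which encoding and error handler the server used on the way in, otherwise it cannot reverse the step reliably.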
From graham.dumpleton at gmail.com Fri May 8 13:34:51 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 8 May 2009 21:34:51 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
Message-ID: <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>

2009/5/5 Graham Dumpleton
>>> Now, if we are going to start using bytes for request headers, there
>>> is the other question of response data.
>>>
>>> The original proposal in amendments was that application should
>>> provide bytes, but that WSGI adapter must accept either bytes or
>>> strings, with strings interpreted as latin-1.
>>>
>>> Is there sense in being more strict in this case?
>>>
>>> In Python 2.X some WSGI adapters only allow Python 2.X strings (ie.,
>>> bytes) and reject unicode strings. Others will convert unicode
>>> strings, but rather than use latin-1, apply the default Python
>>> encoding. Thus, there is no consistency.
>>
>> I think most will assert-reject unicode types and in -O just ignore them
>> and fail in some way. I haven't seen any of those doing a
>> unicode->string conversion by encoding which btw is disallowed by the
>> PEP anyways.
>
> A CGI/WSGI bridge, if no explicit checks are made to disallow stuff
> other than strings, will usually attempt to write to sys.stdout
> whatever you give it. Thus unicode strings can be written and
> presumably default encoding is applied.
>
> >>> sys.stdout.write(u"abcd\n")
> abcd
>
> One can even write buffers.
>
> >>> sys.stdout.write(buffer("abcd\n"))
> abcd

Robert, do you have any comments on the restricting of response content to bytes and not allow fallback to conversion per latin-1?

I heard that in CherryPy WSGI server you are only allowing bytes. What is your rationale for that at the moment?
Graham

From fumanchu at aminus.org Fri May 8 17:07:13 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Fri, 8 May 2009 08:07:13 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
Message-ID:

Graham Dumpleton wrote:
> Robert, do you have any comments on the restricting of response
> content to bytes and not allow fallback to conversion per latin-1?
>
> I heard that in CherryPy WSGI server you are only allowing bytes. What
> is your rational for that at the moment?

In Python 2.x, one could easily mix unicode strings and byte strings in the same interface, because they mostly supported the same operations. Not so in Python 3.x--byte strings are missing everything from capitalize() to zfill() [1]. I feel that choosing one type or the other is required in order to avoid mountains of if-statements in middleware (and lots of 'pass' statements if bytes are found).

I decided that that single type should be byte strings because I want WSGI middleware and applications to be able to choose what encoding their output is. Passing unicode to the server would require some out-of-band method of telling the server which encoding to use per response, which seemed unacceptable.

The down side, already alluded to, is that middleware cannot then call e.g. response.capitalize() or any of a number of other methods without first decoding the response. And it cannot do that reliably unless (again) the encoding which was used to produce bytes is communicated down the stack out of band.

The python3 branch of CherryPy is by no means complete. I'd be happy to explore emitting unicode if we could decide on a method whereby apps could inform the server which encoding they want.
Middleware which transcoded the response would need a means of overriding that. But of course, that opens a whole new can of worms if something goes wrong, because application authors want control over the error response; if the server is encoding the response, and an error occurs, there would have to be a way to pass control back up the stack to...what? whichever component last set the encoding? That road starts to get complicated very quickly.

If some middleware needs to treat the response as unicode, I'd rather emit bytes and somehow return the encoding as part of the response. Perhaps WSGI 2's mythical "return (status, headers, body-iterable, encoding)". Middleware could then decode/transcode as desired. I can't think of a downside to that, other than some lost cycles spent de/encoding, but perhaps there are some I don't yet foresee.

Robert Brewer
fumanchu at aminus.org

[1] See http://docs.python.org/dev/py3k/library/stdtypes.html#string-methods

From pje at telecommunity.com Fri May 8 17:58:28 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 11:58:28 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
Message-ID: <20090508155551.662113A4109@sparrow.telecommunity.com>

At 08:07 AM 5/8/2009 -0700, Robert Brewer wrote:
>I decided that that single type should be byte strings because I want
>WSGI middleware and applications to be able to choose what encoding
>their output is. Passing unicode to the server would require some
>out-of-band method of telling the server which encoding to use per
>response, which seemed unacceptable.
I find the above baffling, since PEP 333 explicitly states that when using unicode types, they're not actually supposed to *be* unicode -- they're just bytes decoded with latin-1.

So, the server doesn't need to know "what encoding to use" -- it's latin-1, plain and simple. (And it's an error for an application to produce a unicode string that can't be encoded as latin-1.)

To be even more specific: an application that produces strings can "choose what encoding to use" by encoding in it, then decoding those bytes via latin-1. (This is more or less what Jython and IronPython users are doing already, I believe.)

From fumanchu at aminus.org Fri May 8 19:37:10 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Fri, 8 May 2009 10:37:10 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <20090508155551.662113A4109@sparrow.telecommunity.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com> <20090508155551.662113A4109@sparrow.telecommunity.com>
Message-ID:

P.J. Eby wrote:
> At 08:07 AM 5/8/2009 -0700, Robert Brewer wrote:
>> I decided that that single type should be byte strings because I want
>> WSGI middleware and applications to be able to choose what encoding
>> their output is. Passing unicode to the server would require some
>> out-of-band method of telling the server which encoding to use per
>> response, which seemed unacceptable.
>
> I find the above baffling, since PEP 333 explicitly states that
> when using unicode types, they're not actually supposed to *be*
> unicode -- they're just bytes decoded with latin-1.

It also explicitly states that "HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; all strings passed to or from the server must be standard Python BYTE STRINGS (emphasis mine), not Unicode objects.
The result of using a Unicode object where a string object is required, is undefined."

PEP 333 is difficult to interpret because it uses the name "str" synonymously with the concept "byte string", which Python 3000 defies. I believe the intent was to differentiate unicode from bytes, not elevate whatever type happens to be called "str" on your Python du jour. It was and is a mistake to standardize on type names ("str") across platforms and not on type behavior ("byte string").

If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're effectively saying the server will always call "chunk.encode('latin-1')". That negates any benefit of using unicode as the type for the response. That's not "supporting unicode"; that's using unicode exactly as if it were an opaque byte string. That seems silly to me when there is a perfectly useful byte string type.

> So, the server doesn't need to know "what encoding to use" -- it's
> latin-1, plain and simple. (And it's an error for an application to
> produce a unicode string that can't be encoded as latin-1.)
>
> To be even more specific: an application that produces strings can
> "choose what encoding to use" by encoding in it, then decoding those
> bytes via latin-1. (This is more or less what Jython and IronPython
> users are doing already, I believe.)

That may make sense for Jython and IronPython if they truly do not have a usable byte string type. But it doesn't make as much sense for Python3 which has a usable byte string type.

My way:

    App                                  Server
    ---                                  ------
    bchunk = uchunk.encode('utf-8')
    yield bchunk
                                         write(bchunk)

Your way:

    App                                  Server
    ---                                  ------
    bchunk = uchunk.encode('utf-8')
    uchunk = bchunk.decode('latin-1')
    yield uchunk
                                         bchunk = uchunk.encode('latin-1')
                                         write(bchunk)

I don't see any benefit to that.

Robert Brewer
fumanchu at aminus.org
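The two flows compared in the message above can be written out as runnable Python 3 to confirm that they put identical bytes on the wire. This is only a sketch: the function names `app_bytes`, `app_str`, `server_bytes` and `server_str` are hypothetical, a list join stands in for the server's `write()`, and the sample text is arbitrary.

```python
uchunk = "na\u00efve caf\u00e9 \u2615"  # arbitrary text the application wants to send


def app_bytes():
    """'My way': the application yields bytes it encoded itself."""
    yield uchunk.encode("utf-8")


def server_bytes(app):
    """The server writes the bytes exactly as received."""
    return b"".join(app())


def app_str():
    """'Your way': UTF-8 bytes are disguised as a latin-1 'pseudo text' str."""
    yield uchunk.encode("utf-8").decode("latin-1")


def server_str(app):
    """The server undoes the disguise by encoding each chunk as latin-1."""
    return b"".join(chunk.encode("latin-1") for chunk in app())


wire_a = server_bytes(app_bytes)
wire_b = server_str(app_str)
```

Both paths yield the same UTF-8 bytes; the second simply adds a decode and an encode step, because every byte value maps to exactly one latin-1 code point and back.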
From foom at fuhm.net Fri May 8 20:39:53 2009
From: foom at fuhm.net (James Y Knight)
Date: Fri, 8 May 2009 14:39:53 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com> <20090508155551.662113A4109@sparrow.telecommunity.com>
Message-ID:

On May 8, 2009, at 1:37 PM, Robert Brewer wrote:
> If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're
> effectively saying the server will always call
> "chunk.encode('latin-1')". That negates any benefit of using unicode as
> the type for the response. That's not "supporting unicode"; that's using
> unicode exactly as if it were an opaque byte string. That's seems silly
> to me when there is a perfectly useful byte string type.

Agreed. Accepting py3k "str" and always encoding in latin-1 is basically just undoing the separation of unicode&byte-strings that was one of Py3k's major design goals.

Probably nothing in WSGI should be allowed to be given as either bytestring or character string. The spec should choose one or the other for each circumstance. And for body content it's clear that the only sane thing is a bytestring.

From pje at telecommunity.com Sat May 9 00:00:47 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 18:00:47 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com> <20090508155551.662113A4109@sparrow.telecommunity.com>
Message-ID: <20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>

At 10:37 AM 5/8/2009 -0700, Robert Brewer wrote:
>It also explicitly states that "HTTP does not directly support Unicode,
>and neither does this interface.
>All encoding/decoding must be handled
>by the application; all strings passed to or from the server must be
>standard Python BYTE STRINGS (emphasis mine), not Unicode objects. The
>result of using a Unicode object where a string object is required, is
>undefined."

It also says what the interpretation is when 'str' is a unicode string type.

>PEP 333 is difficult to interpret because it uses the name "str"
>synonymously with the concept "byte string", which Python 3000 defies. I
>believe the intent was to differentiate unicode from bytes, not elevate
>whatever type happens to be called "str" on your Python du jour. It was
>and is a mistake to standardize on type names ("str") across platforms
>and not on type behavior ("byte string").

Ironically, 'str' is what's consistent in type behavior; the bytes
type doesn't supply the same operations.

>If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're
>effectively saying the server will always call
>"chunk.encode('latin-1')". That negates any benefit of using unicode as
>the type for the response. That's not "supporting unicode"; that's using
>unicode exactly as if it were an opaque byte string. That seems silly
>to me when there is a perfectly useful byte string type.

Compatibility sometimes demands we do silly things. Personally, I
think it's kind of silly that Python 3 files return incompatible data
types depending on what mode you open them in, but there's not a
whole lot we can do about that.

Meanwhile, existing WSGI code ported to Python 3 is going to yield
strings until/unless manually converted; AFAIK 2to3 has no way to
automatically detect WSGI-ness and convert your strings to bytes.

>I don't see any benefit to that.

There isn't any benefit to doing it by *hand*. However, backward
compatibility demands that servers *accept* such strings, as they may
be generated by legacy apps. That's why the Python 3 WSGI amendments
say servers MUST accept this, even though applications SHOULD supply bytes.
That is, for new code, we do want bytes. What we don't want, ever, is
unicode characters above #255 in any unicode strings sent as part of
the response body.

From pje at telecommunity.com Sat May 9 00:02:52 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 18:02:52 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
    <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
    <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
    <20090508155551.662113A4109@sparrow.telecommunity.com>
Message-ID: <20090508220014.CE93F3A40A5@sparrow.telecommunity.com>

At 02:39 PM 5/8/2009 -0400, James Y Knight wrote:
>On May 8, 2009, at 1:37 PM, Robert Brewer wrote:
>>If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're
>>effectively saying the server will always call
>>"chunk.encode('latin-1')". That negates any benefit of using unicode
>>as the type for the response. That's not "supporting unicode"; that's
>>using unicode exactly as if it were an opaque byte string. That seems
>>silly to me when there is a perfectly useful byte string type.
>
>Agreed. Accepting py3k "str" and always encoding in latin-1 is
>basically just undoing the separation of unicode & byte-strings that was
>one of Py3k's major design goals.
>
>Probably nothing in WSGI should be allowed to be given
>as either bytestring or character string. The spec should choose one
>or the other for each circumstance. And for body content it's clear
>that the only sane thing is a bytestring.

With the amendments as written (and previously discussed here),
accepting latin-1 (or ASCII-only) strings allows backward
compatibility with code converted via 2to3. Otherwise, you would
have to track down every string-returning function in your program
that *might* be used to generate a response or a yielded portion thereof.

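The MUST-accept / SHOULD-supply rule described above could look roughly like this on the server side. This is a sketch only: `to_wire` is a hypothetical helper name, not part of wsgiref or any WSGI server, and raising `ValueError` is an assumption about how a server might report the problem.

```python
def to_wire(chunk):
    """Normalize one response-body chunk to bytes (hypothetical helper)."""
    if isinstance(chunk, bytes):
        # New code: the application already supplies bytes.
        return chunk
    if isinstance(chunk, str):
        # Legacy code converted by 2to3: accept str, but only when every
        # character fits in one byte (code points 0 through 255).
        try:
            return chunk.encode("latin-1")
        except UnicodeEncodeError as exc:
            raise ValueError(
                "response str contains a character above U+00FF"
            ) from exc
    raise TypeError("response chunk must be bytes or str")
```

A converted application that yields plain ASCII or latin-1 text keeps working unchanged, while text outside that range fails loudly instead of being silently mangled.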
From foom at fuhm.net Sat May 9 00:05:38 2009
From: foom at fuhm.net (James Y Knight)
Date: Fri, 8 May 2009 18:05:38 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
    <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
    <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
    <20090508155551.662113A4109@sparrow.telecommunity.com>
    <20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
Message-ID:

On May 8, 2009, at 6:00 PM, P.J. Eby wrote:

> Compatibility sometimes demands we do silly things. Personally, I
> think it's kind of silly that Python 3 files return incompatible
> data types depending on what mode you open them in, but there's not
> a whole lot we can do about that.
>
> Meanwhile, existing WSGI code ported to Python 3 is going to yield
> strings until/unless manually converted; AFAIK 2to3 has no way to
> automatically detect WSGI-ness and convert your strings to bytes.

Yes, 2to3 doesn't work for any non-trivial app... You have this same
exact issue with straight-up sockets! Why should WSGI be the
odd-man-out here and accept strings when you should've passed a
bytestring, when nothing else in python 3 does that, and has the exact
same backwards-compat problems?

James

From pje at telecommunity.com Sat May 9 00:58:19 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 18:58:19 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To:
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
    <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
    <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
    <20090508155551.662113A4109@sparrow.telecommunity.com>
    <20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
Message-ID: <20090508225543.B194E3A40A5@sparrow.telecommunity.com>

At 06:05 PM 5/8/2009 -0400, James Y Knight wrote:
>On May 8, 2009, at 6:00 PM, P.J. Eby wrote:
>>Compatibility sometimes demands we do silly things. Personally, I
>>think it's kind of silly that Python 3 files return incompatible
>>data types depending on what mode you open them in, but there's not
>>a whole lot we can do about that.
>>
>>Meanwhile, existing WSGI code ported to Python 3 is going to yield
>>strings until/unless manually converted; AFAIK 2to3 has no way to
>>automatically detect WSGI-ness and convert your strings to bytes.
>
>Yes, 2to3 doesn't work for any non-trivial app... You have this same
>exact issue with straight-up sockets! Why should WSGI be the
>odd-man-out here and accept strings when you should've passed a
>bytestring, when nothing else in python 3 does that, and has the exact
>same backwards-compat problems?

Hell if I know. I'm just explaining (possibly incorrectly) why the
consensus went that way last time we discussed it here... a
consensus that I thought you were part of actually, but maybe my
memory is faulty. (Hell, it happened so long ago that at one point I
forgot we'd ever discussed it in the first place!)

I'm going back to the sidelines now, to rant about the good old days
when all we had were 'str' and 'unicode' (and we liked it), and then
yell at some teenagers to get off my lawn. ;-)

From foom at fuhm.net Sat May 9 00:59:56 2009
From: foom at fuhm.net (James Y Knight)
Date: Fri, 8 May 2009 18:59:56 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <20090508225543.B194E3A40A5@sparrow.telecommunity.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
    <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
    <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
    <20090508155551.662113A4109@sparrow.telecommunity.com>
    <20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
    <20090508225543.B194E3A40A5@sparrow.telecommunity.com>
Message-ID: <399BEDB0-D28C-49BB-BAE3-7148B36483F2@fuhm.net>

On May 8, 2009, at 6:58 PM, P.J. Eby wrote:

> Hell if I know. I'm just explaining (possibly incorrectly) why the
> consensus went that way last time we discussed it here... a
> consensus that I thought you were part of actually, but maybe my
> memory is faulty. (Hell, it happened so long ago that at one point
> I forgot we'd ever discussed it in the first place!)

For all I know I might've been, my memory is equally fuzzy about that
discussion. Humans are crazy beings, they can sometimes change their
mind without even realizing they've done so!

> I'm going back to the sidelines now, to rant about the good old days
> when all we had were 'str' and 'unicode' (and we liked it), and then
> yell at some teenagers to get off my lawn. ;-)

:)

James

From fumanchu at aminus.org Mon May 11 18:53:51 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Mon, 11 May 2009 09:53:51 -0700
Subject: [Web-SIG] py3k, cgi, email, and form-data
Message-ID:

There's a major change in functionality in the cgi module between Python
2 and Python 3 which I've just run across: the behavior of
FieldStorage.read_multi, specifically when an HTTP app accepts a file
upload within a multipart/form-data payload.

In Python 2, each part would be read in sequence within its own
FieldStorage instance.
This allowed file uploads to be shunted to a
TemporaryFile (via make_file) as needed:

    klass = self.FieldStorageClass or self.__class__
    part = klass(self.fp, {}, ib,
                 environ, keep_blank_values, strict_parsing)
    # Throw first part away
    while not part.done:
        headers = rfc822.Message(self.fp)
        part = klass(self.fp, headers, ib,
                     environ, keep_blank_values, strict_parsing)
        self.list.append(part)

In Python 3 (svn revision 72466), the whole request body is read into
memory first via fp.read(), and then broken into separate parts in a
second step:

    klass = self.FieldStorageClass or self.__class__
    parser = email.parser.FeedParser()
    # Create bogus content-type header for proper multipart parsing
    parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib))
    parser.feed(self.fp.read())
    full_msg = parser.close()
    # Get subparts
    msgs = full_msg.get_payload()
    for msg in msgs:
        fp = StringIO(msg.get_payload())
        part = klass(fp, msg, ib, environ, keep_blank_values,
                     strict_parsing)
        self.list.append(part)

This makes the cgi module in Python 3 somewhat crippled for handling
multipart/form-data file uploads of any significant size (and since
the client is the one determining the size, opens a server up for an
unexpected Denial of Service vector).

I *think* the FeedParser is designed to accept incremental writes,
but I haven't yet found a way to do any kind of incremental reads
from it in order to shunt the fp.read out to a tempfile again.
I'm secretly hoping Barry has a one-liner fix for this. ;)


Robert Brewer
fumanchu at aminus.org

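Independently of how the parts are parsed, the memory exposure described above can be bounded by copying the request body to a temporary file in fixed-size chunks. The sketch below is not a fix for read_multi and does not parse multipart data; `spool_body` and all of its parameters are hypothetical names chosen for illustration, and it assumes the caller already knows the Content-Length.

```python
import tempfile


def spool_body(fp, content_length, chunk_size=64 * 1024,
               max_bytes=10 * 1024 * 1024, memory_limit=1024 * 1024):
    """Copy a request body to a spooled temp file in bounded chunks.

    Hypothetical helper: refuses bodies larger than max_bytes and keeps
    at most memory_limit bytes in RAM before spilling to disk.
    """
    if content_length > max_bytes:
        raise ValueError("request body exceeds the configured limit")
    spool = tempfile.SpooledTemporaryFile(max_size=memory_limit)
    remaining = content_length
    while remaining > 0:
        chunk = fp.read(min(chunk_size, remaining))
        if not chunk:
            # The client sent less than it declared; stop reading.
            break
        spool.write(chunk)
        remaining -= len(chunk)
    spool.seek(0)
    return spool
```

The size check happens before any data is read, so an oversized upload is rejected without buffering it; the limits shown are arbitrary defaults, not recommendations.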
From graham.dumpleton at gmail.com Wed May 13 04:33:02 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 13 May 2009 12:33:02 +1000
Subject: [Web-SIG] py3k, cgi, email, and form-data
In-Reply-To:
References:
Message-ID: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com>

2009/5/12 Robert Brewer :
> There's a major change in functionality in the cgi module between Python
> 2 and Python 3 which I've just run across: the behavior of
> FieldStorage.read_multi, specifically when an HTTP app accepts a file
> upload within a multipart/form-data payload.
>
> In Python 2, each part would be read in sequence within its own
> FieldStorage instance. This allowed file uploads to be shunted to a
> TemporaryFile (via make_file) as needed:
>
>     klass = self.FieldStorageClass or self.__class__
>     part = klass(self.fp, {}, ib,
>                  environ, keep_blank_values, strict_parsing)
>     # Throw first part away
>     while not part.done:
>         headers = rfc822.Message(self.fp)
>         part = klass(self.fp, headers, ib,
>                      environ, keep_blank_values, strict_parsing)
>         self.list.append(part)
>
> In Python 3 (svn revision 72466), the whole request body is read into
> memory first via fp.read(), and then broken into separate parts in a
> second step:
>
>     klass = self.FieldStorageClass or self.__class__
>     parser = email.parser.FeedParser()
>     # Create bogus content-type header for proper multipart parsing
>     parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib))
>     parser.feed(self.fp.read())
>     full_msg = parser.close()
>     # Get subparts
>     msgs = full_msg.get_payload()
>     for msg in msgs:
>         fp = StringIO(msg.get_payload())
>         part = klass(fp, msg, ib, environ, keep_blank_values,
>                      strict_parsing)
>         self.list.append(part)
>
> This makes the cgi module in Python 3 somewhat crippled for handling
> multipart/form-data file uploads of any significant size (and since
> the client is the one determining the size, opens a server up for an
> unexpected Denial of Service vector).
>
> I *think* the FeedParser is designed to accept incremental writes,
> but I haven't yet found a way to do any kind of incremental reads
> from it in order to shunt the fp.read out to a tempfile again.
> I'm secretly hoping Barry has a one-liner fix for this. ;)

FWIW, Werkzeug gave up on the 'cgi' module for form parsing and
implements its own.

Not sure whether this issue in Python 3.0 was one of the reasons or
not. I know one of the reasons was because cgi.FieldStorage is not
WSGI 1.0 compliant. One of the main reasons that no one actually
adheres to WSGI 1.0 is because of the 'cgi' module. This still hasn't
been addressed by a proper amendment to the WSGI 1.0 specification or a
new WSGI 1.1 specification to allow a hint to readline().

The Werkzeug form processing module is properly WSGI 1.0 compliant,
meaning that Werkzeug is possibly the only major WSGI framework to be
WSGI compliant.

Graham

From fumanchu at aminus.org Wed May 13 05:43:21 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Tue, 12 May 2009 20:43:21 -0700
Subject: [Web-SIG] py3k, cgi, email, and form-data
In-Reply-To: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com>
References: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com>
Message-ID:

Graham Dumpleton wrote:
> 2009/5/12 Robert Brewer :
> > There's a major change in functionality in the cgi module between
> > Python 2 and Python 3 which I've just run across: the behavior of
> > FieldStorage.read_multi, specifically when an HTTP app accepts a file
> > upload within a multipart/form-data payload.
> >
> > In Python 2, each part would be read in sequence within its own
> > FieldStorage instance.
> > This allowed file uploads to be shunted to a
> > TemporaryFile (via make_file) as needed:
> >
> >     klass = self.FieldStorageClass or self.__class__
> >     part = klass(self.fp, {}, ib,
> >                  environ, keep_blank_values, strict_parsing)
> >     # Throw first part away
> >     while not part.done:
> >         headers = rfc822.Message(self.fp)
> >         part = klass(self.fp, headers, ib,
> >                      environ, keep_blank_values, strict_parsing)
> >         self.list.append(part)
> >
> > In Python 3 (svn revision 72466), the whole request body is read into
> > memory first via fp.read(), and then broken into separate parts in a
> > second step:
> >
> >     klass = self.FieldStorageClass or self.__class__
> >     parser = email.parser.FeedParser()
> >     # Create bogus content-type header for proper multipart parsing
> >     parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib))
> >     parser.feed(self.fp.read())
> >     full_msg = parser.close()
> >     # Get subparts
> >     msgs = full_msg.get_payload()
> >     for msg in msgs:
> >         fp = StringIO(msg.get_payload())
> >         part = klass(fp, msg, ib, environ, keep_blank_values,
> >                      strict_parsing)
> >         self.list.append(part)
> >
> > This makes the cgi module in Python 3 somewhat crippled for handling
> > multipart/form-data file uploads of any significant size (and since
> > the client is the one determining the size, opens a server up for an
> > unexpected Denial of Service vector).
> >
> > I *think* the FeedParser is designed to accept incremental writes,
> > but I haven't yet found a way to do any kind of incremental reads
> > from it in order to shunt the fp.read out to a tempfile again.
> > I'm secretly hoping Barry has a one-liner fix for this. ;)
>
> FWIW, Werkzeug gave up on the 'cgi' module for form parsing and
> implements its own.
>
> Not sure whether this issue in Python 3.0 was one of the reasons or
> not.
> I know one of the reasons was because cgi.FieldStorage is not
> WSGI 1.0 compliant. One of the main reasons that no one actually
> adheres to WSGI 1.0 is because of the 'cgi' module. This still hasn't
> been addressed by a proper amendment to the WSGI 1.0 specification or a
> new WSGI 1.1 specification to allow a hint to readline().
>
> The Werkzeug form processing module is properly WSGI 1.0 compliant,
> meaning that Werkzeug is possibly the only major WSGI framework to be
> WSGI compliant.

FWIW, I just added a replacement for the cgi module to CherryPy over the
weekend for the same reasons. It's in the python3 branch but will get
backported to CherryPy 3.2 for Python 2.x.


Robert Brewer
fumanchu at aminus.org

From daywednes at gmail.com Sat May 23 20:53:10 2009
From: daywednes at gmail.com (Minh Doan)
Date: Sat, 23 May 2009 11:53:10 -0700
Subject: [Web-SIG] web programming,
Message-ID: <9c76a0930905231153j5dc66123j355287fa1ca43d69@mail.gmail.com>

Hi,

I'm a newbie to Python and I am stuck on the following problem. I want
to download the info (price) for a trip from fromcity to tocity at a
certain time from the kayak.com website. If we do it manually, we can go
to the website, choose the appropriate info we want to get and press
SEARCH. How can I do it in Python?

I hope someone can help me deal with the problem.

Thanks
----
Minh Doan

From pstradomski at gmail.com Sat May 23 21:45:03 2009
From: pstradomski at gmail.com (Paweł Stradomski)
Date: Sat, 23 May 2009 21:45:03 +0200
Subject: [Web-SIG] web programming,
In-Reply-To: <9c76a0930905231153j5dc66123j355287fa1ca43d69@mail.gmail.com>
References: <9c76a0930905231153j5dc66123j355287fa1ca43d69@mail.gmail.com>
Message-ID: <200905232145.03701.pstradomski@gmail.com>

In a message dated Saturday, 23 May 2009, Minh Doan wrote:
> Hi,
>
> I'm a newbie to Python and I am stuck on the following problem.
> I want to download the info (price) for a trip from fromcity to tocity
> at a certain time from the kayak.com website. If we do it manually, we
> can go to the website, choose the appropriate info we want to get and
> press SEARCH. How can I do it in Python?
>

Try urllib or urllib2 and BeautifulSoup.

--
Paweł Stradomski

From omar.website at gmail.com Sun May 31 18:30:27 2009
From: omar.website at gmail.com (Omar Munk)
Date: Sun, 31 May 2009 16:30:27 -0000
Subject: [Web-SIG] Web Framework
Message-ID: <7f559f2d0905310930k607a346as3d9984c45975c642@mail.gmail.com>

Hello,

I'm Pynthon and I'm 14 years old. I come from Holland, so my English
isn't very good. I'm looking for a good Python web framework. I liked
Web2Py, but it can always be better. I don't need a full admin app
included. I just want to code in my text editor, just like with PHP.

Do you guys know a framework with:

- Good documentation.
- Not too much overkill, like Django.
- Easy and simple.
- Just something like PHP but without the dirty style.
- I like Karrigell but it looks like it's dead; do you know a clone of it?
- No need for a VPS to host it, just a server that has Python.

I know it's almost impossible, but I have searched everywhere! And is
creating your own that hard?

Thanks,

Pynthon