From armin.ronacher at active-4.com  Mon May  4 17:02:48 2009
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Mon, 4 May 2009 15:02:48 +0000 (UTC)
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
Message-ID: <loom.20090504T142950-148@post.gmane.org>

Hello everybody,

I recently started looking at supporting Python 3 in one of my libraries
(Werkzeug), mainly because the MoinMoin project, which uses the library in
question, is considering using Python 3.  Right now Werkzeug treats HTTP as
Unicode aware, in the sense that everything that carries text data is decoded
from and encoded to a known encoding.

This is partially against the specification and not entirely correct, but it
works the best on modern browsers and is also what Django and Paste are doing.

Basically, the incoming request data is .decode(encoding)d (usually utf-8)
before it is passed to the user code, and unicode data is encoded back into the
same encoding before it's sent to the server.

Now why is the current behavior of Python 3 a problem here?  The encode, decode
hack from above is obviously a solution for these kinds of applications, albeit
not a good one.  Interfaces like mod_wsgi already have the data as a bytestring;
they would decode it from latin1 just so that the application can encode it back
and decode it as utf-8.  Not only is this slow, it also means that the code
does not survive a run through 2to3.
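
A minimal sketch of the round trip described above, assuming Python 3 and a
server that hands the application a str decoded from latin1.  The sample path
and the variable names are made up for illustration:

```python
# Bytes as they arrive on the wire: a UTF-8 encoded path (assumed sample).
wire_bytes = "/caf\u00e9".encode("utf-8")

# What a str-based server would put into the environ: latin1-decoded text.
environ = {"PATH_INFO": wire_bytes.decode("latin-1")}

# What the application then has to do to get the real text back:
# undo the latin1 decoding, then decode with the encoding it expects.
path = environ["PATH_INFO"].encode("latin-1").decode("utf-8")

print(repr(environ["PATH_INFO"]))  # mojibake: one character per wire byte
print(repr(path))                  # the text the client actually sent
```

The extra encode/decode pair is pure overhead: the server already had
wire_bytes and could have passed it through unchanged.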

Now you could argue that the libraries were wrong in the first place and should
support unicode strings that were encoded from latin1 and decoded, but it seems
like very few libraries support that.

Now which strings carry data that could contain non-ascii characters from a
source with an unknown encoding?  Right now these are the following:

  * PATH_INFO
  * SCRIPT_NAME
  * QUERY_STRING
  * CONTENT_TYPE
  * HTTP_*

This also covers all headers that carry non-integer values (like
HTTP_CONTENT_TYPE and CONTENT_TYPE).  Now it's true that the headers should not
contain non-latin1 values, but reality shows that they do.  Cookies are
transmitted as headers as well, and no browser complains if you put utf-8
encoded stuff into them.  It may be the case that for the browser this looks
like latin1, but in the end the application decodes it from utf-8 and is happy.

Data sent from the application can continue to work as it does currently.
Django, Werkzeug, Paste and many others that support unicode output will just
check if the output is unicode and, if that's the case, encode it to the
desired encoding.
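
A rough sketch of that output check, assuming Python 3 (where unicode text is
str).  The helper name encode_output and the default charset are assumptions
for illustration, not code taken from any of the frameworks named above:

```python
def encode_output(chunks, charset="utf-8"):
    """Yield bytes for every chunk, encoding text chunks on the way out."""
    for chunk in chunks:
        if isinstance(chunk, str):
            # Text output: encode it to the desired encoding.
            chunk = chunk.encode(charset)
        # Byte output is passed through untouched.
        yield chunk


body = list(encode_output(["h\u00e9llo", b" raw bytes"]))
print(body)
```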

Also people abuse middlewares a lot and they deal with incoming and outgoing
data as well.  One can expect these middlewares to work on known encodings as
well so those would do the encode / decode dance too.

If anyone knows the encoding of the environ, it is the webserver.  Apparently
there are issues getting the encoding of the environ, but those won't go away
when moving that to the web application.

Because of that I propose that Python 3.1 ship a version of wsgiref that uses
bytestrings for the headers in question, and that a section on Python 3
compatibility based on that be added to PEP 333.

I volunteer for writing a new section on Python 3 in PEP 333 :-)


Regards,
Armin


From graham.dumpleton at gmail.com  Tue May  5 02:21:14 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 5 May 2009 10:21:14 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <loom.20090504T142950-148@post.gmane.org>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
Message-ID: <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>

2009/5/5 Armin Ronacher <armin.ronacher at active-4.com>:
> Hello everybody,
>
> I just recently started looking at supporting Python 3 with one of my libraries
> (Werkzeug), mainly because the MoinMoin projects considers using it which uses
> the library in question.  Right now what Werkzeug does is consider HTTP being
> Unicode aware in the sense that everything that carries text data is encoded and
> decoded into a known encoding.
>
> This is partially against the specification and not entirely correct, but it
> works the best on modern browsers and is also what Django and Paste are doing.
>
> It's basically that the incoming request data is .decode(encoding)d (usually
> utf-8) before passed to the user code and unicode data is encoded back into the
> same encoding before it's sent to the server.
>
> Now why is the current behavior of Python 3 a problem here?  The encode, decode
> hack from above is obviously a solution for these kinds of applications, albeit
> not a good one.  Interfaces like mod_wsgi already have the data as bytestring,
> would decode it from latin1 just that the application can encode it back and
> decode as utf-8.  Not only is this slow but also does this mean that the code
> does not survive a run through 2to3.
>
> Now you could argue that the libraries where wrong in the first place and should
> support unicode strings that were encoded from latin1 and decoded, but seems
> like very few libraries support that.
>
> Now which strings carry data that could contain non-ascii characters from a
> source with an unknown encoding?  Right now these are the following:
>
>   * PATH_INFO
>   * SCRIPT_NAME
>   * QUERY_STRING
>   * CONTENT_TYPE
>   * HTTP_*

Depending on the underlying web server that the WSGI adapter runs on, there
might also be:

  REQUEST_URI
  PATH_TRANSLATED (??)

Yes I know these aren't required for WSGI, except to the extent that
WSGI specification says:

  "A server or gateway should attempt to provide as many other CGI
variables as are applicable."

Would have to check CGI but there may be more.

The way I thus read this is that keys are always strings, and values will
be strings, except for a specific list of entries where values would be
bytes. Also, I presume that wsgi.url_scheme will have a string value.
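
One way to picture that reading, as a sketch only.  The set BYTES_KEYS and the
helper build_environ are hypothetical names, and the exact list of bytes-valued
keys is precisely what is still undecided in this thread:

```python
# Hypothetical list of keys whose values stay as bytes (not a settled list).
BYTES_KEYS = {
    "PATH_INFO", "SCRIPT_NAME", "QUERY_STRING", "CONTENT_TYPE",
    "REQUEST_URI", "PATH_TRANSLATED",
}


def build_environ(raw_vars, url_scheme="http"):
    """Build an environ from a dict mapping str keys to raw bytes values."""
    environ = {}
    for key, value in raw_vars.items():
        if key in BYTES_KEYS or key.startswith("HTTP_"):
            # Unknown source encoding: hand the bytes through untouched.
            environ[key] = value
        else:
            # Everything else becomes a string, interpreted as latin-1.
            environ[key] = value.decode("latin-1")
    environ["wsgi.url_scheme"] = url_scheme  # always a string
    return environ


env = build_environ({
    "PATH_INFO": b"/caf\xc3\xa9",
    "HTTP_COOKIE": b"name=\xc3\xa9",
    "REQUEST_METHOD": b"GET",
    "trac.env_path": b"/some/path",
})
print(env)
```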

Where things get difficult for me with Apache is where users can use
SetEnv or mod_rewrite to define additional key/values to be added to
the WSGI environment. For example:

  SetEnv trac.env_path /some/path

I can't see that I have any choice but to pass such settings through as
strings, else it would more than likely cause problems for applications.
The problem is that it isn't clear what encoding stuff in the Apache
configuration can be in. At the moment latin-1 is assumed.

Things though get more complicated when mod_rewrite is used, as the
values could be derived from components of the URL which are being
treated as bytes above. For example:

 RewriteCond %{THE_REQUEST} ^\ *([A-Z]+)\ *(.*)\ *(HTTP/.*)$
 RewriteRule . - [E=UNPARSED_URI:%2]

So, this creates a new UNPARSED_URI value which is the original URL as it
appeared in the request line. Strictly speaking, I can't know that this
should be bytes.

As such, I think all I can do is always pass through additional values
as strings, interpreted as latin-1. If some special case handling is
required, it would be up to the WSGI application. I am not too keen on
special configuration directives to allow the encoding, and/or whether
bytes are used, to be specified for each possible variable being set.

Anyway, this is special case stuff and, if being done, is likely going
to be special to Apache/mod_wsgi. If people want consistency, they
should just implement it as a WSGI middleware where they can, rather
than using mod_rewrite fiddles.

Now, if we are going to start using bytes for request headers, there
is the other question of response data.

The original proposal in the amendments was that the application should
provide bytes, but that the WSGI adapter must accept either bytes or
strings, with strings interpreted as latin-1.

Is there sense in being more strict in this case?

In Python 2.X some WSGI adapters only allow Python 2.X strings (ie.,
bytes) and reject unicode strings. Others will convert unicode
strings, but rather than use latin-1, apply the default Python
encoding. Thus, there is no consistency.

As to wsgi.file_wrapper, the only logical thing seems to be to require the
file object to return bytes, i.e. to be in raw mode and not in text mode.

Ultimately I am just implementing the WSGI adapter, I'll follow
whatever is decided. I am not in a position, since I don't develop
stuff that runs on it, to know what is best. So, as long as it is
clear what should be passed through as bytes for environment, ie.,
there is an all inclusive list, and don't somehow have to guess, then
am fine either way. I'd just like to see some decision and for that
decision not to be some time next year as am holding up mod_wsgi 3.0
until things have been clarified. :-(

Graham

> Also all headers that carry non integer values (like HTTP_CONTENT_TYPE and
> CONTENT_TYPE).  Now it's true that the headers should not contain non latin1
> values but reality shows that they do.  Cookies are transmitted as headers as
> well and no browser complains if you put utf-8 encoded stuff into it.  It may be
> the case that for the browser this looks like latin1, but in the end the
> application decodes it from utf-8 and is happy.
>
> Data sent from the application can continue to work like they do currently.
> However for django, Werkzeug, paste and many others that support unicode output
> will just check if the output is unicode, and if that's the case, encode to the
> desired encoding.
>
> Also people abuse middlewares a lot and they deal with incoming and outgoing
> data as well.  One can expect these middlewares to work on known encodings as
> well so those would do the encode / decode dance too.
>
> If one knows the encoding of the environ, then the webserver.  Apparently there
> are issues getting the encoding of the environ but those won't go away when
> moving that to the web application.
>
> Because of that I propose that Python 3 would ship a version of wsgiref with
> Python 3.1 that uses bytestrings for the headers in question and add a section
> on Python 3 compatibility based on that to PEP 333.
>
> I volunteer for writing a new section on Python 3 in PEP 333 :-)
>
>
> Regards,
> Armin
>

From graham.dumpleton at gmail.com  Tue May  5 12:04:02 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Tue, 5 May 2009 20:04:02 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <4A000694.9070401@active-4.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<4A000694.9070401@active-4.com>
Message-ID: <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>

2009/5/5 Armin Ronacher <armin.ronacher at active-4.com>:
> Hi,
>
> Graham Dumpleton wrote:
>> I can't see but have choice but to pass such settings through as
>> strings, else more than likely would cause problems for applications.
>> Problem is it isn't clear what encoding stuff can be in Apache
>> configuration. At the moment latin-1 is assumed.
>
> Because those information does not have a specified encoding I can see
> nothing wrong with it passing that information as bytestrings.  I would
> have no problem passing *all* values as bytestrings.

At what point does that become an inconvenience though? I guess that
is my concern, because if one has to do too many manual conversions in
an application, people will start to complain it becomes unwieldy to
use. In other words, you make it easier or more logical for
frameworks, but do you end up putting more burden on applications for
stuff outside those core values?

So, for those core CGI values which the framework is going to modify
even before an application sees them, then fine. Is the framework also
going to set the rules as to what encoding is used for other values in
the WSGI environment and convert them per that encoding when an
application requests them, or is the application always going to have
to deal with them as bytes?

As I keep saying, you guys who write the frameworks and applications
are going to know better than I, I am just challenging the notions as
a way of making people think about it so the end result is what is the
most logical thing to do. ;-)

>> In Python 2.X some WSGI adapters only allow Python 2.X strings (ie.,
>> bytes) and reject unicode strings. Others will convert unicode
>> strings, but rather than use latin-1, apply the default Python
>> encoding. Thus, there is no consistency.
>
> I think most will assert-reject unicode types and in -O just ignore them
> and fail in some way.  I haven't seen any of those doing a
> unicode->string conversion by encoding which btw is disallowed by the
> PEP anyways.

A CGI/WSGI bridge, if no explicit checks are made to disallow stuff
other than strings, will usually attempt to write to sys.stdout
whatever you give it. Thus unicode strings can be written and
presumably default encoding is applied.

>>> sys.stdout.write(u"abcd\n")
abcd

One can even write buffers.

>>> sys.stdout.write(buffer("abcd\n"))
abcd

>> Ultimately I am just implementing the WSGI adapter, I'll follow
>> whatever is decided. I am not in a position, since I don't develop
>> stuff that runs on it, to know what is best. So, as long as it is
>> clear what should be passed through as bytes for environment, ie.,
>> there is an all inclusive list, and don't somehow have to guess, then
>> am fine either way. I'd just like to see some decision and for that
>> decision not to be some time next year as am holding up mod_wsgi 3.0
>> until things have been clarified. :-(
>
> I hope we can find a solution for that before the Python 3.1 release,
> otherwise there is another wsgiref release with the current behavior
> which is just wrong.

We can hope, but I'm not holding my breath.

It is going to be rather stupid though if what ends up being the
standard is dictated by how wsgiref works in 3.1 as is.

Graham

From fumanchu at aminus.org  Tue May  5 16:55:51 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Tue, 5 May 2009 07:55:51 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>	<loom.20090504T142950-148@post.gmane.org>	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>	<4A000694.9070401@active-4.com>
	<88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>
Message-ID: <F1962646D3B64642B7C9A06068EE1E6418B3B4@ex10.hostedexchange.local>

Graham Dumpleton wrote:
> 2009/5/5 Armin Ronacher <armin.ronacher at active-4.com>:
>> Graham Dumpleton wrote:
>>> I can't see but have choice but to pass such settings through as
>>> strings, else more than likely would cause problems for applications.
>>> Problem is it isn't clear what encoding stuff can be in Apache
>>> configuration. At the moment latin-1 is assumed.
>> Because those information does not have a specified encoding I can see
>> nothing wrong with it passing that information as bytestrings.  I would
>> have no problem passing *all* values as bytestrings.
> 
> At what point does that become an inconvenience though? I guess that
> is my concern, because if one has to do too many manual conversions in
> an application, people will start to complain it becomes unwieldy to
> use. In other words, you make it easier or more logical for
> frameworks, but do you end up putting more burden on applications for
> stuff outside those core values.
> 
> So, for those core CGI values which the framework is going to modify
> even before an application sees them, then fine. Is the framework also
> going to set the rules as to what encoding is used for other values in
> the WSGI environment and convert them per that encoding when an
> application requests them, or is the application always going to have
> to deal with them as bytes?
> 
> As I keep saying, you guys who write the frameworks and applications
> are going to know better than I, I am just challenging the notions as
> a way of making people think about it so the end result is what is the
> most logical thing to do. ;-)

In short: it's pretty easy for a framework to default to utf-8 for 
everything, yet give application developers ways to override that. See, 
for example, the cherrypy.tools.encoding Tool in our python3 
branch--it's moved from running "sometime" after the page handler, to 
wrapping the page handler so all page handlers emit bytes. That makes it 
possible for everyone to use unicode strings everywhere, yet still allow 
some to specify exact bytes as necessary. In shorter: don't worry about 
that part, we've got it covered. ;)


Robert Brewer
fumanchu at aminus.org



From ianb at colorstudy.com  Tue May  5 19:01:07 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 5 May 2009 12:01:07 -0500
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <F1962646D3B64642B7C9A06068EE1E6418B3B4@ex10.hostedexchange.local>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> 
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> 
	<4A000694.9070401@active-4.com>
	<88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com> 
	<F1962646D3B64642B7C9A06068EE1E6418B3B4@ex10.hostedexchange.local>
Message-ID: <b654cd2e0905051001ja5c8c9dpa6983e48e87c3ef8@mail.gmail.com>

Philip Jenvey brought this to my attention:

  http://www.python.org/dev/peps/pep-0383/

It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such
a way that you can decode to get the original bytes object, and thus
transcode to another encoding.  It's intended for cases exactly like WSGI.

-- 
Ian Bicking  |  http://blog.ianbicking.org

From graham.dumpleton at gmail.com  Wed May  6 05:14:04 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 6 May 2009 13:14:04 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <b654cd2e0905051001ja5c8c9dpa6983e48e87c3ef8@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<4A000694.9070401@active-4.com>
	<88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3B4@ex10.hostedexchange.local>
	<b654cd2e0905051001ja5c8c9dpa6983e48e87c3ef8@mail.gmail.com>
Message-ID: <88e286470905052014h342e58c7m58dc655a5b4be543@mail.gmail.com>

2009/5/6 Ian Bicking <ianb at colorstudy.com>:
> Philip Jenvey brought this to my attention:
>
>   http://www.python.org/dev/peps/pep-0383/
>
> It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such
> a way that you can decode to get the original bytes object, and thus
> transcode to another encoding.  It's intended for cases exactly like WSGI.

Care to explain then how that would in practice be used while I try
and reread it a few times to try and understand it myself? :-)

Graham

From ianb at colorstudy.com  Wed May  6 05:27:17 2009
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 5 May 2009 22:27:17 -0500
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905052014h342e58c7m58dc655a5b4be543@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com> 
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com> 
	<4A000694.9070401@active-4.com>
	<88e286470905050304l4483ec52n9085fbce6b6912cf@mail.gmail.com> 
	<F1962646D3B64642B7C9A06068EE1E6418B3B4@ex10.hostedexchange.local> 
	<b654cd2e0905051001ja5c8c9dpa6983e48e87c3ef8@mail.gmail.com> 
	<88e286470905052014h342e58c7m58dc655a5b4be543@mail.gmail.com>
Message-ID: <b654cd2e0905052027r680fb16s551e6bb9159ec9b1@mail.gmail.com>

On Tue, May 5, 2009 at 10:14 PM, Graham Dumpleton <
graham.dumpleton at gmail.com> wrote:

> 2009/5/6 Ian Bicking <ianb at colorstudy.com>:
> > Philip Jenvey brought this to my attention:
> >
> >   http://www.python.org/dev/peps/pep-0383/
> >
> > It's a UTF8 encoding and decoding scheme that encodes illegal bytes in such
> > a way that you can decode to get the original bytes object, and thus
> > transcode to another encoding.  It's intended for cases exactly like WSGI.
>
> Care to explain then how that would in practice be used while I try
> and reread it a few times to try and understand it myself? :-)
>

I don't particularly know, except I think you'd do things like:

    environ['PATH_INFO'] = urllib.unquote(http_byte_path).decode(
        'utf8', 'python-escape')

Then if the encoding was wrong, you could transcode like:

    environ['PATH_INFO'] = environ['PATH_INFO'].encode(
        'utf8', 'python-escape').decode('latin1', 'python-escape')

Note that you need to know the encoding that was used (utf8 in this case)
and that python-escape was used.  It has been suggested that the server
should put the encoding it used into the environment.  When transcoding this
should also be updated.

It's not clear what python-escape is going to do, I don't think that's been
determined.  Probably it'll put \x00 or something in the unicode string to
mark raw bytes.
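
For comparison, a sketch of the same idea with the PEP 383 error handler as it
is available in Python 3.1 and later, where it is spelled 'surrogateescape' and
maps each undecodable byte to a lone surrogate code point in the range U+DC80
to U+DCFF.  The sample request path is an assumption for illustration:

```python
from urllib.parse import unquote_to_bytes

# Assumed sample: a percent-encoded path that is really latin1, not UTF-8.
http_byte_path = b"/caf%E9"
path_bytes = unquote_to_bytes(http_byte_path)

# The server guesses UTF-8; the invalid byte 0xE9 is smuggled through
# as the lone surrogate U+DCE9 instead of raising UnicodeDecodeError.
path_info = path_bytes.decode("utf-8", "surrogateescape")

# The application knows better: recover the original bytes, then
# decode them with the encoding that was actually used.
recovered = path_info.encode("utf-8", "surrogateescape")
transcoded = recovered.decode("latin-1")

print(repr(path_info))
print(repr(transcoded))
```

As in the snippets above, the transcoding step only works if the application
knows which encoding and which error handler the server used.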

-- 
Ian Bicking  |  http://blog.ianbicking.org

From graham.dumpleton at gmail.com  Fri May  8 13:34:51 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Fri, 8 May 2009 21:34:51 +1000
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
Message-ID: <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>

2009/5/5 Graham Dumpleton <graham.dumpleton at gmail.com>
>>> Now, if we are going to start using bytes for request headers, there
>>> is the other question of response data.
>>>
>>> The original proposal in amendments was that application should
>>> provide bytes, but that WSGI adapter must accept either bytes or
>>> strings, with strings interpreted as latin-1.
>>>
>>> Is there sense in being more strict in this case?
>>>
>>> In Python 2.X some WSGI adapters only allow Python 2.X strings (ie.,
>>> bytes) and reject unicode strings. Others will convert unicode
>>> strings, but rather than use latin-1, apply the default Python
>>> encoding. Thus, there is no consistency.
>>
>> I think most will assert-reject unicode types and in -O just ignore them
>> and fail in some way.  I haven't seen any of those doing a
>> unicode->string conversion by encoding which btw is disallowed by the
>> PEP anyways.
>
> A CGI/WSGI bridge, if no explicit checks are made to disallow stuff
> other than strings, will usually attempt to write to sys.stdout
> whatever you give it. Thus unicode strings can be written and
> presumably default encoding is applied.
>
> >>> sys.stdout.write(u"abcd\n")
> abcd
>
> One can even write buffers.
>
> >>> sys.stdout.write(buffer("abcd\n"))
> abcd

Robert, do you have any comments on the restricting of response
content to bytes and not allow fallback to conversion per latin-1?

I heard that in CherryPy WSGI server you are only allowing bytes. What
is your rationale for that at the moment?

Graham

From fumanchu at aminus.org  Fri May  8 17:07:13 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Fri, 8 May 2009 08:07:13 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>	
	<loom.20090504T142950-148@post.gmane.org>	
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
Message-ID: <F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>

Graham Dumpleton wrote:
> Robert, do you have any comments on the restricting of response
> content to bytes and not allow fallback to conversion per latin-1?
> 
> I heard that in CherryPy WSGI server you are only allowing bytes. What
> is your rationale for that at the moment?


In Python 2.x, one could easily mix unicode strings and byte strings in
the same interface, because they mostly supported the same operations.
Not so in Python 3.x--byte strings are missing everything from
capitalize() to zfill() [1]. I feel that choosing one type or the other
is required in order to avoid mountains of if-statements in middleware
(and lots of 'pass' statements if bytes are found).

I decided that that single type should be byte strings because I want
WSGI middleware and applications to be able to choose what encoding
their output is. Passing unicode to the server would require some
out-of-band method of telling the server which encoding to use per
response, which seemed unacceptable.

The down side, already alluded to, is that middleware cannot then call
e.g. response.capitalize() or any of a number of other methods without
first decoding the response. And it cannot do that reliably unless
(again) the encoding which was used to produce bytes is communicated
down the stack out of band.

The python3 branch of CherryPy is by no means complete. I'd be happy to
explore emitting unicode if we could decide on a method whereby apps
could inform the server which encoding they want. Middleware which
transcoded the response would need a means of overriding that. But of
course, that opens a whole new can of worms if something goes wrong,
because application authors want control over the error response; if the
server is encoding the response, and an error occurs, there would have
to be a way to pass control back up the stack to...what? whichever
component last set the encoding? That road starts to get complicated
very quickly.

If some middleware needs to treat the response as unicode, I'd rather
emit bytes and somehow return the encoding as part of the response.
Perhaps WSGI 2's mythical "return (status, headers, body-iterable,
encoding)". Middleware could then decode/transcode as desired. I can't
think of a downside to that, other than some lost cycles spent
de/encoding, but perhaps there are some I don't yet foresee.
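
A purely hypothetical sketch of that idea.  No such WSGI 2 interface exists;
the four-tuple return value, the single-argument application signature and the
names app and upper_middleware are all assumptions made up for illustration:

```python
def app(environ):
    """Hypothetical app: emits bytes and says which encoding produced them."""
    body = ["h\u00e9llo w\u00f6rld".encode("utf-8")]
    # A real component would also keep any charset parameter in sync.
    headers = [("Content-Type", "text/plain")]
    return "200 OK", headers, body, "utf-8"


def upper_middleware(wrapped_app, out_encoding="latin-1"):
    """Hypothetical middleware that needs text: decode, modify, re-encode."""
    def middleware(environ):
        status, headers, body, encoding = wrapped_app(environ)
        # Join first so multi-byte sequences split across chunks decode safely.
        text = b"".join(body).decode(encoding)
        new_body = [text.upper().encode(out_encoding)]
        # Report the new encoding to whatever sits further down the stack.
        return status, headers, new_body, out_encoding
    return middleware


status, headers, body, encoding = upper_middleware(app)({})
print(status, encoding, body)
```

The cost is the decode/encode pair inside the middleware, which matches the
"lost cycles" mentioned above.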


Robert Brewer
fumanchu at aminus.org

[1] See http://docs.python.org/dev/py3k/library/stdtypes.html#string-methods

From pje at telecommunity.com  Fri May  8 17:58:28 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 11:58:28 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange
	.local>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
Message-ID: <20090508155551.662113A4109@sparrow.telecommunity.com>

At 08:07 AM 5/8/2009 -0700, Robert Brewer wrote:
>I decided that that single type should be byte strings because I want
>WSGI middleware and applications to be able to choose what encoding
>their output is. Passing unicode to the server would require some
>out-of-band method of telling the server which encoding to use per
>response, which seemed unacceptable.

I find the above baffling, since PEP 333 explicitly states that when 
using unicode types, they're not actually supposed to *be* unicode 
--  they're just bytes decoded with latin-1.

So, the server doesn't need to know "what encoding to use" -- it's 
latin-1, plain and simple.  (And it's an error for an application to 
produce a unicode string that can't be encoded as latin-1.)

To be even more specific: an application that produces strings can 
"choose what encoding to use" by encoding in it, then decoding those 
bytes via latin-1.  (This is more or less what Jython and IronPython 
users are doing already, I believe.)


From fumanchu at aminus.org  Fri May  8 19:37:10 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Fri, 8 May 2009 10:37:10 -0700
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <20090508155551.662113A4109@sparrow.telecommunity.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
	<20090508155551.662113A4109@sparrow.telecommunity.com>
Message-ID: <F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>

P.J. Eby wrote:
> At 08:07 AM 5/8/2009 -0700, Robert Brewer wrote:
>> I decided that that single type should be byte strings because I want
>> WSGI middleware and applications to be able to choose what encoding
>> their output is. Passing unicode to the server would require some
>> out-of-band method of telling the server which encoding to use per
>> response, which seemed unacceptable.
> 
> I find the above baffling, since PEP 333 explicitly states that
> when using unicode types, they're not actually supposed to *be*
> unicode -- they're just bytes decoded with latin-1.

It also explicitly states that "HTTP does not directly support Unicode,
and neither does this interface. All encoding/decoding must be handled
by the application; all strings passed to or from the server must be
standard Python BYTE STRINGS (emphasis mine), not Unicode objects. The
result of using a Unicode object where a string object is required, is
undefined."

PEP 333 is difficult to interpret because it uses the name "str"
synonymously with the concept "byte string", which Python 3000 defies. I
believe the intent was to differentiate unicode from bytes, not elevate
whatever type happens to be called "str" on your Python du jour. It was
and is a mistake to standardize on type names ("str") across platforms
and not on type behavior ("byte string").

If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're
effectively saying the server will always call
"chunk.encode('latin-1')". That negates any benefit of using unicode as
the type for the response. That's not "supporting unicode"; that's using
unicode exactly as if it were an opaque byte string. That seems silly
to me when there is a perfectly useful byte string type.

> So, the server doesn't need to know "what encoding to use" -- it's
> latin-1, plain and simple.  (And it's an error for an application to
> produce a unicode string that can't be encoded as latin-1.)
>
> To be even more specific: an application that produces strings can
> "choose what encoding to use" by encoding in it, then decoding those
> bytes via latin-1.  (This is more or less what Jython and IronPython
> users are doing already, I believe.)

That may make sense for Jython and IronPython if they truly do not have
a usable byte string type. But it doesn't make as much sense for Python3
which has a usable byte string type. My way:

    App                                Server
    ---                                ------
    bchunk = uchunk.encode('utf-8')
    yield bchunk
                                       write(bchunk)

Your way:

    App                                Server
    ---                                ------
    bchunk = uchunk.encode('utf-8')
    uchunk = bchunk.decode('latin-1')
    yield uchunk
                                       bchunk = uchunk.encode('latin-1')
                                       write(bchunk)

I don't see any benefit to that.
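
As a rough illustration of the comparison above, the following sketch runs both paths and checks that they put identical bytes on the wire. The function names (app_bytes, app_latin1_str, server_write) are hypothetical and exist only for this example; they are not part of WSGI or of any server.

```python
def app_bytes(uchunk):
    # "My way": the application encodes once and yields bytes.
    return uchunk.encode('utf-8')


def app_latin1_str(uchunk):
    # "Your way": encode to the real charset, then disguise the bytes
    # as a str by decoding them via latin-1 (one code point per byte).
    bchunk = uchunk.encode('utf-8')
    return bchunk.decode('latin-1')


def server_write(chunk):
    # Hypothetical server side: str chunks are turned back into bytes
    # with latin-1, bytes chunks are passed through untouched.
    if isinstance(chunk, str):
        chunk = chunk.encode('latin-1')
    return chunk


text = 'caf\u00e9 \u20ac'
direct = server_write(app_bytes(text))
roundtrip = server_write(app_latin1_str(text))
```

Both variables end up holding the same UTF-8 bytes; the second path simply adds a decode and an encode that cancel each other out.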


Robert Brewer
fumanchu at aminus.org

From foom at fuhm.net  Fri May  8 20:39:53 2009
From: foom at fuhm.net (James Y Knight)
Date: Fri, 8 May 2009 14:39:53 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
	<20090508155551.662113A4109@sparrow.telecommunity.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>
Message-ID: <A4675490-C6FF-4332-9159-D1557328D069@fuhm.net>

On May 8, 2009, at 1:37 PM, Robert Brewer wrote:
> If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're
> effectively saying the server will always call
> "chunk.encode('latin-1')". That negates any benefit of using unicode as
> the type for the response. That's not "supporting unicode"; that's using
> unicode exactly as if it were an opaque byte string. That seems silly
> to me when there is a perfectly useful byte string type.

Agreed. Accepting py3k "str" and always encoding in latin-1 is  
basically just undoing the separation of unicode&byte-strings that was  
one of Py3k's major design goals.

Probably nothing in WSGI should be allowed to be given as either
bytestring or character string. The spec should choose one
or the other for each circumstance. And for body content it's clear  
that the only sane thing is a bytestring.

From pje at telecommunity.com  Sat May  9 00:00:47 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 18:00:47 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange
	.local>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
	<20090508155551.662113A4109@sparrow.telecommunity.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>
Message-ID: <20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>

At 10:37 AM 5/8/2009 -0700, Robert Brewer wrote:
>It also explicitly states that "HTTP does not directly support Unicode,
>and neither does this interface. All encoding/decoding must be handled
>by the application; all strings passed to or from the server must be
>standard Python BYTE STRINGS (emphasis mine), not Unicode objects. The
>result of using a Unicode object where a string object is required, is
>undefined."

It also says what the interpretation is when 'str' is a unicode string type.

>PEP 333 is difficult to interpret because it uses the name "str"
>synonymously with the concept "byte string", which Python 3000 defies. I
>believe the intent was to differentiate unicode from bytes, not elevate
>whatever type happens to be called "str" on your Python du jour. It was
>and is a mistake to standardize on type names ("str") across platforms
>and not on type behavior ("byte string").

Ironically, 'str' is what's consistent in type behavior; the bytes 
type doesn't supply the same operations.


>If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're
>effectively saying the server will always call
>"chunk.encode('latin-1')". That negates any benefit of using unicode as
>the type for the response. That's not "supporting unicode"; that's using
>unicode exactly as if it were an opaque byte string. That seems silly
>to me when there is a perfectly useful byte string type.

Compatibility sometimes demands we do silly things.  Personally, I 
think it's kind of silly that Python 3 files return incompatible data 
types depending on what mode you open them in, but there's not a 
whole lot we can do about that.

Meanwhile, existing WSGI code ported to Python 3 is going to yield 
strings until/unless manually converted; AFAIK 2to3 has no way to 
automatically detect WSGI-ness and convert your strings to bytes.


>I don't see any benefit to that.

There isn't any benefit to doing it by *hand*.  However, backward 
compatibility demands that servers *accept* such strings, as they may 
be generated by legacy apps.

That's why the Python 3 WSGI amendments say servers MUST accept this, 
even though applications SHOULD supply bytes.

That is, for new code, we do want bytes.  What we don't want, ever, 
is unicode characters above #255 in any unicode strings sent as part 
of the response body.
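
A minimal sketch of what that rule could look like on the server side, assuming a hypothetical helper named to_wire_bytes (the name and the choice of ValueError are illustrative, not taken from the amendments):

```python
def to_wire_bytes(chunk):
    """Normalize one response-body chunk to bytes (illustrative only)."""
    if isinstance(chunk, bytes):
        # Preferred form for new code: already bytes, nothing to do.
        return chunk
    if isinstance(chunk, str):
        try:
            # Legacy form: each character stands for exactly one byte.
            return chunk.encode('latin-1')
        except UnicodeEncodeError as exc:
            # A character above #255 cannot be mapped to a single byte.
            raise ValueError('body str contains a character above #255') from exc
    raise TypeError('body chunks must be bytes or str')
```

Under this sketch a legacy app that yields 'caf\u00e9' still works, while one that yields '\u20ac' fails loudly instead of being silently re-encoded.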


From pje at telecommunity.com  Sat May  9 00:02:52 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 18:02:52 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <A4675490-C6FF-4332-9159-D1557328D069@fuhm.net>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
	<20090508155551.662113A4109@sparrow.telecommunity.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>
	<A4675490-C6FF-4332-9159-D1557328D069@fuhm.net>
Message-ID: <20090508220014.CE93F3A40A5@sparrow.telecommunity.com>

At 02:39 PM 5/8/2009 -0400, James Y Knight wrote:
>On May 8, 2009, at 1:37 PM, Robert Brewer wrote:
>>If Python3 WSGI apps emit unicode strings (py3k type 'str'), you're
>>effectively saying the server will always call
>>"chunk.encode('latin-1')". That negates any benefit of using unicode as
>>the type for the response. That's not "supporting unicode"; that's using
>>unicode exactly as if it were an opaque byte string. That seems silly
>>to me when there is a perfectly useful byte string type.
>
>Agreed. Accepting py3k "str" and always encoding in latin-1 is
>basically just undoing the separation of unicode&byte-strings that was
>one of Py3k's major design goals.
>
>Probably nothing in WSGI should be allowed to be given as either
>bytestring or character string. The spec should choose one
>or the other for each circumstance. And for body content it's clear
>that the only sane thing is a bytestring.

With the amendments as written (and previously discussed here), 
accepting latin-1 (or ASCII-only) strings allows backward 
compatibility with code converted via 2to3.  Otherwise, you would 
have to track down every string-returning function in your program 
that *might* be used to generate a response or a yielded portion thereof.


From foom at fuhm.net  Sat May  9 00:05:38 2009
From: foom at fuhm.net (James Y Knight)
Date: Fri, 8 May 2009 18:05:38 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
	<20090508155551.662113A4109@sparrow.telecommunity.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>
	<20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
Message-ID: <ED50C155-116B-4FCB-B7DD-DFE791B4C22A@fuhm.net>

On May 8, 2009, at 6:00 PM, P.J. Eby wrote:
> Compatibility sometimes demands we do silly things.  Personally, I  
> think it's kind of silly that Python 3 files return incompatible  
> data types depending on what mode you open them in, but there's not  
> a whole lot we can do about that.
>
> Meanwhile, existing WSGI code ported to Python 3 is going to yield  
> strings until/unless manually converted; AFAIK 2to3 has no way to  
> automatically detect WSGI-ness and convert your strings to bytes.

Yes, 2to3 doesn't work for any non-trivial app... You have this same  
exact issue with straight-up sockets! Why should WSGI be the
odd-man-out here and accept strings when you should've passed a bytestring,
when nothing else in python 3 does that, and has the exact same  
backwards-compat problems?
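
For comparison, here is a small sketch of the socket behaviour referred to above. It uses socket.socketpair() so that nothing leaves the local machine; the helper name try_send is made up for the example.

```python
import socket


def try_send(payload):
    # Send payload over a local socket pair and return what arrives,
    # or the TypeError raised when the payload is not bytes-like.
    left, right = socket.socketpair()
    try:
        left.sendall(payload)
        return right.recv(1024)
    except TypeError as exc:
        return exc
    finally:
        left.close()
        right.close()


bytes_result = try_send(b'hello')   # accepted: bytes go straight through
str_result = try_send('hello')      # rejected: Python 3 sockets refuse str
```

In Python 3 the second call yields a TypeError rather than an implicit encode, which is the strictness being contrasted with the proposed WSGI behaviour.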

James

From pje at telecommunity.com  Sat May  9 00:58:19 2009
From: pje at telecommunity.com (P.J. Eby)
Date: Fri, 08 May 2009 18:58:19 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <ED50C155-116B-4FCB-B7DD-DFE791B4C22A@fuhm.net>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
	<20090508155551.662113A4109@sparrow.telecommunity.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>
	<20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
	<ED50C155-116B-4FCB-B7DD-DFE791B4C22A@fuhm.net>
Message-ID: <20090508225543.B194E3A40A5@sparrow.telecommunity.com>

At 06:05 PM 5/8/2009 -0400, James Y Knight wrote:
>On May 8, 2009, at 6:00 PM, P.J. Eby wrote:
>>Compatibility sometimes demands we do silly things.  Personally, I
>>think it's kind of silly that Python 3 files return incompatible
>>data types depending on what mode you open them in, but there's not
>>a whole lot we can do about that.
>>
>>Meanwhile, existing WSGI code ported to Python 3 is going to yield
>>strings until/unless manually converted; AFAIK 2to3 has no way to
>>automatically detect WSGI-ness and convert your strings to bytes.
>
>Yes, 2to3 doesn't work for any non-trivial app... You have this same
>exact issue with straight-up sockets! Why should WSGI be the
>odd-man-out here and accept strings when you should've passed a bytestring,
>when nothing else in python 3 does that, and has the exact same
>backwards-compat problems?

Hell if I know.  I'm just explaining (possibly incorrectly) why the 
consensus went that way last time we discussed it here...  a 
consensus that I thought you were part of actually, but maybe my 
memory is faulty.  (Hell, it happened so long ago that at one point I 
forgot we'd ever discussed it in the first place!)

I'm going back to the sidelines now, to rant about the good old days 
when all we had were 'str' and 'unicode' (and we liked it), and then 
yell at some teenagers to get off my lawn.  ;-)


From foom at fuhm.net  Sat May  9 00:59:56 2009
From: foom at fuhm.net (James Y Knight)
Date: Fri, 8 May 2009 18:59:56 -0400
Subject: [Web-SIG] Python 3.0 and WSGI 1.0.
In-Reply-To: <20090508225543.B194E3A40A5@sparrow.telecommunity.com>
References: <88e286470904010329r5222c37bl73ab5dd234ac29de@mail.gmail.com>
	<loom.20090504T142950-148@post.gmane.org>
	<88e286470905041721s75a80dc1xf8df4da293449b75@mail.gmail.com>
	<88e286470905080434m60f33ad6v1abb21a55d3d303f@mail.gmail.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3C5@ex10.hostedexchange.local>
	<20090508155551.662113A4109@sparrow.telecommunity.com>
	<F1962646D3B64642B7C9A06068EE1E6418B3D0@ex10.hostedexchange.local>
	<20090508215809.E6C5B3A40A5@sparrow.telecommunity.com>
	<ED50C155-116B-4FCB-B7DD-DFE791B4C22A@fuhm.net>
	<20090508225543.B194E3A40A5@sparrow.telecommunity.com>
Message-ID: <399BEDB0-D28C-49BB-BAE3-7148B36483F2@fuhm.net>

On May 8, 2009, at 6:58 PM, P.J. Eby wrote:
> Hell if I know.  I'm just explaining (possibly incorrectly) why the  
> consensus went that way last time we discussed it here...  a  
> consensus that I thought you were part of actually, but maybe my  
> memory is faulty.  (Hell, it happened so long ago that at one point  
> I forgot we'd ever discussed it in the first place!)

For all I know I might've been, my memory is equally fuzzy about that  
discussion. Humans are crazy beings, they can sometimes change their  
mind without even realizing they've done so!

> I'm going back to the sidelines now, to rant about the good old days  
> when all we had were 'str' and 'unicode' (and we liked it), and then  
> yell at some teenagers to get off my lawn.  ;-)

:)

James

From fumanchu at aminus.org  Mon May 11 18:53:51 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Mon, 11 May 2009 09:53:51 -0700
Subject: [Web-SIG] py3k, cgi, email, and form-data
Message-ID: <F1962646D3B64642B7C9A06068EE1E6418B3DA@ex10.hostedexchange.local>

There's a major change in functionality in the cgi module between Python
2 and Python 3 which I've just run across: the behavior of
FieldStorage.read_multi, specifically when an HTTP app accepts a file
upload within a multipart/form-data payload.

In Python 2, each part would be read in sequence within its own
FieldStorage instance. This allowed file uploads to be shunted to a
TemporaryFile (via make_file) as needed:

    klass = self.FieldStorageClass or self.__class__
    part = klass(self.fp, {}, ib,
                 environ, keep_blank_values, strict_parsing)
    # Throw first part away
    while not part.done:
        headers = rfc822.Message(self.fp)
        part = klass(self.fp, headers, ib,
                     environ, keep_blank_values, strict_parsing)
        self.list.append(part)

In Python 3 (svn revision 72466), the whole request body is read into
memory first via fp.read(), and then broken into separate parts in a
second step:

    klass = self.FieldStorageClass or self.__class__
    parser = email.parser.FeedParser()
    # Create bogus content-type header for proper multipart parsing
    parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib))
    parser.feed(self.fp.read())
    full_msg = parser.close()
    # Get subparts
    msgs = full_msg.get_payload()
    for msg in msgs:
        fp = StringIO(msg.get_payload())
        part = klass(fp, msg, ib, environ, keep_blank_values,
                     strict_parsing)
        self.list.append(part)

This makes the cgi module in Python 3 somewhat crippled for handling
multipart/form-data file uploads of any significant size (and since
the client is the one determining the size, opens a server up for an
unexpected Denial of Service vector).
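
One way to bound that exposure, sketched here under stated assumptions rather than as what cgi does: copy the body in fixed-size chunks into a tempfile.SpooledTemporaryFile and abort once a configured limit is exceeded. The names spool_body and BodyTooLarge and the default sizes are hypothetical.

```python
import io
import tempfile


class BodyTooLarge(Exception):
    """Raised when the request body exceeds the configured limit."""


def spool_body(fp, max_bytes, chunk_size=64 * 1024, memory_limit=1024 * 1024):
    # Data stays in memory up to memory_limit, then rolls over to disk.
    spool = tempfile.SpooledTemporaryFile(max_size=memory_limit)
    total = 0
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            # Stop reading as soon as the client sends more than allowed.
            spool.close()
            raise BodyTooLarge('body exceeds %d bytes' % max_bytes)
        spool.write(chunk)
    spool.seek(0)
    return spool


upload = spool_body(io.BytesIO(b'x' * 1000), max_bytes=2000, chunk_size=100)
```

The point of the design is that memory use is capped by chunk_size and memory_limit rather than by whatever size the client chooses to send.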

I *think* the FeedParser is designed to accept incremental writes,
but I haven't yet found a way to do any kind of incremental reads
from it in order to shunt the fp.read out to a tempfile again.
I'm secretly hoping Barry has a one-liner fix for this. ;)
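
To make the "incremental writes" half of that concrete, here is a sketch that feeds a FeedParser in small chunks. It only shows that feed() can be called repeatedly; the parsed parts still become available only from close(), so the whole body is held in memory and the tempfile problem is not solved. The helper name parse_in_chunks, the boundary XYZ and the sample form data are invented for the example.

```python
import io
from email.parser import FeedParser


def parse_in_chunks(fp, content_type, chunk_size=8192):
    parser = FeedParser()
    # Same trick as the cgi code: a synthetic header so the parser
    # treats what follows as a multipart message.
    parser.feed('Content-Type: %s\r\n\r\n' % content_type)
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)      # incremental write
    return parser.close()       # the message tree only exists here


body = (
    '--XYZ\r\n'
    'Content-Disposition: form-data; name="title"\r\n'
    '\r\n'
    'hello\r\n'
    '--XYZ\r\n'
    'Content-Disposition: form-data; name="upload"; filename="a.txt"\r\n'
    'Content-Type: text/plain\r\n'
    '\r\n'
    'file contents\r\n'
    '--XYZ--\r\n'
)
msg = parse_in_chunks(io.StringIO(body),
                      'multipart/form-data; boundary=XYZ', chunk_size=16)
parts = msg.get_payload()
names = [p.get_param('name', header='content-disposition') for p in parts]
```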


Robert Brewer
fumanchu at aminus.org

From graham.dumpleton at gmail.com  Wed May 13 04:33:02 2009
From: graham.dumpleton at gmail.com (Graham Dumpleton)
Date: Wed, 13 May 2009 12:33:02 +1000
Subject: [Web-SIG] py3k, cgi, email, and form-data
In-Reply-To: <F1962646D3B64642B7C9A06068EE1E6418B3DA@ex10.hostedexchange.local>
References: <AcnSWQ/GR3W2RBf3RAKfzKnEHXpWuQ==>
	<F1962646D3B64642B7C9A06068EE1E6418B3DA@ex10.hostedexchange.local>
Message-ID: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com>

2009/5/12 Robert Brewer <fumanchu at aminus.org>:
> There's a major change in functionality in the cgi module between Python
> 2 and Python 3 which I've just run across: the behavior of
> FieldStorage.read_multi, specifically when an HTTP app accepts a file
> upload within a multipart/form-data payload.
>
> In Python 2, each part would be read in sequence within its own
> FieldStorage instance. This allowed file uploads to be shunted to a
> TemporaryFile (via make_file) as needed:
>
>     klass = self.FieldStorageClass or self.__class__
>     part = klass(self.fp, {}, ib,
>                  environ, keep_blank_values, strict_parsing)
>     # Throw first part away
>     while not part.done:
>         headers = rfc822.Message(self.fp)
>         part = klass(self.fp, headers, ib,
>                      environ, keep_blank_values, strict_parsing)
>         self.list.append(part)
>
> In Python 3 (svn revision 72466), the whole request body is read into
> memory first via fp.read(), and then broken into separate parts in a
> second step:
>
>     klass = self.FieldStorageClass or self.__class__
>     parser = email.parser.FeedParser()
>     # Create bogus content-type header for proper multipart parsing
>     parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib))
>     parser.feed(self.fp.read())
>     full_msg = parser.close()
>     # Get subparts
>     msgs = full_msg.get_payload()
>     for msg in msgs:
>         fp = StringIO(msg.get_payload())
>         part = klass(fp, msg, ib, environ, keep_blank_values,
>                      strict_parsing)
>         self.list.append(part)
>
> This makes the cgi module in Python 3 somewhat crippled for handling
> multipart/form-data file uploads of any significant size (and since
> the client is the one determining the size, opens a server up for an
> unexpected Denial of Service vector).
>
> I *think* the FeedParser is designed to accept incremental writes,
> but I haven't yet found a way to do any kind of incremental reads
> from it in order to shunt the fp.read out to a tempfile again.
> I'm secretly hoping Barry has a one-liner fix for this. ;)

FWIW, Werkzeug gave up on 'cgi' module for form parsing and implements its own.

Not sure whether this issue in Python 3.0 was one of the reasons or
not. I know one of the reasons was because cgi.FieldStorage is not
WSGI 1.0 compliant. One of the main reasons that no one actually
adheres to WSGI 1.0 is because of the 'cgi' module. This still hasn't
been addressed by a proper amendment to WSGI 1.0 specification or a
new WSGI 1.1 specification to allow a hint to readline().
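
The readline() point can be sketched as follows. PEP 333 states that the optional size argument to readline() on wsgi.input is not supported, so a server may legitimately expose an input object like the hypothetical StrictWsgiInput below; any caller that passes a size hint then fails. All names here are made up for illustration.

```python
import io


class StrictWsgiInput:
    """Input stream offering only what WSGI 1.0 requires of readline()."""

    def __init__(self, stream):
        self._stream = stream

    def read(self, size):
        return self._stream.read(size)

    def readline(self):
        # No size parameter: WSGI 1.0 does not require servers to accept one.
        return self._stream.readline()


def read_line_with_hint(stream, limit=1 << 16):
    # Style of call that needs the hint to bound memory use per line.
    return stream.readline(limit)


def read_line_portably(stream):
    # Only form guaranteed to work against any WSGI 1.0 server.
    return stream.readline()


strict = StrictWsgiInput(io.BytesIO(b'a=1\r\nb=2\r\n'))
first_line = read_line_portably(strict)
```

Calling read_line_with_hint(strict) raises TypeError, which is the kind of incompatibility being described.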

The Werkzeug form processing module is properly WSGI 1.0 compliant,
meaning that Werkzeug is possibly the only major WSGI framework to be
WSGI compliant.

Graham

From fumanchu at aminus.org  Wed May 13 05:43:21 2009
From: fumanchu at aminus.org (Robert Brewer)
Date: Tue, 12 May 2009 20:43:21 -0700
Subject: [Web-SIG] py3k, cgi, email, and form-data
In-Reply-To: <88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com>
References: <AcnSWQ/GR3W2RBf3RAKfzKnEHXpWuQ==>
	<F1962646D3B64642B7C9A06068EE1E6418B3DA@ex10.hostedexchange.local>
	<88e286470905121933i6b9dcffj82446098990224cc@mail.gmail.com>
Message-ID: <F1962646D3B64642B7C9A06068EE1E64085736FB@ex10.hostedexchange.local>

Graham Dumpleton wrote:
> 2009/5/12 Robert Brewer <fumanchu at aminus.org>:
> > There's a major change in functionality in the cgi module between Python
> > 2 and Python 3 which I've just run across: the behavior of
> > FieldStorage.read_multi, specifically when an HTTP app accepts a file
> > upload within a multipart/form-data payload.
> >
> > In Python 2, each part would be read in sequence within its own
> > FieldStorage instance. This allowed file uploads to be shunted to a
> > TemporaryFile (via make_file) as needed:
> >
> >     klass = self.FieldStorageClass or self.__class__
> >     part = klass(self.fp, {}, ib,
> >                  environ, keep_blank_values, strict_parsing)
> >     # Throw first part away
> >     while not part.done:
> >         headers = rfc822.Message(self.fp)
> >         part = klass(self.fp, headers, ib,
> >                      environ, keep_blank_values, strict_parsing)
> >         self.list.append(part)
> >
> > In Python 3 (svn revision 72466), the whole request body is read into
> > memory first via fp.read(), and then broken into separate parts in a
> > second step:
> >
> >     klass = self.FieldStorageClass or self.__class__
> >     parser = email.parser.FeedParser()
> >     # Create bogus content-type header for proper multipart parsing
> >     parser.feed('Content-Type: %s; boundary=%s\r\n\r\n' % (self.type, ib))
> >     parser.feed(self.fp.read())
> >     full_msg = parser.close()
> >     # Get subparts
> >     msgs = full_msg.get_payload()
> >     for msg in msgs:
> >         fp = StringIO(msg.get_payload())
> >         part = klass(fp, msg, ib, environ, keep_blank_values,
> >                      strict_parsing)
> >         self.list.append(part)
> >
> > This makes the cgi module in Python 3 somewhat crippled for handling
> > multipart/form-data file uploads of any significant size (and since
> > the client is the one determining the size, opens a server up for an
> > unexpected Denial of Service vector).
> >
> > I *think* the FeedParser is designed to accept incremental writes,
> > but I haven't yet found a way to do any kind of incremental reads
> > from it in order to shunt the fp.read out to a tempfile again.
> > I'm secretly hoping Barry has a one-liner fix for this. ;)
> 
> FWIW, Werkzeug gave up on 'cgi' module for form parsing and implements
> its own.
> 
> Not sure whether this issue in Python 3.0 was one of the reasons or
> not. I know one of the reasons was because cgi.FieldStorage is not
> WSGI 1.0 compliant. One of the main reasons that no one actually
> adheres to WSGI 1.0 is because of the 'cgi' module. This still hasn't
> been addressed by a proper amendment to WSGI 1.0 specification or a
> new WSGI 1.1 specification to allow a hint to readline().
> 
> The Werkzeug form processing module is properly WSGI 1.0 compliant,
> meaning that Werkzeug is possibly the only major WSGI framework to be
> WSGI compliant.

FWIW, I just added a replacement for the cgi module to CherryPy over the weekend for the same reasons. It's in the python3 branch but will get backported to CherryPy 3.2 for Python 2.x.


Robert Brewer
fumanchu at aminus.org

From daywednes at gmail.com  Sat May 23 20:53:10 2009
From: daywednes at gmail.com (Minh Doan)
Date: Sat, 23 May 2009 11:53:10 -0700
Subject: [Web-SIG] web programming,
Message-ID: <9c76a0930905231153j5dc66123j355287fa1ca43d69@mail.gmail.com>

Hi,

I'm a newbie to Python and I'm stuck on the following problem. I want
to download the info (price) for a trip from fromcity to tocity at a certain
time from the kayak.com website. Doing it manually, we go to the website,
choose the appropriate options and press SEARCH. How can I do it in
Python?

I hope someone could help me deal with the problem.
Thanks
----
Minh Doan

From pstradomski at gmail.com  Sat May 23 21:45:03 2009
From: pstradomski at gmail.com (=?utf-8?q?Pawe=C5=82_Stradomski?=)
Date: Sat, 23 May 2009 21:45:03 +0200
Subject: [Web-SIG] web programming,
In-Reply-To: <9c76a0930905231153j5dc66123j355287fa1ca43d69@mail.gmail.com>
References: <9c76a0930905231153j5dc66123j355287fa1ca43d69@mail.gmail.com>
Message-ID: <200905232145.03701.pstradomski@gmail.com>

In a message dated Saturday 23 May 2009, Minh Doan wrote:
> Hi,
>
> I'm a newbie to python. I am having stuck with the following problem. I
> want to download the info(price) from fromcity to tocity at a certain time
> from kayak.com website. If we do it manually, we can go to the website,
> choose the appropriate info we want to get and press SEARCH. How can i do
> it in python ?
>

Try urllib or urllib2 and BeautifulSoup.
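
A rough sketch of that approach for Python 3, using only the standard library: urllib builds and fetches the search URL, and html.parser stands in for BeautifulSoup so the example has no third-party dependency. Everything site-specific is an assumption: the base URL, the query parameter names and the span class="price" markup are placeholders, not how kayak.com actually works, and fetch_html is defined but not called here.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, origin, destination, date):
    # Parameter names are hypothetical; inspect the real search form.
    query = urlencode({'from': origin, 'to': destination, 'date': date})
    return '%s?%s' % (base_url, query)


def fetch_html(url):
    # Download the page and decode it to text.
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


class PriceCollector(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and ('class', 'price') in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


sample_html = ('<ul><li><span class="price">$199</span></li>'
               '<li><span class="price">$245</span></li></ul>')
collector = PriceCollector()
collector.feed(sample_html)
url = build_search_url('https://example.com/search', 'AMS', 'SFO', '2009-06-01')
```

Sites that render results with JavaScript will not expose prices in the fetched HTML, in which case this approach alone is not enough.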


-- 
Paweł Stradomski


From omar.website at gmail.com  Sun May 31 18:30:27 2009
From: omar.website at gmail.com (Omar Munk)
Date: Sun, 31 May 2009 16:30:27 -0000
Subject: [Web-SIG] Web Framework
Message-ID: <7f559f2d0905310930k607a346as3d9984c45975c642@mail.gmail.com>

Hello

I'm Pynthon and I'm 14 years old. I'm coming from Holland so my English
isn't very good. I'm looking for a good Python webframework. I liked Web2Py
but it always can be better. I don't need a full admin app included. I just
want to code it in my text editor just like PHP. Do you guys know a
framework with:


   - Good documentation.
   - Not overkill like Django.
   - Easy and simple.
   - Something like PHP but without the dirty style.
   - I like Karrigell but it looks like it's dead. Do you know a clone of it?
   - No need for a VPS to host it, just a server that has Python.

I know it's almost impossible but I searched everywhere! And is creating your
own that hard?

Thanks,
Pynthon