Web servers, bytes, str, documentation, Python 3.2a4
So maybe this is the wrong forum; if so, please tell me what the right forum is for each of the various pieces. I'm assuming that I should file some bugs in the tracker, but I'm not exactly sure whether to file them on cgitb, http.server, or subprocess, or all of the above. Pretty sure there are at least some in http.server, but maybe some of those will be considered "enhancement requests" since they are long outstanding in the predecessor code.

So I've been writing CGI scripts in Python behind Apache. No framework, just raw CGI. Got everything working on Python 2.6 (it's the newest that the hosting company has). Whacked at 2.6's CGIHTTPServer.py until I got an environment that would actually run CGI programs in the same sort of way that Apache does, so I can test faster, locally. Got the site working. Am happy.

Now I decided to tackle porting the code to Python 3, in hopes that someday the hosting company might have it, and to see what I could learn about the "Subject:" matters, and to altruistically see if 3.2a4 has a consistent story.

Um. Well. Some of me, Python 3.2a4, or its documentation is missing something. Maybe several somethings.

Here's some code to ponder.

    import sys
    import traceback
    sys.stdout = open("sob", "wb") # WSGI sez data should be binary, so stdout should be binary???
    import cgitb
    sys.stdout.write(b"out")
    fhb = open("fhb", "wb")
    cgitb.enable(0,"d:\temp")
    fhb.write("abcdef") # try writing non-binary to binary file. Expect an error, of course.

Feed it to python32...

    d:\temp>c:\python32\python.exe test11.py
    Error in sys.excepthook:
    TypeError: 'str' does not support the buffer interface

    Original exception was:
    Traceback (most recent call last):
      File "d:\my\py\test11.py", line 8, in <module>
        fhb.write("abcdef") # try writing non-binary to binary file. Expect an error, of course.
    TypeError: 'str' does not support the buffer interface

So it seems that cgitb can't write to binary files, to report the error?
Or how else should I interpret the "Error in sys.excepthook"?

So then I tweaked the code for cgitb's enjoyment:

    import sys
    import traceback
    sys.stdout = open("sob", "w", encoding="UTF-8") # WSGI sez data should be binary, so stdout should be binary???
    import cgitb
    sys.stdout.write("out")
    fhb = open("fhb", "wb")
    cgitb.enable(0,"d:\temp")
    fhb.write("abcdef") # try writing non-binary to binary file. Expect an error, of course.

Now I get the following report in the stdout file:

    out<!--: spam
    Content-Type: text/html

    <body bgcolor="#f0f0f8"><font color="#f0f0f8" size="-5"> -->
    <body bgcolor="#f0f0f8"><font color="#f0f0f8" size="-5"> --> -->
    </font> </font> </font> </script> </object> </blockquote> </pre>
    </table> </table> </table> </table> </table> </font> </font> </font><p>A problem occurred in a Python script.

and the following error on the console:

    d:\temp>c:\python32\python.exe test12.py
    Error in sys.excepthook:
    Traceback (most recent call last):
      File "c:\python32\lib\tempfile.py", line 209, in _mkstemp_inner
        fd = _os.open(file, flags, 0o600)
    OSError: [Errno 22] Invalid argument

    Original exception was:
    Traceback (most recent call last):
      File "d:\my\py\test12.py", line 8, in <module>
        fhb.write("abcdef") # try writing non-binary to binary file. Expect an error, of course.
    TypeError: 'str' does not support the buffer interface

I was expecting to see a whole cgitb report in sob, but no such luck. Not sure why it is trying to create a temporary file, but it seems to fail to do that.

Of course, the next test would have been to write binary data into fhb, and try to copy it to stdout, which would fail, because stdout has to not be binary to make cgitb work???

That brings me to http.server, the 3.2a4 replacement for CGIHTTPServer. There are definitely some improvements here, and some reported-but-yet-unfixed bugs. And some pitiful missing features, especially on Windows.
I applied some of the whacks I had applied to CGIHTTPServer, and got some things working, but, per what I was trying to demonstrate above, there seems to be an incompatibility between the idea of using cgitb (which wants stdout open with some encoding provided) and serving binary files (which wants stdout open in binary) [this latter is supported by the WSGI spec too].

So it seems to me that there are some problems. Yet, it seems that http.server can somehow accept the data sent by cgitb, which comes from subprocess running my CGI script, but my CGI script fails to be able to copy a binary file to its stdout (a subprocess-created PIPE).

The subprocess documentation doesn't say what encoding is supplied to the PIPE-created handles, if any, but since cgitb data is accepted but binary file data is not, I infer it must be a non-binary handle, encoding unknown. The subprocess documentation doesn't document any way to specify what encoding should be used on the PIPE-created handles, either. So this isn't very enlightening. In the absence of a specification or parameter, I would have expected the PIPEs to be binary, but this seems to be experimentally false.

Yet http.server, when serving plain files, seems to open them in binary mode, and transfer them successfully to the browser. And it can also accept the non-binary?? data from cgitb from my CGI script, and display it in the browser. The former comes from a file it opens in binary mode, and the latter from the subprocess PIPE in unknown mode.

It seems that the socketfile.server opens the socket in "wb" mode, and encodes most data. That, in turn, seems to imply that the binary data from SimpleHTTPServer files are reasonably returned, and I note the headers and such are explicitly encoded before being written to wfile... again, consistent with the socket, wfile, being in binary mode.
But the data coming back from the subprocess PIPE from my CGI script seems to be acceptable to be written to wfile also, implying that the PIPEs are binary, as the absence of specifications and parameters, and the knowledge that pipes are bytestreams, would lead one to expect. But then, it would seem that the cgitb output should be in binary to get into the PIPE, but it seems that using a binary stdout makes cgitb fail, in the above experiment... and I can't find any code in cgitb that does explicit encoding.

So I'm confused, and it seems a little extra documentation might help decide which are the modules that have bugs or missing features, and which do not.

One of the cgitb outputs from my attempt to serve the binary file claims that my CGI script's output file (which comes from a subprocess PIPE) is a TextIOWrapper with encoding cp1252. Maybe that is the default that comes when a new Python is launched, even though it gets a subprocess PIPE as stdout?
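A small experiment can help separate the two sides of the pipe. The sketch below is not taken from http.server; it assumes a Python 3 interpreter reachable as sys.executable, and the child script text and the names CHILD_CODE and data are made up for illustration. The parent reads bytes from subprocess.PIPE, while the child reports the type of its own sys.stdout and then writes raw bytes through sys.stdout.buffer.

```python
import subprocess
import sys

# Hypothetical child script: reports what kind of object its sys.stdout is,
# then writes four raw bytes through the underlying binary layer.
CHILD_CODE = (
    "import sys\n"
    "name = type(sys.stdout).__name__.encode('ascii')\n"
    "sys.stdout.buffer.write(name + b'\\n' + bytes([0, 255, 13, 10]))\n"
)

# The parent side of a PIPE hands back bytes, with no decoding applied.
proc = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                        stdout=subprocess.PIPE)
data, _ = proc.communicate()

print(type(data).__name__)  # the parent receives a bytes object
print(data)
```

If this behaves the same way on the platform in question, the text layer lives only inside the child: the pipe itself carries bytes, and the child's default sys.stdout wraps it with a locale-dependent encoding.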
On 11/19/2010 7:48 PM, Glenn Linderman wrote:
One of the cgitb outputs from my attempt to serve the binary file claims that my CGI script's output file (which comes from a subprocess PIPE) is a TextIOWrapper with encoding cp1252. Maybe that is the default that comes when a new Python is launched, even though it gets a subprocess PIPE as stdout?
So the rather gross code below solves the cp1252 stdout problem, and also permits both strings and bytes to be written to the same file, although those two features are separable.

But now that I've worked around it, it seems that subprocess should somehow ensure that launched Python programs know they are working on a binary stream? Of course, not all programs launched are Python programs... so maybe it should be a documentation issue, but it seems to be missing from the documentation.

    import io
    import sys

    #####################################
    if sys.version_info[ 0 ] == 2:
        class IOMix():
            def __init__( self, fh, encoding="UTF-8"):
                self.fh = fh
                # remember the encoding so write() can use it
                self.encoding = encoding
            def write( self, param ):
                if isinstance( param, unicode ):
                    self.fh.write( param.encode( self.encoding ))
                else:
                    self.fh.write( param )
    #####################################
    if sys.version_info[ 0 ] == 3:
        class IOMix():
            def __init__( self, fh, encoding="UTF-8"):
                if hasattr( fh, 'buffer'):
                    self.bio = fh.buffer
                    fh.flush()
                    self.last = 'b'
                    # text layer over the same binary stream, CRLF line ends
                    self.txt = io.TextIOWrapper( self.bio, encoding, None, '\r\n')
                else:
                    raise ValueError("not a buffered stream")
            def write( self, param ):
                if isinstance( param, str ):
                    self.last = 't'
                    self.txt.write( param )
                else:
                    # push pending text out before raw bytes, to keep order
                    if self.last == 't':
                        self.txt.flush()
                    self.last = 'b'
                    self.bio.write( param )
    #####################################
On 11/20/2010 3:38 AM, Éric Araujo wrote:
Hello
    cgitb.enable(0,"d:\temp")

Isn't that expanded to "d:<tab>emp"?
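The escape-sequence point is easy to check directly. The short sketch below uses illustrative variable names (plain, raw, escaped) that do not appear in the test scripts; it compares the literal as written with a raw string and with a doubled backslash.

```python
# "\t" inside an ordinary string literal is a tab character.
plain = "d:\temp"
# A raw string or a doubled backslash keeps the backslash itself.
raw = r"d:\temp"
escaped = "d:\\temp"

print(repr(plain))    # shows the embedded tab as \t
print(repr(raw))      # shows a literal backslash, written as \\
print(raw == escaped)
```

Either of the last two spellings passes the intended directory name to cgitb.enable.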
Oops. Yes, that fixes the problem with creation of the temp file, thanks for catching that. I now get a complete report of the original error in the temp file (below).

I am a bit less confused now... but it seems that there are still a number of issues. Here is an enumeration of problems I was hard pressed to make before you removed my confusion on this issue.

1. cgitb should expect to report to a binary stdout, using whatever encoding (possibly ASCII) that seems appropriate for the output that it generates.

2. Some appropriate documentation or API or both should be provided to enable a script to set "binary" mode for stdout for CGI scripts. This link <http://www.eggheadcafe.com/software/aspnet/36023550/cgi-python-3-write-raw-b...> demonstrates the confusion (wish I had found it earlier) that is encountered by such lack. One must tell msvcrt the stream is binary (I had figured that out early on), one must also sidestep the use of the cp1252 default when printing binary, one must also choose a proper text encoding corresponding to the HTTP headers sent. My second email in this thread, sent a few hours after the first, shows a convenient set of cures for all but msvcrt (as long as only "write" is used for writing. "print" support could be added, similarly). Likely something along this line is needed for stdin as well, I haven't yet experimented with uploading binary content to a CGI.

One could speculate about having the Python runtime auto-detect CGI mode, but I don't know of any foolproof technique for that, and the selection of the "proper" text encoding depends on the details of the CGI, so having instead an API or two that assists with doing this sort of thing would be better; the need for documentation, at least, seems imperative.

3. subprocess documentation could be improved to point out that when using subprocess.PIPE to talk to a Python subprocess, the communications will be in binary. Again, I don't know of any way to autodetect the subprocess environment, but if it were possible to select an appropriate encoding and use it consistently on both sides of the PIPE, that would be a convenience to its use; if not possible, documenting the issue, and providing an API to use to easily select such encodings both in client and server, would be helpful.

While the layers are all there, and ".buffer" is documented for TextIOWrapper, the use of sys.stdout.buffer and the fact that it has a full set of operations isn't immediately obvious from the reference material; perhaps it is in a tutorial I haven't found, but... I was looking, and didn't find it.

Of course, subprocess may launch non-Python programs; they will have their own ideas of binary vs text encoding, so it is important that it is convenient to match them on the Python side.

It would be nice if subprocess had a mechanism for providing no-deadlock stdout data to the parent prior to the child terminating. A CGI implementation via subprocess shouldn't accumulate all of stdout (or all of stderr, for that matter, although less important). I don't (yet) know enough about Python threading to know if this is possible, but it certainly would be useful.

4. http.server has a number of bugs and limitations.

4a. _url_collapse_path_split seems inefficient (although I have to benchmark it against what I think would be more efficient), and for its only use within http.server it produces the wrong information, so the information has to be recombined and resplit to make it function properly, adding to the perception of inefficiency.

4b. Detection of "executable" on Windows is simply wrong. Unix execution bits do not exist.

4c. is_cgi doesn't properly handle PATHINFO parts of the path, this is the other half of 4a. The Python2.x CGIHTTPServer.py had this right, but the introduction and use of _url_collapse_path_split broke it.

4d. Searching for a ? to find an explicit query string should use .find('?') rather than .rfind('?') as there is no prohibition on using '?' within a query string, AFAIK.

4e. doesn't set the REQUEST_URI, HTTP_HOST, or HTTP_PORT environment variables for the CGI.

4f. Should not send the 200 response until it sees if the CGI sends a Status: header.

4g. Should not buffer all of stdout: subprocess.communicate is inappropriate for a web server CGI interface. The data should stream through to avoid consuming inordinate amounts of memory. The only solution within the current limitations of subprocess is to abandon stderr, force the CGI to do its own error logging, and use shutil.copyfileobj to hook up p.stdout to self.wfile once the Status: message processing has happened.

4h. Doesn't seem to close p.stdin (I'm not sure if that is necessary, it may happen when p is garbage collected, but effort was made to close p.stdout and p.stderr, which seem similar.)

*TypeError*
Python 3.2a4: c:\python32\python.exe
Sat Nov 20 09:28:41 2010

A problem occurred in a Python script. Here is the sequence of function calls leading up to the error, in the order they occurred.

d:\my\py\test12.py in **()
    4 import cgitb
    5 sys.stdout.write("out")
    6 fhb = open("fhb", "wb")
    7 cgitb.enable(0,"d:\\temp")
=>  8 fhb.write("abcdef") # try writing non-binary to binary file. Expect an error, of course.
*fhb* = <_io.BufferedWriter name='fhb'>, fhb.*write* = <built-in method write of _io.BufferedWriter object>

*TypeError*: 'str' does not support the buffer interface
    args = ("'str' does not support the buffer interface",)
    with_traceback = <built-in method with_traceback of TypeError object>
On 11/20/2010 10:19 AM, Glenn Linderman wrote:
Oops. Yes, that fixes the problem with creation of the temp file, thanks for catching that. I now get a complete report of the original error in the temp file (below). I am a bit less confused now... but it seems that there are still a number of issues. Here is an enumeration of problems I was hard pressed to make before you removed my confusion on this issue.
Related issues, regarding binary stream requirements for cgi interface. Perhaps the cgi module should have the API to set binary mode. http://bugs.python.org/issue1610654 http://bugs.python.org/issue8077 http://bugs.python.org/issue4953 Sadly, cgi.py input handling seems to depend on the email module, thought to be fixed for 3.2, but it is not clear if that has been achieved, or if the surrogate encode workaround is sufficient for this. More testing needed, but I don't have such a test case developed yet.
1. cgitb should expect to report to a binary stdout, using whatever encoding (possibly ASCII) that seems appropriate for the output that it generates.
Maybe cgi.py should have an API to set the stdin and stdout to binary streams. Although cgi.py deals more with stdin than stdout, cgitb deals more with stdout. Created http://bugs.python.org/issue10479
2. Some appropriate documentation or API or both should be provided to enable a script to set "binary" mode for stdout for CGI scripts. This link <http://www.eggheadcafe.com/software/aspnet/36023550/cgi-python-3-write-raw-b...> demonstrates the confusion (wish I had found it earlier) that is encountered by such lack. One must tell msvcrt the stream is binary (I had figured that out early on), one must also sidestep the use of the cp1252 default when printing binary, one must also choose a proper text encoding corresponding to the HTTP headers sent. My second email in this thread, sent a few hours after the first, shows a convenient set of cures for all but msvcrt (as long as only "write" is used for writing. "print" support could be added, similarly). Likely something along this line is needed for stdin as well, I haven't yet experimented with uploading binary content to a CGI.
One could speculate about having the Python runtime auto-detect CGI mode, but I don't know of any foolproof technique for that, and the selection of the "proper" text encoding depends on the details of the CGI, so having instead an API or two that assists with doing this sort of thing would be better; the need for documentation, at least, seems imperative.
Created http://bugs.python.org/issue10480
3. subprocess documentation could be improved to point out that when using subprocess.PIPE to talk to a Python subprocess, that the communications will be in binary. Again, I don't know of any way to autodetect the subprocess environment, but if it were possible to select an appropriate encoding and use it consistently on both sides of the PIPE, that would be a convenience to its use; if not possible, documenting the issue, and providing an API to use to easily select such encodings both in client and server, would be helpful.
While the layers are all there, and ".buffer" is documented for TextIOWrapper, the use of sys.stdout.buffer and the fact that it has a full set of operations isn't immediately obvious from the reference material; perhaps it is in a tutorial I haven't found, but... I was looking, and didn't find it.
Of course, subprocess may launch non-Python programs; they will have their own ideas of binary vs text encoding, so it is important that it is convenient to match them on the Python side.
It would be nice if subprocess had a mechanism for providing no-deadlock stdout data to the parent prior to the child terminating. A CGI implementation via subprocess shouldn't accumulate all of stdout (or all of stderr, for that matter, although less important). I don't (yet) know enough about Python threading to know if this is possible, but it certainly would be useful.
http://bugs.python.org/issue1048 for subprocess to document that communicate produces byte stream output.

http://bugs.python.org/issue10482 for subprocess enhancements to handle more cases without deadlock.

Found http://bugs.python.org/issue4571 which documents how to switch stdin/stdout/stderr to binary mode, and even back! I couldn't track the documented change to the actual documentation, though, but I did find it in section 26.1, under the documentation for the three stdio streams:

    import sys

    def make_streams_binary():
        sys.stdin = sys.stdin.detach()
        sys.stdout = sys.stdout.detach()
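The detach() approach can be tried without touching the real sys.stdout. The sketch below is only an illustration under stated assumptions: an io.BytesIO object stands in for the process's stdout pipe, the wrapper is created with newline="" so that no line-ending translation interferes, and the helper name detach_to_binary is hypothetical.

```python
import io

def detach_to_binary(text_stream):
    """Flush pending text, then return the underlying binary stream.

    After detach(), the text wrapper itself can no longer be used.
    """
    text_stream.flush()
    return text_stream.detach()

# Stand-in for the pipe behind sys.stdout (an assumption for this demo).
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="cp1252", newline="")

# Headers go out as text, the body as raw bytes.
text.write("Content-Type: image/png\r\n\r\n")
binary = detach_to_binary(text)
binary.write(b"\x89PNG")

result = raw.getvalue()
print(result)
```

The same pattern applied to sys.stdout would let a CGI script emit text headers first and then switch to bytes for the body, provided nothing else still holds a reference to the old text wrapper.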
4. http.server has a number of bugs and limitations. 4a. _url_collapse_path_split seems inefficient (although I have to benchmark it against what I think would be more efficient), and for its only use within http.server it produces the wrong information, so the information has to be recombined and resplit to make it function properly, adding to the perception of inefficiency. 4b. Detection of "executable" on Windows is simply wrong. Unix execution bits do not exist.
http://bugs.python.org/issue10483 for 4b.
4c. is_cgi doesn't properly handle PATHINFO parts of the path, this is the other half of 4a. The Python2.x CGIHTTPServer.py had this right, but the introduction and use of _url_collapse_path_split broke it.
http://bugs.python.org/issue10484 for 4a and 4c.
4d. Searching for a ? to find an explicit query string should use .find('?') rather than .rfind('?') as there is no prohibition on using '?' within a query string, AFAIK.
http://bugs.python.org/issue10485 for 4d.
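The difference between the two searches shows up as soon as the query string itself contains a '?'. The sketch below is a standalone illustration, not the http.server code; the function names and the URL are hypothetical.

```python
def split_at_first(path):
    """Split at the first '?', as item 4d suggests."""
    i = path.find("?")
    if i < 0:
        return path, ""
    return path[:i], path[i + 1:]

def split_at_last(path):
    """Split at the last '?', which is what .rfind('?') amounts to."""
    i = path.rfind("?")
    if i < 0:
        return path, ""
    return path[:i], path[i + 1:]

# Hypothetical request path whose query value contains a '?'.
URL = "/cgi-bin/search.py?q=what?&page=2"

first_split = split_at_first(URL)
last_split = split_at_last(URL)
print(first_split)
print(last_split)
```

With the first-occurrence split, the script path stays intact and the whole query reaches the CGI; with the last-occurrence split, part of the query ends up glued to the script path.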
4e. doesn't set the REQUEST_URI, HTTP_HOST, or HTTP_PORT environment variables for the CGI.
http://bugs.python.org/issue10486 for 4e.
4f. Should not send the 200 response until it sees if the CGI sends a Status: header.
http://bugs.python.org/issue10487 for 4f and 4g.
4g. Should not buffer all of stdout: subprocess.communicate is inappropriate for a web server CGI interface. The data should stream through to avoid consuming inordinate amounts of memory. The only solution within the current limitations of subprocess is to abandon stderr, force the CGI to do its own error logging, and use shutil.copyfileobj to hook up p.stdout to self.wfile once the Status: message processing has happened. 4h. Doesn't seem to close p.stdin (I'm not sure if that is necessary, it may happen when p is garbage collected, but effort was made to close p.stdout and p.stderr, which seem similar.)
Discovered that subprocess.communicate closes p.stdin, so it wasn't needed until I quit using .communicate in my version of the code.
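Items 4f and 4g together suggest reading the CGI headers first and only then streaming the body. The following is a minimal sketch of that idea, not the http.server implementation: run_cgi, CHILD_CODE and the io.BytesIO sink standing in for self.wfile are all assumptions, stderr is simply left inherited (abandoned, as described in 4g), and subprocess.DEVNULL requires Python 3.3 or later.

```python
import io
import shutil
import subprocess
import sys

# Hypothetical CGI script: sends a Status: header, one more header, a body.
CHILD_CODE = (
    "import sys\n"
    "out = sys.stdout.buffer\n"
    "out.write(b'Status: 404 Not Found\\r\\n')\n"
    "out.write(b'Content-Type: text/plain\\r\\n\\r\\n')\n"
    "out.write(b'no such page\\n')\n"
)

def run_cgi(argv, wfile):
    """Run a CGI child and stream its output to wfile without communicate()."""
    proc = subprocess.Popen(argv, stdin=subprocess.DEVNULL,
                            stdout=subprocess.PIPE)
    status = b"200 OK"          # default when the CGI sends no Status: header
    headers = []
    while True:
        line = proc.stdout.readline()
        if line in (b"\r\n", b"\n", b""):   # blank line or EOF ends headers
            break
        if line.lower().startswith(b"status:"):
            status = line.split(b":", 1)[1].strip()
        else:
            headers.append(line)
    # Only now is the response line known, so only now is it sent.
    wfile.write(b"HTTP/1.0 " + status + b"\r\n")
    wfile.writelines(headers)
    wfile.write(b"\r\n")
    # Stream the body in chunks instead of buffering all of it.
    shutil.copyfileobj(proc.stdout, wfile)
    proc.stdout.close()
    return proc.wait()

sink = io.BytesIO()             # stand-in for self.wfile
exit_code = run_cgi([sys.executable, "-c", CHILD_CODE], sink)
print(sink.getvalue())
```

Because stdin is not a pipe here, there is nothing on that side to close; a version that forwards a request body to the child would need to write and close p.stdin itself.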
On Sat, 20 Nov 2010 23:52:45 -0800, Glenn Linderman <v+python@g.nevcal.com> wrote:
Sadly, cgi.py input handling seems to depend on the email module, thought to be fixed for 3.2, but it is not clear if that has been achieved, or if the surrogate encode workaround is sufficient for this. More testing needed, but I don't have such a test case developed yet.
Indeed, this should theoretically be fixable now. The email module is now perfectly capable of both consuming and producing binary data. The user of the module doesn't need to care how this was achieved unless they want to do processing of non-RFC conformant data. I want to look at the CGI issue, but I'm not sure when I'll get to it. -- R. David Murray www.bitdance.com
On 11/21/2010 9:18 AM, R. David Murray wrote:
I want to look at the CGI issue, but I'm not sure when I'll get to it.
Actually, since this code was working before 3.x, and if email.parser can now accept binary streams, it seems like maybe the only thing that might be wrong is that presently it is getting a text stream instead, so that is something cgi.py or the application program would have to switch, and then maybe some testing would discover correctness, or maybe a specification of UTF-8 as the encoding to use for the text parts would have to be done.
On Sun, 21 Nov 2010 19:59:54 -0800, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 11/21/2010 9:18 AM, R. David Murray wrote:
I want to look at the CGI issue, but I'm not sure when I'll get to it.
Actually, since this code was working before 3.x, and if email.parser can now accept binary streams, it seems like maybe the only thing that might be wrong is that presently it is getting a text stream instead, so that is something cgi.py or the application program would have to switch, and then maybe some testing would discover correctness, or maybe a specification of UTF-8 as the encoding to use for the text parts would have to be done.
Well, given the bytes/string split in Python3, code definitely has to be changed to make this work, since you have to explicitly call bytes processing routines (message_from_bytes, message_from_binary_file, BytesFeedparser, etc) to parse binary data, and likewise use BytesGenerator to emit binary data. -- R. David Murray www.bitdance.com
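As a rough illustration of the bytes-oriented routines named above, the sketch below parses a small message with email.message_from_bytes and writes it back out with BytesGenerator. The message content is invented for the example, and nothing here reflects how cgi.py itself would need to be changed.

```python
import email
import io
from email.generator import BytesGenerator

# Invented message with a non-ASCII (UTF-8 encoded) body.
RAW = (b"Content-Type: text/plain; charset=utf-8\r\n"
       b"Content-Transfer-Encoding: 8bit\r\n"
       b"\r\n"
       b"caf\xc3\xa9\r\n")

# Parse directly from bytes; no text decoding of the stream beforehand.
msg = email.message_from_bytes(RAW)
payload = msg.get_payload(decode=True)   # body as bytes

# Emit the message again as bytes.
out = io.BytesIO()
BytesGenerator(out).flatten(msg)

print(msg.get_content_type())
print(payload)
```

The str-based counterparts (message_from_string, Generator) would require the caller to pick an encoding before parsing, which is the step a binary stdin makes unnecessary.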
On 11/21/2010 8:39 PM, R. David Murray wrote:
On Sun, 21 Nov 2010 19:59:54 -0800, Glenn Linderman <v+python@g.nevcal.com> wrote:
On 11/21/2010 9:18 AM, R. David Murray wrote:
I want to look at the CGI issue, but I'm not sure when I'll get to it.
Actually, since this code was working before 3.x, and if email.parser can now accept binary streams, it seems like maybe the only thing that might be wrong is that presently it is getting a text stream instead, so that is something cgi.py or the application program would have to switch, and then maybe some testing would discover correctness, or maybe a specification of UTF-8 as the encoding to use for the text parts would have to be done.
Well, given the bytes/string split in Python3, code definitely has to be changed to make this work, since you have to explicitly call bytes processing routines (message_from_bytes, message_from_binary_file, BytesFeedparser, etc) to parse binary data, and likewise use BytesGenerator to emit binary data.
Looks like cgi.py also calls http.client and both of them would need to be changed to deal with bytes. I don't have the full translation of API calls in my head, nor have I ever used the email.parser API to know what the calls actually do... just read a bit about it... but that is different than using it... However, I find code in http.client.parse_headers that is attempting to work-around reading a binary stream and feeding email.parser a string. So definitely some work to be done to fix things. I did add some explicit threads to http.server CGI script code that I think work around the deadlocks that can result from attempting to serialize 3 pipes, and yet not require full buffering of stdin or stdout. At the moment, I still am doing full buffering of stderr, but that is thought to be small potatoes in an http.server environment, generally. But since my test case is a CGI form data, I'm stuck until this is fixed, or I wrap my head around the code in http.client and email.parser. But not tonight (yawn!).
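The thread-based workaround is not shown in this message, so the following is only a guess at its general shape, under stated assumptions: one helper thread feeds the child's stdin, another collects stderr in full, and the main thread streams stdout. The names run_with_threads and CHILD_CODE, and the io.BytesIO sink standing in for self.wfile, are hypothetical.

```python
import io
import shutil
import subprocess
import sys
import threading

# Hypothetical CGI child: echoes its stdin upper-cased and logs to stderr.
CHILD_CODE = (
    "import sys\n"
    "data = sys.stdin.buffer.read()\n"
    "sys.stdout.buffer.write(data.upper())\n"
    "sys.stderr.buffer.write(b'log line')\n"
)

def run_with_threads(argv, body, wfile):
    """Serve three pipes at once so that no single one can block the others."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    err_chunks = []

    def feed_stdin():
        # Write the request body, then close so the child sees end of file.
        try:
            proc.stdin.write(body)
        finally:
            proc.stdin.close()

    def drain_stderr():
        # stderr is still buffered in full, as described in the message.
        err_chunks.append(proc.stderr.read())

    threads = [threading.Thread(target=feed_stdin),
               threading.Thread(target=drain_stderr)]
    for t in threads:
        t.start()
    # The calling thread streams stdout in chunks.
    shutil.copyfileobj(proc.stdout, wfile)
    for t in threads:
        t.join()
    proc.stdout.close()
    proc.stderr.close()
    return proc.wait(), b"".join(err_chunks)

sink = io.BytesIO()             # stand-in for self.wfile
exit_code, err = run_with_threads([sys.executable, "-c", CHILD_CODE],
                                  b"name=value", sink)
print(sink.getvalue(), err)
```

Each pipe has its own reader or writer, so a child that fills its stderr buffer before reading stdin cannot stall the parent the way a sequential write-then-read would.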
participants (3): Glenn Linderman, R. David Murray, Éric Araujo