[New-bugs-announce] [issue14567] http/server.py query string handling incorrect, inefficient
report at bugs.python.org
Thu Apr 12 21:53:07 CEST 2012
New submission from Glenn Linderman <v+python at g.nevcal.com>:
A URL potentially consists of four parts: path, PATH_INFO, anchor, QUERY_STRING. The syntax is roughly:
where # and ? characters play key roles.
is_cgi not-so-cleverly passes the whole request to _url_collapse_path for normalization, resulting in inappropriately normalizing the anchor and query-string as well as appropriately normalizing the path and path info parts. Consider that there is no syntax restrictions preventing a query string or anchor from containing / and characters. I contrived such strings, and observed that they were passed to _url_collapse_path and were inappropriately "normalized".
Now for non-CGI usage, both the anchor and query-string are ignored by server.py (see translate_path which has code to chop them off). For non-CGI usage, translate_path is called just once, so this isn't particularly inefficient nor is it incorrect. However, for CGI usage it can be both.
It is not clear to me why browsers even send the anchor to the server, but they do. Perhaps there are servers that process it, but this one doesn't, in any case I can find, and I'm unaware of semantics applied by any server. Therefore, parsing and ignoring it seems appropriate.
Presently, inappropriate normalization of the query-string causes no correctness problem for CGI requests because of issue 14566 which reverts back to the original path. Should issue 14566 be fixed as recommended, certain query-strings will be mangled by the inappropriate normalization. However, even though there is not presently a correctness problem, there is an efficiency problem for CGI requests: the anchor and query-string are both needlessly processed by _url_collapse_path, and potentially repeatedly processed by translate_path in the loop at the top of run_cgi. In fact, it is not exactly clear whether anchors and query-strings containing cleverly crafted / and . and .. combinations might be able to confuse that loop... because translate_path chops them off, but the loop does not!
It seems to me that the appropriate place to separate the anchor and query string would be in parse_request. Instead of creating self.command, self.path, self.request_version there, I think it would be better to divide the current content of self.path into three parts: self.path, self.anchor, and self.query_string.
self.path, then, would unambiguously contain only path parts, and would be appropriately passed through _url_collapse_path... perhaps even right there in parse_request ... it seems appropriate to normalize the path for reqular files as well as CGI files for all the same reasons.
1) no ability to access files outside the configured realm due to malicious use of .. as a path component -- although it seems translate_path prevents this anyway, in a complex manner.
2) proper application of unquote, to allow # and ? characters to exist in path parts, and may also legally be applied to other characters by browser code
Neither of these actions are presently performed for regular files, only for CGI files.
If the processing happens in parse_request, then other code would be affected:
is_cgi would no longer need to call _url_collapse_path, as it would already be done
translate_path would no longer need to parse off query strings or anchors, as it would already be done
translate_path would no longer need to call urllib.parse.unquote, as it would already be done.
run_cgi would no longer need to parse anchor or query, but instead could reference query as self.query_string
components: Library (Lib)
nosy: orsenthil, v+python
title: http/server.py query string handling incorrect, inefficient
versions: Python 2.6, Python 2.7, Python 3.1, Python 3.2, Python 3.3
Python tracker <report at bugs.python.org>
More information about the New-bugs-announce