[issue10226] urlparse example is wrong
New submission from Alexander Belopolsky <belopolsky@users.sourceforge.net>:

The following example in Doc/library/urlparse.rst is wrong:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html', params='', query='', fragment='')

In the actual output, scheme='www.cwi.nl'. In addition, the preceding text is confusing and probably not grammatical:

"""
Otherwise, it is not possible to distinguish between netloc and path components, and would the indistinguishable component would be classified as the path as in a relative URL.
"""

Discovered while working on issue 10225.

----------
assignee: docs@python
components: Documentation
messages: 119855
nosy: belopolsky, docs@python
priority: normal
severity: normal
status: open
title: urlparse example is wrong
versions: Python 2.7, Python 3.1, Python 3.2

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue10226>
_______________________________________
Changes by Georg Brandl <georg@python.org>:

----------
assignee: docs@python -> orsenthil
nosy: +orsenthil
Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

Looks like I've been beaten again by make doctest picking up older python, but something is not right here:

In Python 2.6.5:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='www.cwi.nl', netloc='', path='80/%7Eguido/Python.html', params='', query='', fragment='')

but in 2.7:

>>> urlparse('www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl:80/%7Eguido/Python.html', params='', query='', fragment='')

and the text preceding the example in the doc does not really tell which is right.

----------
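Since the two releases quoted above split the same string differently, it may be safer to print the result on the interpreter at hand than to rely on a quoted one. The following is a minimal sketch, assuming Python 3's `urllib.parse` as the successor of the `urlparse` module discussed here (the fallback import covers Python 2). The helper name `describe` is made up for illustration.

```python
import sys

try:
    from urllib.parse import urlparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse      # Python 2 module discussed in this thread

URL = 'www.cwi.nl:80/%7Eguido/Python.html'

def describe(url):
    """Return (interpreter version, parse result) for a URL string."""
    return sys.version_info[:3], urlparse(url)

version, result = describe(URL)
print(version)
print(result)
# In both outputs quoted above, netloc is empty; what differs is whether
# the host-looking text ends up in scheme or in path.
```

No expected output is shown on purpose, because it depends on the release.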
Georg Brandl <georg@python.org> added the comment:

I think this is correct: it is the new behavior after the fix for #754016 was committed.

----------
nosy: +georg.brandl
Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

On Fri, Oct 29, 2010 at 2:15 AM, Georg Brandl <report@bugs.python.org> wrote:
..
> I think this is correct: it is the new behavior after the fix for #754016 was committed.

I agree. I kept the issue open because I cannot parse

"""
Otherwise, it is not possible to distinguish between netloc and path components, and would the indistinguishable component would be classified as the path as in a relative URL.
"""

----------
Georg Brandl <georg@python.org> added the comment:

That's for Senthil to rephrase as intended :)

----------
Senthil Kumaran <orsenthil@gmail.com> added the comment:

- Otherwise, it is not possible to distinguish between netloc and path
- components, and would the indistinguishable component would be classified
- as the path as in a relative URL.
+ If the netloc does not start with '//', the module cannot distinguish it
+ from path and it would classify it as path component in the relative url.

How does this sound?

----------
Éric Araujo <merwok@netwok.org> added the comment:

// is not part of the netloc in RFC terms, it’s a delimiter between components

----------
nosy: +eric.araujo
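The delimiter point can be checked directly: when the input is introduced by '//', the parsed netloc does not contain the slashes. A short sketch, again assuming Python 3's `urllib.parse` with a Python 2 fallback:

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse      # Python 2 module discussed in this thread

parts = urlparse('//www.cwi.nl:80/%7Eguido/Python.html')

print(parts.netloc)  # www.cwi.nl:80  (the '//' is not included)
print(parts.path)    # /%7Eguido/Python.html
```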
R. David Murray <rdmurray@bitdance.com> added the comment:

How about this:

- If the scheme value is not specified, urlparse following the syntax
- specifications from RFC 1808, expects the netloc value to start with '//',
- Otherwise, it is not possible to distinguish between net_loc and path
- component and would classify the indistinguishable component as path as in
- a relative url.
+ Following the syntax specifications in RFC 1808, urlparse recognizes
+ a netloc only if it is properly introduced by '//'. Otherwise the
+ input must be presumed to be a relative URL and thus to start with
+ a path component.

However, it seems to me there is a bug here:

>>> urlparse.urlparse('www.k.com:80/path')
ParseResult(scheme='', netloc='', path='www.k.com:80/path', params='', query='', fragment='')
>>> urlparse.urlparse('www.k.com:path')
ParseResult(scheme='www.k.com', netloc='', path='path', params='', query='', fragment='')

I think the second one is correct and that the first one should produce

ParseResult(scheme='www.k.com', netloc='', path='80/path', params='', query='', fragment='')

I haven't read all the way through the RFC again, though. But *one* of the above is wrong.

----------
nosy: +r.david.murray
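To see whether a given interpreter treats the two inputs consistently, one can report which field received the text before the first ':'. This is only a sketch: `host_text_field` is a hypothetical helper, not part of the module, and the import assumes Python 3's `urllib.parse` with a Python 2 fallback.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse      # Python 2 module discussed in this thread

def host_text_field(url):
    """Name the field that received the text before the first ':'.

    Hypothetical helper: returns 'scheme', 'path', or 'other'.
    """
    head = url.split(':', 1)[0]
    parts = urlparse(url)
    if parts.scheme == head.lower():  # schemes are reported in lower case
        return 'scheme'
    if parts.path.startswith(head):
        return 'path'
    return 'other'

for url in ('www.k.com:80/path', 'www.k.com:path'):
    print(url, '->', host_text_field(url))
```

If the two lines name different fields, the interpreter shows the inconsistency described in the comment above.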
Senthil Kumaran <orsenthil@gmail.com> added the comment:

Fixed the wordings in r86296(py3k), r86297(release31-maint) and r86298(release27-maint).

David, for the examples you mentioned, the first one's parsing logic follows the explanation that is written. It is correct.

For the second example, the port value not being a DIGIT exhibits such a behavior. I am unable to recollect the reason for this behavior. Either the URL is invalid (PORT is not a DIGIT, and parse module is simply ignoring to raise an error - it's okay, given the input is invalid) or it needs to distinguish the ':' as a port separator from path separator for some valid urls.

I think, if we find a better reason to change something for the second scenario, we shall address that.

----------
resolution: -> fixed
stage: -> committed/rejected
status: open -> closed
type: -> behavior
R. David Murray <rdmurray@bitdance.com> added the comment:

Senthil, no it isn't. There is no way to know a priori that ':80' represents a port number rather than a path, absent the // introducer for the netloc.

This bug is fixed; I ought to open a new one for the path thing but perhaps I will wait for a user report instead :)

----------
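The port only becomes recognizable once the netloc is introduced by '//'. A caller who knows the input starts with a host can add the introducer before parsing. The sketch below assumes exactly that; `parse_host_first` is a hypothetical helper and not something proposed in this thread, and the import assumes Python 3's `urllib.parse` with a Python 2 fallback.

```python
try:
    from urllib.parse import urlparse  # Python 3 location (assumption)
except ImportError:
    from urlparse import urlparse      # Python 2 module discussed in this thread

def parse_host_first(text):
    """Hypothetical helper: parse input that is known to begin with a host.

    Prepends the '//' introducer when the text has none, so the parser
    can recognize a netloc.
    """
    if '//' not in text:
        text = '//' + text
    return urlparse(text)

parts = parse_host_first('www.k.com:80/path')
print(parts.hostname, parts.port, parts.path)
```

The `'//' not in text` check is deliberately crude. It would misfire on a relative path that happens to contain '//', so it suits only inputs the caller already knows to be host-first.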
participants (5)
- Alexander Belopolsky
- Georg Brandl
- R. David Murray
- Senthil Kumaran
- Éric Araujo