Re: [Python-Dev] Path object design
At 01:56 AM 11/4/2006 +0100, Andrew Dalke wrote:
os.join assumes the base is a directory name when used in a join: "inserting '/' as needed" while RFC 1808 says
The last segment of the base URL's path (anything following the rightmost slash "/", or the entire path if no slash is present) is removed
Is my intuition wrong in thinking those should be the same?
Yes. :)

Path combining and URL absolutization(?) are inherently different operations with only superficial similarities. One reason for this is that a trailing / on a URL has an actual meaning, whereas in filesystem paths a trailing / is an aberration and likely an actual error.

The path combining operation says, "treat the following as a subpath of the base path, unless it is absolute". The URL normalization operation says, "treat the following as a subpath of the location the base URL is *contained in*".

Because of this, os.path.join assumes a path with a trailing separator is equivalent to a path without one, since that is the only reasonable way to interpret treating the joined path as a subpath of the base path.

But for a URL join, the path /foo and the path /foo/ are not only *different paths* referring to distinct objects, but the operation wants to refer to the *container* of the referenced object. /foo might refer to a directory, while /foo/ refers to some default content (e.g. index.html). This is actually why Apache normally redirects you from /foo to /foo/ before it serves up the index.html; relative URLs based on a base URL of /foo won't work right.

The URL approach is designed to make peer-to-peer linking in a given directory convenient. Instead of referring to './foo.html' (as one would have to do with filenames), you can simply refer to 'foo.html'. But the cost of saving those characters in every link is that joining always takes place on the parent, never the tail-end. Thus directory URLs normally end in a trailing /, and most tools tend to automatically redirect when somebody leaves it off. (Because otherwise the links would be wrong.)
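As a rough illustration of the distinction drawn above, the sketch below runs the same base and relative name through both kinds of join. It uses posixpath.join (the POSIX flavour of os.path.join, so the output does not depend on the host OS) and Python 3's urllib.parse.urljoin; using Python 3 is an assumption here, since the thread is about the Python 2 urlparse module, and the example.com URLs are placeholders.

```python
import posixpath
from urllib.parse import urljoin

# Filesystem join: a trailing separator on the base makes no difference.
fs_no_slash = posixpath.join("/foo", "bar.html")
fs_slash = posixpath.join("/foo/", "bar.html")

# URL join: the reference is resolved against the *container* of the base,
# so the trailing slash changes the result.
url_no_slash = urljoin("http://example.com/foo", "bar.html")
url_slash = urljoin("http://example.com/foo/", "bar.html")

print(fs_no_slash, fs_slash)      # both are /foo/bar.html
print(url_no_slash)               # http://example.com/bar.html
print(url_slash)                  # http://example.com/foo/bar.html
```

The second URL result is the reason a server redirects /foo to /foo/ before serving a directory index: without the redirect, relative links on that page would resolve one level too high.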
Phillip J. Eby wrote:
At 01:56 AM 11/4/2006 +0100, Andrew Dalke wrote:
os.join assumes the base is a directory name when used in a join: "inserting '/' as needed" while RFC 1808 says
The last segment of the base URL's path (anything following the rightmost slash "/", or the entire path if no slash is present) is removed
Is my intuition wrong in thinking those should be the same?
Yes. :)
Path combining and URL absolutization(?) are inherently different operations with only superficial similarities. One reason for this is that a trailing / on a URL has an actual meaning, whereas in filesystem paths a trailing / is an aberration and likely an actual error.
The path combining operation says, "treat the following as a subpath of the base path, unless it is absolute". The URL normalization operation says, "treat the following as a subpath of the location the base URL is *contained in*".
Because of this, os.path.join assumes a path with a trailing separator is equivalent to a path without one, since that is the only reasonable way to interpret treating the joined path as a subpath of the base path.
But for a URL join, the path /foo and the path /foo/ are not only *different paths* referring to distinct objects, but the operation wants to refer to the *container* of the referenced object. /foo might refer to a directory, while /foo/ refers to some default content (e.g. index.html). This is actually why Apache normally redirects you from /foo to /foo/ before it serves up the index.html; relative URLs based on a base URL of /foo won't work right.
The URL approach is designed to make peer-to-peer linking in a given directory convenient. Instead of referring to './foo.html' (as one would have to do with filenames), you can simply refer to 'foo.html'. But the cost of saving those characters in every link is that joining always takes place on the parent, never the tail-end. Thus directory URLs normally end in a trailing /, and most tools tend to automatically redirect when somebody leaves it off. (Because otherwise the links would be wrong.)
Having said this, Andrew *did* demonstrate quite convincingly that the current urljoin has some fairly egregious directory traversal glitches. Is it really right to punt obvious gotchas like
urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
to the server? regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Skype: holdenweb http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden
Steve Holden wrote:
Having said this, Andrew *did* demonstrate quite convincingly that the current urljoin has some fairly egregious directory traversal glitches. Is it really right to punt obvious gotchas like
urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
to the server?
See Paul Jimenez's thread about replacing urlparse with something better. The current module has some serious issues :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
On 11/3/06, Steve Holden <steve@holdenweb.com> wrote:
Having said this, Andrew *did* demonstrate quite convincingly that the current urljoin has some fairly egregious directory traversal glitches. Is it really right to punt obvious gotchas like
urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
Ah, but how do you know when that's wrong? At least under ftp:// your root is often a mid-level directory until you change up out of it. http:// will tend to treat the targets as roots, but I don't know that there's any requirement for a /.. to be meaningless (even if it often is). -- Michael Urman http://www.tortall.net/../mu/blog ;)
Michael Urman wrote:
On 11/3/06, Steve Holden <steve@holdenweb.com> wrote:
Having said this, Andrew *did* demonstrate quite convincingly that the current urljoin has some fairly egregious directory traversal glitches. Is it really right to punt obvious gotchas like
urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
Ah, but how do you know when that's wrong? At least under ftp:// your root is often a mid-level directory until you change up out of it. http:// will tend to treat the targets as roots, but I don't know that there's any requirement for a /.. to be meaningless (even if it often is).
I'm darned if I know. I simply know that it isn't right for http resources. regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Skype: holdenweb http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden
Steve Holden wrote:
Ah, but how do you know when that's wrong? At least under ftp:// your root is often a mid-level directory until you change up out of it. http:// will tend to treat the targets as roots, but I don't know that there's any requirement for a /.. to be meaningless (even if it often is).
I'm darned if I know. I simply know that it isn't right for http resources.
the URI specification disagrees; an URI that starts with "../" is perfectly legal, and the specification explicitly states how it should be interpreted. (it's important to realize that "urijoin" produces equivalent URI:s, not file names) </F>
Steve:
I'm darned if I know. I simply know that it isn't right for http resources.
/F:
the URI specification disagrees; an URI that starts with "../" is perfectly legal, and the specification explicitly states how it should be interpreted.
I have looked at the spec, and can't figure out how its explanation matches the observed urljoin results. Steve's excerpt trimmed out the strangest example.
urlparse.urljoin("http://blah.com/a/b/c", "../../../")
'http://blah.com/../'
urlparse.urljoin("http://blah.com/a/b/c", "../../../..") # What?!
'http://blah.com/'
urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
'http://blah.com/../../'
(it's important to realize that "urijoin" produces equivalent URI:s, not file names)
Both, though, are "paths". The OP, Mike Orr, wrote:

I agree that supporting non-filesystem directories (zip files, CVS/Subversion sandboxes, URLs) would be nice, but we already have a big enough project without that. What constraints should a Path object keep in mind in order to be forward-compatible with this?

Is the answer therefore that URLs and URI behaviour should not place constraints on a Path object because they are sufficiently dissimilar from file-system paths? Do these other non-FS hierarchical structures have similar differences causing a semantic mismatch?

Andrew
dalke@dalkescientific.com
On 11/5/06, Andrew Dalke <dalke@dalkescientific.com> wrote:
I agree that supporting non-filesystem directories (zip files, CVS/Subversion sandboxes, URLs) would be nice, but we already have a big enough project without that. What constraints should a Path object keep in mind in order to be forward-compatible with this?
Is the answer therefore that URLs and URI behaviour should not place constraints on a Path object because they are sufficiently dissimilar from file-system paths? Do these other non-FS hierarchical structures have similar differences causing a semantic mismatch?
This discussion has reinforced my belief that os.path.join's behavior is correct with non-initial absolute args:

os.path.join('/usr/bin', '/usr/local/bin/python')

I've used that in applications and haven't found it a burden. Its behavior with '..' seems justifiable too, and Talin's trick of wrapping everything in os.path.normpath is a great one.

I do think join should take more care to avoid multiple slashes together in the middle of a path, although this is really the responsibility of the platform library, not a generic function/method. Join is true to its documentation of only adding separators and never deleting them, but that seems like a bit of sloppiness. On the other hand, the filesystems don't care; I don't think anybody has mentioned a case where it actually creates a path the filesystem can't handle.

urljoin clearly has a different job. When we talked about extending path to URLs, I was thinking more in terms of opening files, fetching resources, deleting, renaming, etc. rather than split-modify-rejoin. A hypothetical urlpath module would clearly have to follow the URL rules. I don't see a contradiction in supporting both URL joining rules and having a non-initial absolute argument, just to avoid cross-"platform" surprises. But urlpath would also need methods to parse the scheme and host on demand, query strings, #fragments, a class method for building a URL from the smallest parts, etc.

As for supporting path fragments and '..' in join arguments (for filesystem paths), it's clearly too widely used to eliminate. Users can voluntarily refrain from passing arguments containing separators. For cases involving a user-supplied -- possibly hostile -- path, either a separate method (safe_join, child) could achieve this, or a subclass implementation that allows only safe arguments.

Regarding pathname-manipulation methods and filesystem-access methods, I'm not sure how workable it is to have separate objects for them.
os.mkdir( Path("/usr/local/lib/python/Cheetah/Template.py").parent )
Path("/usr/local/lib/python/Cheetah/Template.py").parent.mkdir()
FileAccess( Path("/usr/local/lib/python/Cheetah/Template.py").parent ).mkdir()

The first two are reasonable. The third... who would want to do this for every path? How often would you reuse the FileAccess object? I typically create Path objects from configuration values and keep them around for the entire application; e.g., data_dir. Then I create derived paths as necessary. I suppose if the FileAccess object has a .path attribute, it could do double-duty so you wouldn't have to store the path separately. Is this what the advocates of two classes have in mind? With usage like this?

my_file = FileAccess( file_access_obj.path.joinpath("my_file") )
my_file = FileAccess( Path(file_access_obj.path, "my_file") )

Working on my Path implementation. (Yes it's necessary, Glyph, at least to me.) It's going slow because I just got a Macintosh laptop and am still rounding up packages to install.

-- Mike Orr <sluggoster@gmail.com>
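The join behaviours discussed in this message can be sketched as follows. posixpath is used instead of os.path so the results are the same on every platform. The safe_join function is hypothetical: the message only names it as a possible method, and the checks below are one guess at what "allows only safe arguments" could mean, not an existing API.

```python
import posixpath

# A non-initial absolute argument discards everything before it.
joined = posixpath.join("/usr/bin", "/usr/local/bin/python")

# join only adds separators, so doubled slashes and ".." survive ...
raw = posixpath.join("/usr//lib", "../bin/python")
# ... until the result is wrapped in normpath.
clean = posixpath.normpath(raw)


def safe_join(base, name):
    """Hypothetical helper: accept only a single plain child name."""
    if not name or name in (".", "..") or "/" in name or "\0" in name:
        raise ValueError("unsafe path component: %r" % (name,))
    return posixpath.join(base, name)


print(joined)   # /usr/local/bin/python
print(raw)      # /usr//lib/../bin/python
print(clean)    # /usr/bin/python
print(safe_join("/srv/data", "report.txt"))
```

Rejecting separators outright is stricter than normalising and then checking the prefix, but it keeps the rule simple for the hostile-input case the message has in mind.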
Andrew Dalke schrieb:
I have looked at the spec, and can't figure out how its explanation matches the observed urljoin results. Steve's excerpt trimmed out the strangest example.
Unfortunately, you didn't say which of these you want explained. As it is tedious to write down even a single one, I'll restrict myself to the one with the What?! remark.
urlparse.urljoin("http://blah.com/a/b/c", "../../../..") # What?!
'http://blah.com/'
Please follow me through section 5 of http://www.ietf.org/rfc/rfc3986.txt

5.2.1: Pre-parse the Base URI
  B.scheme = "http"
  B.authority = "blah.com"
  B.path = "/a/b/c"
  B.query = undefined
  B.fragment = undefined

5.2.2: Transform References
  parse("../../../..")
  R.scheme = R.authority = R.query = R.fragment = undefined
  R.path = "../../../.."
  (strictness not relevant, R.scheme is already undefined)
  R.scheme is not defined
  R.authority is not defined
  R.path is not ""
  R.path does not start with /
  T.path = merge("/a/b/c", "../../../..")
  T.path = remove_dot_segments(T.path)
  T.authority = "blah.com"
  T.scheme = "http"
  T.fragment = undefined

5.2.3 Merge paths
  merge("/a/b/c", "../../../..") = (base URI does have path) "/a/b/../../../.."

5.2.4 Remove Dot Segments
  remove_dot_segments("/a/b/../../../..")
  1. I = "/a/b/../../../.."  O = ""
  2. A (does not apply)
     B (does not apply)
     C (does not apply)
     D (does not apply)
     E O="/a"   I="/b/../../../.."
  2. E O="/a/b" I="/../../../.."
  2. C O="/a"   I="/../../.."
  2. C O=""     I="/../.."
  2. C O=""     I="/.."
  2. C O=""     I="/"
  2. E O="/"    I=""
  3. Result: "/"

5.3 Component Recomposition
  result = ""
  (scheme is defined)    result = "http:"
  (authority is defined) result = "http://blah.com"
  (append path)          result = "http://blah.com/"

HTH,
Martin
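The merge and remove-dot-segments steps walked through above can be sketched in a few lines of Python. This is a reading of RFC 3986 sections 5.2.3 and 5.2.4 written for illustration, not the code of urlparse or any other library; the function names mirror the RFC's pseudocode, and the base_has_authority parameter is an assumption added to keep merge() self-contained.

```python
def merge(base_path, ref_path, base_has_authority=True):
    """RFC 3986 5.2.3: combine a base path with a relative-path reference."""
    if base_has_authority and not base_path:
        return "/" + ref_path
    # Keep everything up to and including the last "/" of the base path.
    return base_path[:base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path):
    """RFC 3986 5.2.4: interpret and remove "." and ".." segments."""
    inp, out = path, []
    while inp:
        if inp.startswith("../"):            # 2A
            inp = inp[3:]
        elif inp.startswith("./"):           # 2A
            inp = inp[2:]
        elif inp.startswith("/./"):          # 2B
            inp = inp[2:]
        elif inp == "/.":                    # 2B
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":   # 2C
            inp = "/" + inp[4:]
            if out:
                out.pop()                    # drop the last output segment
        elif inp in (".", ".."):             # 2D
            inp = ""
        else:                                # 2E: move one segment to output
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out.append(inp[:end])
            inp = inp[end:]
    return "".join(out)


merged = merge("/a/b/c", "../../../..")
print(merged)                        # /a/b/../../../..
print(remove_dot_segments(merged))   # /
```

Step 2C silently ignores a ".." that has nothing left to remove, which is why any number of extra ".." segments still ends at "/".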
Martin:
Unfortunately, you didn't say which of these you want explained. As it is tedious to write down even a single one, I'll restrict myself to the one with the What?! remark.
urlparse.urljoin("http://blah.com/a/b/c", "../../../..") # What?!
'http://blah.com/'
The "What?!" is in context with the previous and next entries. I've reduced it to a simpler case
urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'
Does the result make sense to you? Does it make sense that the last of these is shorter than the middle one? It sure doesn't to me. I thought it was obvious that there was an error; obvious enough that I didn't bother to track down why - especially as my main point was to argue there are different ways to deal with hierarchical/path-like schemes, each correct for its given domain.
Please follow me through section 5 of
The core algorithm causing the "what?!" comes from "remove_dot_segments", section 5.2.4. In parallel my 3 cases should give:

5.2.4 Remove Dot Segments

remove_dot_segments("/..")   r_d_s("/../")            r_d_s("/../..")
1. I = "/.."                 I="/../"                 I="/../.."
   O = ""                    O=""                     O=""
2A. (does not apply)         2A. (does not apply)     2A. (does not apply)
2B. (does not apply)         2B. (does not apply)     2B. (does not apply)
2C. O="" I="/"               2C. O="" I="/"           2C. O="" I="/.."
2A. (does not apply)         2A. (does not apply)     .. reduces to r_d_s("/..")
2B. (does not apply)         2B. (does not apply)     3. Result "/"
2C. (does not apply)         2C. (does not apply)
2D. (does not apply)         2D. (does not apply)
2E. O="/", I=""              2E. O="/", I=""
3. Result: "/"               3. Result "/"

My reading of the RFC 3986 says all three examples should produce the same result. The fact that my "what?!" comment happens to be correct according to that RFC is purely coincidental.

Then again, urlparse.py does *not* claim to be RFC 3986 compliant. The module docstring is

"""Parse (absolute and relative) URLs.

See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
UC Irvine, June 1995.
"""

I tried the same code with 4Suite, which does claim compliance, and get
import Ft
from Ft.Lib import Uri
Uri.Absolutize("..", "http://blah.com/")
'http://blah.com/'
Uri.Absolutize("../", "http://blah.com/")
'http://blah.com/'
Uri.Absolutize("../..", "http://blah.com/")
'http://blah.com/'
The text of its Uri.py says

This function is similar to urlparse.urljoin() and urllib.basejoin(). Those functions, however, are (as of Python 2.3) outdated, buggy, and/or designed to produce results acceptable for use with other core Python libraries, rather than being earnest implementations of the relevant specs. Their problems are most noticeable in their handling of same-document references and 'file:' URIs, both being situations that come up far too often to consider the functions reliable enough for general use.
"""
# Reasons to avoid using urllib.basejoin() and urlparse.urljoin():
# - Both are partial implementations of long-obsolete specs.
# - Both accept relative URLs as the base, which no spec allows.
# - urllib.basejoin() mishandles the '' and '..' references.
# - If the base URL uses a non-hierarchical or relative path,
#   or if the URL scheme is unrecognized, the result is not
#   always as expected (partly due to issues in RFC 1808).
# - If the authority component of a 'file' URI is empty,
#   the authority component is removed altogether. If it was
#   not present, an empty authority component is in the result.
# - '.' and '..' segments are not always collapsed as well as they
#   should be (partly due to issues in RFC 1808).
# - Effective Python 2.4, urllib.basejoin() *is* urlparse.urljoin(),
#   but urlparse.urljoin() is still based on RFC 1808.

In searching the archives
http://mail.python.org/pipermail/python-dev/2005-September/056152.html

Fabien Schwob:
I'm using the module urlparse and I think I've found a bug in the urlparse module. When you merge a URL and a link like "../../../page.html" with urljoin, the new URL created keeps some "../" in it. Here is an example:
import urlparse
begin = "http://www.example.com/folder/page.html"
end = "../../../otherpage.html"
urlparse.urljoin(begin, end)
'http://www.example.com/../../otherpage.html'
Guido:
You shouldn't be giving more "../" sequences than are possible. I find the current behavior acceptable.
(Apparently for RFC 1808 that's a valid answer; it was an implementation choice in how to handle that case.) While not directly relevant, postings like John J Lee's http://mail.python.org/pipermail/python-bugs-list/2006-February/031875.html
The urlparse.urlparse() code should not be changed, for backwards compatibility reasons.
strongly suggest a desire to not change that code. The last definitive statement on this topic that I could find was mentioned in http://www.python.org/dev/summary/2005-11-16_2005-11-30/#updating-urlparse-t...
Guido pointed out that the main purpose of urlparse is to be RFC-compliant. Paul explained that the current code is valid according to RFC 1808 (1995-1998), but that this was superseded by RFC 2396 (1998-2004) and RFC 3986 (2005-). Guido was convinced, and asked for a new API (for backwards compatibility) and a patch to be submitted via sourceforge.
As this is not a bug, I have added the feature request 1591035 to SF titled "update urlparse to RFC 3986". Nothing else appeared to exist on that specific topic. Andrew dalke@dalkescientific.com
Andrew Dalke schrieb:
urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'
Does the result make sense to you? Does it make sense that the last of these is shorter than the middle one? It sure doesn't to me. I thought it was obvious that there was an error;
That wasn't obvious at all to me. Now looking at the examples, I agree there is an error. The middle one is incorrect; urlparse.urljoin("http://blah.com/", "../") should also give 'http://blah.com/'.
You shouldn't be giving more "../" sequences than are possible. I find the current behavior acceptable.
(Apparently for RFC 1808 that's a valid answer; it was an implementation choice in how to handle that case.)
There is still some text left to that respect in 5.4.2 of RFC 3986.
While not directly relevant, postings like John J Lee's http://mail.python.org/pipermail/python-bugs-list/2006-February/031875.html
The urlparse.urlparse() code should not be changed, for backwards compatibility reasons.
strongly suggest a desire to not change that code.
This is John J Lee's opinion, of course. I don't see a reason not to fix such bugs, or to update the implementation to the current RFCs.
As this is not a bug, I have added the feature request 1591035 to SF titled "update urlparse to RFC 3986". Nothing else appeared to exist on that specific topic.
Thanks. It always helps to be more specific; being less specific often hurts. I find there is a difference between "urllib behaves non-intuitively" and "urllib gives result A for parameters B and C, but should give result D instead". Can you please add specific examples to your report that demonstrate the difference between implemented and expected behavior? Regards, Martin
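A report of the "gives A for B and C, but should give D" kind asked for above could be sketched as a small table of cases. The base URI and expected results below are taken from the reference-resolution examples in RFC 3986 section 5.4; the check() helper is hypothetical, and the default join function is Python 3's urllib.parse.urljoin, which is an assumption (the thread concerns the older urlparse module, whose results differ).

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# (relative reference, expected result per RFC 3986 section 5.4)
CASES = [
    ("g", "http://a/b/c/g"),
    ("../g", "http://a/b/g"),
    ("../..", "http://a/"),
    ("../../../g", "http://a/g"),   # more ".." than the base has levels
]


def check(join=urljoin):
    """Return (reference, actual, expected) for every case that disagrees."""
    failures = []
    for ref, expected in CASES:
        actual = join(BASE, ref)
        if actual != expected:
            failures.append((ref, actual, expected))
    return failures


for ref, actual, expected in check():
    print("urljoin(%r, %r) gives %r, expected %r" % (BASE, ref, actual, expected))
```

Passing a different join function to check() makes it easy to run the same table against another implementation.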
Me [Andrew]:
As this is not a bug, I have added the feature request 1591035 to SF titled "update urlparse to RFC 3986". Nothing else appeared to exist on that specific topic.
Martin:
Thanks. It always helps to be more specific; being less specific often hurts.
So does being more specific. I wasn't trying to report a bug in urlparse. I figured everyone knew the problems existed. The code comments say so and various back discussions on this list say so. All I wanted to do was point out that two seemingly similar problems - path traversal of hierarchical structures - had two different expected behaviors. Now I've spent entirely too much time on specifics I didn't care about and didn't think were important. I've also been known to do the full report and have people ignore what I wrote because it was too long.
I find there is a difference between "urllib behaves non-intuitively" and "urllib gives result A for parameters B and C, but should give result D instead". Can you please add specific examples to your report that demonstrate the difference between implemented and expected behavior?
No.

I consider the "../" cases to be unimportant edge cases and I would rather people fixed the other problems highlighted in the text I copied from 4Suite's Uri.py -- like improperly allowing a relative URL as the base url, which I incorrectly assumed was legit - and that others have reported on python-dev, easily found with Google.

If I only add test cases for "../" then I believe that that's all that will be fixed. Given the back history of this problem and lack of followup I also believe it won't be fixed unless someone develops a brand new module, from scratch, which will be added to some future Python version. There's probably a compliance suite out there to use for this sort of task. I hadn't bothered to look as I am no more proficient than others here at Google.

Finally, I see that my report is a dup. SF search is poor. As Nick Coghlan reported, Paul Jimenez has a replacement for urlparse. Summarized in http://www.python.org/dev/summary/2006-04-01_2006-04-15/ It was submitted in spring as a patch - SF# 1462525 at http://sourceforge.net/tracker/index.php?func=detail&aid=1462525&group_id=5470&atid=305470 which I didn't find in my earlier searching.

Andrew
dalke@dalkescientific.com
Andrew Dalke schrieb:
I find there is a difference between "urllib behaves non-intuitively" and "urllib gives result A for parameters B and C, but should give result D instead". Can you please add specific examples to your report that demonstrate the difference between implemented and expected behavior?
No.
I consider the "../" cases to be unimportant edge cases and I would rather people fixed the other problems highlighted in the text I copied from 4Suite's Uri.py -- like improperly allowing a relative URL as the base url, which I incorrectly assumed was legit - and that others have reported on python-dev, easily found with Google.
It still should be possible to come up with examples for these as well, no? For example, if you pass a relative URI as the base URI, what would you like to see happen?
If I only add test cases for "../" then I believe that that's all that will be fixed.
That's true. Actually, it's probably not true; it will only get fixed if some volunteer contributes a fix.
Finally, I see that my report is a dup. SF search is poor. As Nick Coghlan reported, Paul Jimenez has a replacement for urlparse. Summarized in http://www.python.org/dev/summary/2006-04-01_2006-04-15/ It was submitted in spring as a patch - SF# 1462525 at http://sourceforge.net/tracker/index.php?func=detail&aid=1462525&group_id=5470&atid=305470 which I didn't find in my earlier searching.
So do you think this patch meets your requirements? This topic (URL parsing) is not only inherently difficult to implement, it is just as tedious to review. Without anybody reviewing the contributed code, it's certain that it will never be incorporated. Regards, Martin
Martin:
It still should be possible to come up with examples for these as well, no? For example, if you pass a relative URI as the base URI, what would you like to see happen?
Until two days ago I didn't even realize that was an incorrect use of urljoin. I can't be the only one. Hence, raise an exception - just like 4Suite's Uri.py does.
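The "raise an exception" behaviour suggested here could look roughly like the wrapper below. strict_urljoin is a hypothetical name, not part of urlparse or 4Suite, and the check shown (the base must carry a scheme) is only one possible reading of "absolute"; Python 3's urllib.parse is assumed.

```python
from urllib.parse import urljoin, urlsplit


def strict_urljoin(base, ref):
    """Hypothetical wrapper: refuse to resolve against a relative base."""
    if not urlsplit(base).scheme:
        raise ValueError("base URI must be absolute: %r" % (base,))
    return urljoin(base, ref)


print(strict_urljoin("http://spam/", "foo/bar"))   # http://spam/foo/bar

try:
    strict_urljoin("spam/eggs", "foo/bar")
except ValueError as exc:
    print("rejected:", exc)
```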
That's true. Actually, it's probably not true; it will only get fixed if some volunteer contributes a fix.
And it's not I. A true fix is a lot of work. I would rather use Uri.py, now that I see it handles everything I care about, and then some. Eg, file name <-> URI conversion.
So do you think this patch meets your requirements?
# new
uriparse.urljoin("http://spam/", "foo/bar")
'http://spam//foo/bar'
# existing
urlparse.urljoin("http://spam/", "foo/bar")
'http://spam/foo/bar'
No. That was the first thing I tried. Also found
urlparse.urljoin("http://blah", "/spam/")
'http://blah/spam/'
uriparse.urljoin("http://blah", "/spam/")
'http://blah/spam'
I reported these on the patch page. Nothing else strange came up, but I did only try http urls and not the others.

My "requirements", meaning my vague, spur-of-the-moment thoughts without any research or experimentation to determine their validity, are different than those for Python. My real requirements are met by the existing code. My imagined ones include support for edge cases, the idna codec, unicode, and real-world use on a variety of OSes. 4Suite's Uri.py seems to have this. Eg, lots of edge-case code like

    # On Windows, ensure that '|', not ':', is used in a drivespec.
    if os.name == 'nt' and scheme == 'file':
        path = path.replace(':','|',1)

Hence the uriparse.py patch does not meet my hypothetical requirements.

Python's requirements are probably to get closer to the spec. In which case yes, it's at least as good as and likely generally better than the existing module, modulo a few API naming debates and perhaps some rough edges which will be found when put into use. And perhaps various arguments about how bug compatible it should be and if the old code should be available as well as the new one, for those who depend on the existing 1808-allowed implementation dependent behavior. For those I have not the experience to guide me and no care to push the debate. I've decided I'm going to experiment using 4Suite's Uri.py for my code because it handles things I want which are outside of the scope of uriparse.py
This topic (URL parsing) is not only inherently difficult to implement, it is just as tedious to review. Without anybody reviewing the contributed code, it's certain that it will never be incorporated.
I have a different opinion. Python's url manipulation code is a mess. urlparse, urllib, urllib2. Why is "urlencode" part of urllib and not urllib2? For that matter, urllib is labeled 'Open an arbitrary URL' and not 'and also do manipulations on parts of URLs.'

I don't want to start fixing code because doing it the way I want to requires a new API and a much better understanding of the RFCs than I care about, especially since 4Suite and others have already done this. Hence I would say to just grab their library. And perhaps update the naming scheme.

Also, urlgrabber and pycURL are better for downloading arbitrary URIs. For some definitions of "better".

Andrew
dalke@dalkescientific.com
Andrew Dalke schrieb:
Hence I would say to just grab their library. And perhaps update the naming scheme.
Unfortunately, this is not an option. *You* can just grab their library; the Python distribution can't. Doing so would mean to fork, and history tells that forks cause problems in the long run. OTOH, if the 4Suite people would contribute the library, integrating it would be an option. Regards, Martin
Martin v. Löwis wrote:
Andrew Dalke schrieb:
urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'
Does the result make sense to you? Does it make sense that the last of these is shorter than the middle one? It sure doesn't to me. I thought it was obvious that there was an error;
That wasn't obvious at all to me. Now looking at the examples, I agree there is an error. The middle one is incorrect;
urlparse.urljoin("http://blah.com/", "../")
should also give 'http://blah.com/'.
make that: could also give 'http://blah.com/'. as I said, today's urljoin doesn't guarantee that the output is the *shortest* possible way to represent the resulting URI. </F>
Andrew:
urlparse.urljoin("http://blah.com/", "..")
'http://blah.com/'
urlparse.urljoin("http://blah.com/", "../")
'http://blah.com/../'
urlparse.urljoin("http://blah.com/", "../..")
'http://blah.com/'
/F:
as I said, today's urljoin doesn't guarantee that the output is the *shortest* possible way to represent the resulting URI.
I didn't think anyone was making that claim. The module claims RFC 1808 compliance. From the docstring:

DESCRIPTION
    See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding, UC Irvine, June 1995.

Now quoting from RFC 1808:

5.2. Abnormal Examples

Although the following abnormal examples are unlikely to occur in normal practice, all URL parsers should be capable of resolving them consistently. Each example uses the same base as above.

An empty reference resolves to the complete base URL:

<> = <URL:http://a/b/c/d;p?q#f>

Parsers must be careful in handling the case where there are more relative path ".." segments than there are hierarchical levels in the base URL's path.

My claim is that "consistent" implies "in the spirit of the rest of the RFC" and "to a human trying to make sense of the results" and not only mean "does the same thing each time." Else
urljoin("http://blah.com/", "../../..")
'http://blah.com/there/were/too/many/dot-dot/path/elements/in/the/relative/ur...'
would be equally consistent.
>>> for rel in ".. ../ ../.. ../../ ../../.. ../../../ ../../../..".split():
...     print repr(rel), repr(urlparse.urljoin("http://blah.com/", rel))
...
'..' 'http://blah.com/'
'../' 'http://blah.com/../'
'../..' 'http://blah.com/'
'../../' 'http://blah.com/../../'
'../../..' 'http://blah.com/../'
'../../../' 'http://blah.com/../../../'
'../../../..' 'http://blah.com/../../'
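For comparison, the same loop can be written against Python 3's urllib.parse.urljoin. That module choice is an assumption, not something from the thread: since Python 3.5 its documentation describes urljoin as following RFC 3986 semantics, so it should not reproduce the alternating pattern shown above.

```python
from urllib.parse import urljoin

base = "http://blah.com/"
refs = ".. ../ ../.. ../../ ../../.. ../../../ ../../../..".split()

# Resolve every reference against the same base and collect the results.
results = {rel: urljoin(base, rel) for rel in refs}

for rel, joined in results.items():
    print(repr(rel), repr(joined))

# With RFC 3986 semantics the surplus ".." segments are dropped, so every
# reference here should resolve to the same URI.
print(set(results.values()))
```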
I grant there is a consistency there. It's not one most would have predicted beforehand. Then again, "should" is that wishy-washy "unless you've got a good reason to do it a different way" sort of constraint. Andrew dalke@dalkescientific.com
Andrew Dalke wrote:
as I said, today's urljoin doesn't guarantee that the output is the *shortest* possible way to represent the resulting URI.
I didn't think anyone was making that claim. The module claims RFC 1808 compliance. From the docstring:
DESCRIPTION See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding, UC Irvine, June 1995.
Now quoting from RFC 1808:
5.2. Abnormal Examples
Although the following abnormal examples are unlikely to occur in normal practice, all URL parsers should be capable of resolving them consistently.
My claim is that "consistent" implies "in the spirit of the rest of the RFC" and "to a human trying to make sense of the results" and not only mean "does the same thing each time." Else
urljoin("http://blah.com/", "../../..")
'http://blah.com/there/were/too/many/dot-dot/path/elements/in/the/relative/ur...'
would be equally consistent.
perhaps, but such an urljoin wouldn't pass the

minimize(base + relative) == minimize(urljoin(base, relative))

test that today's urljoin passes (where "minimize" is defined as "create the shortest possible URI that identifies the same target, according to the relevant RFC").

isn't the real issue in this subthread whether urljoin should be expected to pass the

minimize(base + relative) == urljoin(base, relative)

test?

</F>
Fredrik Lundh wrote:
Andrew Dalke wrote:
as I said, today's urljoin doesn't guarantee that the output is the *shortest* possible way to represent the resulting URI.
I didn't think anyone was making that claim. The module claims RFC 1808 compliance. From the docstring:
DESCRIPTION See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding, UC Irvine, June 1995.
Now quoting from RFC 1808:
5.2. Abnormal Examples
Although the following abnormal examples are unlikely to occur in normal practice, all URL parsers should be capable of resolving them consistently.
My claim is that "consistent" implies "in the spirit of the rest of the RFC" and "to a human trying to make sense of the results" and not only mean "does the same thing each time." Else
urljoin("http://blah.com/", "../../..")
'http://blah.com/there/were/too/many/dot-dot/path/elements/in/the/relative/ur...'
would be equally consistent.
perhaps, but such an urljoin wouldn't pass the
minimize(base + relative) == minimize(urljoin(base, relative))
test that today's urljoin passes (where "minimize" is defined as "create the shortest possible URI that identifies the same target, according to the relevant RFC").
isn't the real issue in this subthread whether urljoin should be expected to pass the
minimize(base + relative) == urljoin(base, relative)
test?
I should hope that *is* the issue, and I should further hope that the general wish would be for it to pass that test. Of course web systems have been riddled with canonicalization errors in the past, so it'd be best if you and/or Andrew could provide a minimize() implementation :-) regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Skype: holdenweb http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden
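One way the requested minimize() might look is sketched below. It is a hypothetical helper, not anything posted in the thread: it only collapses "." and ".." path segments of an absolute URI in the spirit of RFC 3986 section 5.2.4, ignores other normalisations (case, percent-encoding, default ports), and assumes Python 3's urllib.parse. Both tests from the quoted message are then evaluated for a base ending in "/", the only case where plain string concatenation of base and relative makes sense.

```python
from urllib.parse import urljoin, urlsplit


def minimize(uri):
    """Hypothetical helper: collapse dot segments in an absolute URI's path."""
    parts = urlsplit(uri)
    segments = parts.path.split("/")
    stack = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            # Drop the previous segment, but never the leading "" (the root).
            if len(stack) > 1:
                stack.pop()
            continue
        stack.append(seg)
    # A trailing "." or ".." names a directory, so keep the trailing slash.
    if segments[-1] in (".", ".."):
        stack.append("")
    return parts._replace(path="/".join(stack)).geturl()


base = "http://blah.com/"
refs = ["..", "../", "../..", "../../"]

# The weaker test: both sides are minimized before comparing.
loose = all(minimize(base + rel) == minimize(urljoin(base, rel)) for rel in refs)
# The stronger test: urljoin itself must already return the minimal form.
strict = all(minimize(base + rel) == urljoin(base, rel) for rel in refs)

print(minimize("http://blah.com/../../"))
print(loose, strict)
```

Whether the stronger test holds depends on the urljoin implementation in use; the weaker one only requires that urljoin return some URI equivalent to the minimal form.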
Fredrik Lundh schrieb:
urlparse.urljoin("http://blah.com/", "../")
should also give 'http://blah.com/'.
make that: could also give 'http://blah.com/'.
How so? If that would implement RFC 3986, you can get only a single outcome, if urljoin is meant to implement section 5 of that RFC. Regards, Martin
Michael Urman writes:
Ah, but how do you know when that's wrong? At least under ftp:// your root is often a mid-level directory until you change up out of it. http:// will tend to treat the targets as roots, but I don't know that there's any requirement for a /.. to be meaningless (even if it often is).
ftp and http schemes both have authority ("host") components, so the meaning of ".." path components is defined in the same way for both by section 5 of RFC 3986. Of course an FTP server is not bound to interpret the protocol so as to mimic URL semantics. But that's a different question.
participants (9)
- "Martin v. Löwis"
- Andrew Dalke
- Fredrik Lundh
- Michael Urman
- Mike Orr
- Nick Coghlan
- Phillip J. Eby
- stephen@xemacs.org
- Steve Holden