[Python-Dev] Path object design

Sat Nov 4 05:34:12 CET 2006

Phillip J. Eby wrote:
> At 01:56 AM 11/4/2006 +0100, Andrew Dalke wrote:
> 
>>os.path.join assumes the base is a directory
>>name when used in a join: "inserting '/' as needed" while RFC
>>1808 says
>>
>>           The last segment of the base URL's path (anything
>>           following the rightmost slash "/", or the entire path if no
>>           slash is present) is removed
>>
>>Is my intuition wrong in thinking those should be the same?
> 
> 
> Yes.  :)
> 
> Path combining and URL absolutization(?) are inherently different 
> operations with only superficial similarities.  One reason for this is that 
> a trailing / on a URL has an actual meaning, whereas in filesystem paths a 
> trailing / is an aberration and likely an actual error.
> 
> The path combining operation says, "treat the following as a subpath of the 
> base path, unless it is absolute".  The URL normalization operation says, 
> "treat the following as a subpath of the location the base URL is 
> *contained in*".
> 
> Because of this, os.path.join assumes a path with a trailing separator is 
> equivalent to a path without one, since that is the only reasonable way to 
> interpret treating the joined path as a subpath of the base path.
> 
> But for a URL join, the path /foo and the path /foo/ are not only 
> *different paths* referring to distinct objects, but the operation wants to 
> refer to the *container* of the referenced object.  /foo might refer to a 
> directory, while /foo/ refers to some default content (e.g. 
> index.html).  This is actually why Apache normally redirects you from /foo 
> to /foo/ before it serves up the index.html; relative URLs based on a base 
> URL of /foo won't work right.
> 
> The URL approach is designed to make peer-to-peer linking in a given 
> directory convenient.  Instead of referring to './foo.html' (as one would 
> have to do with filenames), you can simply refer to 'foo.html'.  But the 
> cost of saving those characters in every link is that joining always takes 
> place on the parent, never the tail-end.  Thus directory URLs normally end 
> in a trailing /, and most tools tend to automatically redirect when 
> somebody leaves it off.  (Because otherwise the links would be wrong.)
> 
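A minimal sketch of the distinction drawn above, assuming Python 3's urllib.parse in place of the Python 2 urlparse module used in this thread, and posixpath so the results do not depend on the host OS. The example.com URLs and the variable names are made up for illustration.

```python
import posixpath
from urllib.parse import urljoin

# Path combining: the second argument becomes a subpath of the base,
# so a trailing separator on the base makes no difference.
path_with_slash = posixpath.join("/foo/", "bar.html")
path_without_slash = posixpath.join("/foo", "bar.html")

# ...unless the second argument is absolute, in which case it wins.
path_absolute = posixpath.join("/foo", "/bar.html")

# URL joining: the reference is resolved against the *container* of the
# base, so the last segment of the base path is dropped first.
url_with_slash = urljoin("http://example.com/foo/", "bar.html")
url_without_slash = urljoin("http://example.com/foo", "bar.html")

print(path_with_slash)     # /foo/bar.html
print(path_without_slash)  # /foo/bar.html
print(path_absolute)       # /bar.html
print(url_with_slash)      # http://example.com/foo/bar.html
print(url_without_slash)   # http://example.com/bar.html
```

The last two lines are the case Apache's redirect from /foo to /foo/ guards against: without the trailing slash, a relative link lands one level too high.
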
Having said this, Andrew *did* demonstrate quite convincingly that the 
current urljoin has some fairly egregious directory traversal glitches. 
Is it really right to punt obvious gotchas like

 >>> urlparse.urljoin("http://blah.com/a/b/c", "../../../../")
 'http://blah.com/../../'

to the server?
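
One way the client side could avoid punting this, sketched under assumptions: clamp_dot_segments and safe_urljoin are hypothetical helpers, not part of the standard library, and the code uses Python 3's urllib.parse rather than the urlparse module shown above. It assumes the joined path is absolute (starts with "/"), which holds when the base URL has a host. Later Python 3 releases document urljoin as following RFC 3986, so the clamp may be a no-op there.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def clamp_dot_segments(path):
    """Collapse '.' and '..' in an absolute path, never climbing above '/'.

    Hypothetical helper for illustration; assumes path starts with '/'.
    """
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            # out[0] is the empty string standing for the root; keep it.
            if len(out) > 1:
                out.pop()
        elif seg == ".":
            continue
        else:
            out.append(seg)
    # A path ending in '.' or '..' names a directory: keep a trailing slash.
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)

def safe_urljoin(base, ref):
    """urljoin, then strip any dot segments left over in the path."""
    parts = urlsplit(urljoin(base, ref))
    return urlunsplit(parts._replace(path=clamp_dot_segments(parts.path)))

print(clamp_dot_segments("/../../"))    # /
print(clamp_dot_segments("/a/b/../c"))  # /a/c
print(safe_urljoin("http://blah.com/a/b/c", "../../../../"))  # http://blah.com/
```

Whether silently clamping at the root is better than raising an error is a design choice; the sketch only shows that the excess '..' segments need not reach the server.
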

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Skype: holdenweb       http://holdenweb.blogspot.com
Recent Ramblings     http://del.icio.us/steve.holden