Re: [Python-Dev] Path object design
On 06:14 pm, fredrik@pythonware.com wrote:
glyph@divmod.com wrote:
I assert that it needs a better[1] interface because the current interface can lead to a variety of bugs through idiomatic, apparently correct usage. All the more because many of those bugs are related to critical errors such as security and data integrity.
instead of referring to some esoteric knowledge about file systems that us non-twisted-using mere mortals may not be evolved enough to under- stand,
On the contrary, twisted users understand even less, because (A) we've been demonstrated to get it wrong on numerous occasions in highly public and embarrassing ways and (B) we already have this class that does it all for us and we can't remember how it works :-).
maybe you could just make a list of common bugs that may arise due to idiomatic use of the existing primitives?
Here are some common gotchas that I can think of off the top of my head. Not all of these are resolved by Twisted's path class: Path manipulation: * This is confusing as heck:
os.path.join("hello", "/world") '/world' os.path.join("hello", "slash/world") 'hello/slash/world' os.path.join("hello", "slash//world") 'hello/slash//world' Trying to formulate a general rule for what the arguments to os.path.join are supposed to be is really hard. I can't really figure out what it would be like on a non-POSIX/non-win32 platform.
* it seems like slashes should be more aggressively converted to backslashes on windows, because it's near impossible to do anything with os.sep in the current situation. * "C:blah" does not mean what you think it means on Windows. Regardless of what you think it means, it is not that. I thought I understood it once as the current process having a current directory on every mapped drive, but then I had to learn about UNC paths of network mapped drives and it stopped making sense again. * There are special files on windows such as "CON" and "NUL" which exist in _every_ directory. Twisted does get around this, by looking at the result of abspath:
os.path.abspath("c:/foo/bar/nul") '\\\\nul'
* Sometimes a path isn't a path; the zip "paths" in sys.path are a good example. This is why I'm a big fan of including a polymorphic interface of some kind: this information is *already* being persisted in an ad-hoc and broken way now, so it needs to be represented; it would be good if it were actually represented properly. URL manipulation-as-path-manipulation is another; the recent perforce use-case mentioned here is a special case of that, I think. * paths can have spaces in them and there's no convenient, correct way to quote them if you want to pass them to some gross function like os.system - and a lot of the code that manipulates paths is shell-script-replacement crud which wants to call gross functions like os.system. Maybe this isn't really the path manipulation code's fault, but it's where people start looking when they want properly quoted path arguments. * you have to care about unicode sometimes. rarely enough that none of your tests will ever account for it, but often enough that _some_ users will notice breakage if your code is ever widely distributed. this is an even more obscure example, but pygtk always reports pathnames in utf8-encoded *byte* strings, regardless of your filesystem encoding. If you forget to decode/encode it, hilarity ensues. There's no consistent error reporting (as far as I can tell, I have encountered this rarely) and no real way to detect this until you have an actual insanely-configured system with an insanely-named file on it to test with. (Polymorphic interfaces might help a *bit* here. At worst, they would at least make it possible to develop a canonical "insanely encoded filesystem" test-case backend. At best, you'd absolutely have to work in terms of unicode all the time, and no implicit encoding issues would leak through to application code.) Twisted's thing doesn't deal with this at all, and it really should. * also *sort* of an encoding issue, although basically only for webservers or other network-accessible paths: thanks to some of these earlier issues as well as %2e%2e, there are effectively multiple ways to spell "..". Checking for all of them is impossible, you need to use the os.path APIs to determine if the paths you've got really relate in the ways you think they do. * os.pathsep can be, and actually sometimes is, embedded in a path. (again, more of a general path problem, not really python's fault) * relative path manipulation is difficult. ever tried to write the function to iterate two separate trees of files in parallel? shutil re-implements this twice completely differently via recursion, and it's harder to do with a generator (which is what you really want). you can't really split on os.sep and have it be correct due to the aforementioned windows-path issue, but that's what everybody does anyway. * os.path.split doesn't work anything like str.split. FS manipulation: * although individual operations are atomic, shutil.copytree and friends aren't. I've often seen python programs confused by partially-copied trees of files. This isn't even really an atomicity issue; it's often due to a traceback in the middle of a running python program which leaves the tree half-broken. * the documentation really can't emphasize enough how bad using 'os.path.exists/isfile/isdir', and then assuming the file continues to exist when it is a contended resource, is. It can be handy, but it is _always_ a race condition.
I promise to make a nice FAQ entry out of it, with proper attribution.
Thanks. The list here is just a brain dump, I'm not sure it's all appropriate for a FAQ, but I hope some of it is useful.
On 11/1/06, glyph@divmod.com <glyph@divmod.com> wrote:
On 06:14 pm, fredrik@pythonware.com wrote:
glyph@divmod.com wrote:
I assert that it needs a better[1] interface because the current interface can lead to a variety of bugs through idiomatic, apparently correct usage. All the more because many of those bugs are related to critical errors such as security and data integrity.
instead of referring to some esoteric knowledge about file systems that us non-twisted-using mere mortals may not be evolved enough to under- stand,
On the contrary, twisted users understand even less, because (A) we've been demonstrated to get it wrong on numerous occasions in highly public and embarrassing ways and (B) we already have this class that does it all for us and we can't remember how it works :-).
This is ironic coming from one of Python's celebrity geniuses. "We made this class but we don't know how it works." Actually, it's downright alarming coming from someone who knows Twisted inside and out yet still can't make sense of path patform oddities.
* This is confusing as heck:
os.path.join("hello", "/world") '/world'
That's in the documentation. I'm not sure it's "wrong". What should it do in this situation? Pretend the slash isn't there? This came up in the directory-tuple proposal. I said there was no reason to change the existing behavior of join. Noam favored an exception.
os.path.join("hello", "slash/world") 'hello/slash/world'
That has always been a loophole in the function, and many programs depend on it. Again, is it "wrong"? Should an embedded separator in an argument be an error? Obviously this depends on the user's knowledge that the separator happens to be slash.
os.path.join("hello", "slash//world") 'hello/slash//world'
Again a case of what "should" it do? The filesystem treats it as a single slash. The user didn't call normpath, so should we normalize it anyway?
* Sometimes a path isn't a path; the zip "paths" in sys.path are a good example. This is why I'm a big fan of including a polymorphic interface of some kind: this information is *already* being persisted in an ad-hoc and broken way now, so it needs to be represented; it would be good if it were actually represented properly. URL manipulation-as-path-manipulation is another; the recent perforce use-case mentioned here is a special case of that, I think.
Good point, but exactly what functionality do you want to see for zip files and URLs? Just pathname manipulation? Or the ability to see whether a file exists and extract it, copy it, etc?
* you have to care about unicode sometimes. rarely enough that none of your tests will ever account for it, but often enough that _some_ users will notice breakage if your code is ever widely distributed.
This is a Python-wide problem. The move to universal unicode will lessen this, or at least move the problem to *one* place (creating the unicode object), where every Python programmer will get bitten by it and we'll develop a few standard strategies to deal with it. (The problem is that if str and unicode are mixed in expressions, Python will promote the str to unicode and you'll get a UnicodeDecodeError if it contains non-ASCII characters. Figuring out all the ways such strings can slip into a program is difficult if you're dealing with user strings from an unknown charset, or your MySQL server is configured differently than you thought it was, or the string contains Windows curly quotes et al which are undefined in Latin-1.)
* the documentation really can't emphasize enough how bad using 'os.path.exists/isfile/isdir', and then assuming the file continues to exist when it is a contended resource, is. It can be handy, but it is _always_ a race condition.
What else can you do? It's either os.path.exists()/os.remove() or "do it anyway and catch the exception". And sometimes you have to check the filetype in order to determine *what* to do. -- Mike Orr <sluggoster@gmail.com>
Mike Orr wrote:
* This is confusing as heck:
os.path.join("hello", "/world") '/world'
It's only confusing if you're not thinking of pathnames as abstract entities. There's a reason for this behaviour -- it's so you can do things like full_path = os.path.join(default_dir, filename_from_user) where filename_from_user can be either a relative or absolute path at his discretion. In other words, os.path.join doesn't just mean "join these two paths together", it means "interpret the second path in the context of the first". Having said that, I can see there could be an element of confusion in calling it "join". -- Greg
Greg Ewing wrote:
Mike Orr wrote:
* This is confusing as heck:
os.path.join("hello", "/world") '/world'
It's only confusing if you're not thinking of pathnames as abstract entities.
There's a reason for this behaviour -- it's so you can do things like
full_path = os.path.join(default_dir, filename_from_user)
where filename_from_user can be either a relative or absolute path at his discretion.
In other words, os.path.join doesn't just mean "join these two paths together", it means "interpret the second path in the context of the first".
Having said that, I can see there could be an element of confusion in calling it "join".
Good point. "relativise" might be appropriate, though something shorter would be better. regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Skype: holdenweb http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden
Steve Holden wrote:
Greg Ewing wrote:
Having said that, I can see there could be an element of confusion in calling it "join".
Good point. "relativise" might be appropriate,
Sounds like something to make my computer go at warp speed, which would be nice, but I won't be expecting a patch any time soon. :-) -- Greg
Steve Holden wrote:
Greg Ewing wrote:
Mike Orr wrote: Having said that, I can see there could be an element of confusion in calling it "join".
Good point. "relativise" might be appropriate, though something shorter would be better.
regards Steve
The term used in many languages for this sort of operation is "combine". (See .Net System.IO.Path for an example.) I kind of like the term - it implies that you are mixing two paths together, but it doesn't imply that the combination will be additive. - Talin
glyph@divmod.com wrote:
os.path.join("hello", "slash/world") 'hello/slash/world' os.path.join("hello", "slash//world") 'hello/slash//world' Trying to formulate a general rule for what the arguments to os.path.join are supposed to be is really hard.
If you're serious about writing platform-agnostic pathname code, you don't put slashes in the arguments at all. Instead you do os.path.join("hello", "slash", "world") Many of the other things you mention are also a result of not treating pathnames as properly opaque objects. If you're saying that the fact they're strings makes it easy to forget that you're supposed to be treating them opaquely, there may be merit in that view. It would be an argument for making path objects a truly opaque type instead of a subclass of string or tuple.
* although individual operations are atomic, shutil.copytree and friends aren't. I've often seen python programs confused by partially-copied trees of files.
I can't see how this can be even remotely regarded as a pathname issue, or even a filesystem interface issue. It's no different to any other situation where a piece of code can fall over and leave a partial result behind. As always, the cure is defensive coding (clean up a partial result on error, or be prepared to tolerate the presence of a previous partial result when re-trying). It could be argued that shutil.copytree should clean up after itself if there is an error, but that might not be what you want -- e.g. you might want to find out how far it got, and maybe carry on from there next time. It's probably better to leave things like that to the caller. -- Greg
glyph:
Path manipulation:
* This is confusing as heck:
os.path.join("hello", "/world") '/world' os.path.join("hello", "slash/world") 'hello/slash/world' os.path.join("hello", "slash//world") 'hello/slash//world' Trying to formulate a general rule for what the arguments to os.path.join are supposed to be is really hard. I can't really figure out what it would be like on a non-POSIX/non-win32 platform.
Made trickier by the similar yet different behaviour of urlparse.urljoin.
import urlparse urlparse.urljoin("hello", "/world") '/world' urlparse.urljoin("hello", "slash/world") 'slash/world' urlparse.urljoin("hello", "slash//world") 'slash//world'
It does not make sense to me that these should be different. Andrew dalke@dalkescientific.com [Apologies to glyph for the dup; mixed up the reply-to. Still getting used to gmail.]
Andrew Dalke wrote:
glyph:
Path manipulation:
* This is confusing as heck:
os.path.join("hello", "/world") '/world' os.path.join("hello", "slash/world") 'hello/slash/world' os.path.join("hello", "slash//world") 'hello/slash//world' Trying to formulate a general rule for what the arguments to os.path.join are supposed to be is really hard. I can't really figure out what it would be like on a non-POSIX/non-win32 platform.
Made trickier by the similar yet different behaviour of urlparse.urljoin.
import urlparse urlparse.urljoin("hello", "/world") '/world' urlparse.urljoin("hello", "slash/world") 'slash/world' urlparse.urljoin("hello", "slash//world") 'slash//world'
It does not make sense to me that these should be different.
Although the last two smell like bugs, the point of urljoin is to make an absolute URL from an absolute ("current page") URL and a relative (link) one. As we see:
urljoin("/hello", "slash/world") '/slash/world'
and
urljoin("http://localhost/hello", "slash/world") 'http://localhost/slash/world'
but
urljoin("http://localhost/hello/", "slash/world") 'http://localhost/hello/slash/world' urljoin("http://localhost/hello/index.html", "slash/world") 'http://localhost/hello/slash/world'
I think we can probably conclude that this is what's supposed to happen. In the case of urljoin the first argument is interpreted as referencing an existing resource and the second as a link such as might appear in that resource. regards Steve -- Steve Holden +44 150 684 7255 +1 800 494 3119 Holden Web LLC/Ltd http://www.holdenweb.com Skype: holdenweb http://holdenweb.blogspot.com Recent Ramblings http://del.icio.us/steve.holden
Steve Holden wrote:
Although the last two smell like bugs, the point of urljoin is to make an absolute URL from an absolute ("current page") URL
also known as a base URL: http://www.w3.org/TR/html4/struct/links.html#h-12.4.1 (os.path.join's behaviour is also well-defined, btw; if any component is an absolute path, all preceding components are ignored.) </F>
Andrew Dalke schrieb:
import urlparse urlparse.urljoin("hello", "/world") '/world' urlparse.urljoin("hello", "slash/world") 'slash/world' urlparse.urljoin("hello", "slash//world") 'slash//world'
It does not make sense to me that these should be different.
Just in case this isn't clear from Steve's and Fredrik's post: The behaviour of this function is (or should be) specified, by an IETF RFC. If somebody finds that non-intuitive, that's likely because their mental model of relative URIs deviate's from the RFC's model. Of course, there is also the chance that the implementation deviates from the RFC; that would be a bug. Regards, Martin
Martin:
Just in case this isn't clear from Steve's and Fredrik's post: The behaviour of this function is (or should be) specified, by an IETF RFC. If somebody finds that non-intuitive, that's likely because their mental model of relative URIs deviate's from the RFC's model.
While I didn't realize that urljoin is only supposed to be used with a base URL, where "base URL" (used in the docstring) has a specific requirement that it be absolute. I instead saw the word "join" and figured it's should do roughly the same things as os.path.join.
import urlparse urlparse.urljoin("file:///path/to/hello", "slash/world") 'file:///path/to/slash/world' urlparse.urljoin("file:///path/to/hello", "/slash/world") 'file:///slash/world' import os os.path.join("/path/to/hello", "slash/world") '/path/to/hello/slash/world'
It does not. My intuition, nowadays highly influenced by URLs, is that with a couple of hypothetical functions for going between filenames and URLs: os.path.join(absolute_filename, filename) == file_url_to_filename(urlparse.urljoin( filename_to_file_url(absolute_filename), filename_to_file_url(filename))) which is not the case. os.join assumes the base is a directory name when used in a join: "inserting '/' as needed" while RFC 1808 says The last segment of the base URL's path (anything following the rightmost slash "/", or the entire path if no slash is present) is removed Is my intuition wrong in thinking those should be the same? I suspect it is. I've been very glad that when I ask for a directory name that I don't need to check that it ends with a "/". Urljoin's behaviour is correct for what it's doing. os.path.join is better for what it's doing. (And about once a year I manually verify the difference because I get unsure.) I think these should not share the "join" in the name. If urljoin is not meant for relative base URLs, should it raise an exception when misused? Hmm, though the RFC algorithm does not have a failure mode and the result may be a relative URL. Consider
urlparse.urljoin("http://blah.com/a/b/c", "..") 'http://blah.com/a/' urlparse.urljoin("http://blah.com/a/b/c", "../") 'http://blah.com/a/' urlparse.urljoin("http://blah.com/a/b/c", "../..") 'http://blah.com/' urlparse.urljoin("http://blah.com/a/b/c", "../../") 'http://blah.com/' urlparse.urljoin("http://blah.com/a/b/c", "../../..") 'http://blah.com/' urlparse.urljoin("http://blah.com/a/b/c", "../../../") 'http://blah.com/../' urlparse.urljoin("http://blah.com/a/b/c", "../../../..") # What?! 'http://blah.com/' urlparse.urljoin("http://blah.com/a/b/c", "../../../../") 'http://blah.com/../../'
Of course, there is also the chance that the implementation deviates from the RFC; that would be a bug.
The comment in urlparse # XXX The stuff below is bogus in various ways... is ever so reassuring. I suspect there's a bug given the previous code. Or I've a bad mental model. ;) Andrew dalke@dalkescientific.com
participants (8)
-
"Martin v. Löwis"
-
Andrew Dalke
-
Fredrik Lundh
-
glyph@divmod.com
-
Greg Ewing
-
Mike Orr
-
Steve Holden
-
Talin