[Python-Dev] Path object design

glyph at divmod.com
Wed Nov 1 22:57:24 CET 2006


On 06:14 pm, fredrik at pythonware.com wrote:
>glyph at divmod.com wrote:
>
>> I assert that it needs a better[1] interface because the current
>> interface can lead to a variety of bugs through idiomatic, apparently
>> correct usage.  All the more because many of those bugs are related to
>> critical errors such as security and data integrity.

>instead of referring to some esoteric knowledge about file systems that
>us non-twisted-using mere mortals may not be evolved enough to under-
>stand,

On the contrary, Twisted users understand even less, because (A) we have demonstrably gotten it wrong on numerous occasions in highly public and embarrassing ways and (B) we already have this class that does it all for us and we can't remember how it works :-).

>maybe you could just make a list of common bugs that may arise
>due to idiomatic use of the existing primitives?

Here are some common gotchas that I can think of off the top of my head.  Not all of these are resolved by Twisted's path class:

Path manipulation:

 * This is confusing as heck:
   >>> os.path.join("hello", "/world")
   '/world'
   >>> os.path.join("hello", "slash/world")
   'hello/slash/world'
   >>> os.path.join("hello", "slash//world")
   'hello/slash//world'
   Trying to formulate a general rule for what the arguments to os.path.join are supposed to be is really hard.  I can't really figure out what it would be like on a non-POSIX/non-win32 platform.

 * it seems like slashes should be more aggressively converted to backslashes on Windows, because it's nearly impossible to do anything with os.sep in the current situation.

 * "C:blah" does not mean what you think it means on Windows.  Regardless of what you think it means, it is not that.  I thought I understood it once as the current process having a current directory on every mapped drive, but then I had to learn about UNC paths of network mapped drives and it stopped making sense again.

 * There are special files on Windows such as "CON" and "NUL" which exist in _every_ directory.  Twisted does get around this by looking at the result of abspath:
   >>> os.path.abspath("c:/foo/bar/nul")
   '\\\\nul'

 * Sometimes a path isn't a path; the zip "paths" in sys.path are a good example.  This is why I'm a big fan of including a polymorphic interface of some kind: this information is *already* being persisted in an ad-hoc and broken way now, so it needs to be represented; it would be good if it were actually represented properly.  URL manipulation-as-path-manipulation is another; the recent perforce use-case mentioned here is a special case of that, I think.

 * paths can have spaces in them and there's no convenient, correct way to quote them if you want to pass them to some gross function like os.system - and a lot of the code that manipulates paths is shell-script-replacement crud which wants to call gross functions like os.system.  Maybe this isn't really the path manipulation code's fault, but it's where people start looking when they want properly quoted path arguments.

 * you have to care about Unicode sometimes: rarely enough that none of your tests will ever account for it, but often enough that _some_ users will notice breakage if your code is ever widely distributed.  This is an even more obscure example, but pygtk always reports pathnames as UTF-8-encoded *byte* strings, regardless of your filesystem encoding.  If you forget to decode/encode them, hilarity ensues.  There's no consistent error reporting (as far as I can tell; I have encountered this rarely) and no real way to detect this until you have an actual insanely-configured system with an insanely-named file on it to test with.  (Polymorphic interfaces might help a *bit* here.  At worst, they would at least make it possible to develop a canonical "insanely encoded filesystem" test-case backend.  At best, you'd absolutely have to work in terms of Unicode all the time, and no implicit encoding issues would leak through to application code.)  Twisted's thing doesn't deal with this at all, and it really should.

 * also *sort* of an encoding issue, although basically only for webservers or other network-accessible paths: thanks to some of these earlier issues as well as %2e%2e, there are effectively multiple ways to spell "..".  Checking for all of them is impossible; you need to use the os.path APIs to determine whether the paths you've got really relate in the ways you think they do.

 * os.pathsep can be, and actually sometimes is, embedded in a path.  (again, more of a general path problem, not really Python's fault)

 * relative path manipulation is difficult.  Ever tried to write the function to iterate two separate trees of files in parallel?  shutil re-implements this twice, completely differently, via recursion, and it's harder to do with a generator (which is what you really want).  You can't really split on os.sep and have it be correct, due to the aforementioned Windows path issues, but that's what everybody does anyway.

 * os.path.split doesn't work anything like str.split.
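
To illustrate the multiple-ways-to-spell-".." item above, here is a minimal sketch, in modern Python 3, of checking containment by resolving the path with the os.path APIs instead of scanning for suspicious substrings.  The name is_inside and the "/srv/www" root are hypothetical, and decoding percent-escapes exactly once is an assumption about how the untrusted path reaches this code.  The check is also only as good as the filesystem staying put: a symlink created after it runs is not covered.

```python
import os
from urllib.parse import unquote


def is_inside(base, untrusted):
    """Return True if `untrusted`, joined onto `base`, stays under `base`.

    Hypothetical helper for illustration only.
    """
    base = os.path.realpath(base)
    # Decode percent-escapes first, so "%2e%2e" becomes "..".
    decoded = unquote(untrusted)
    # realpath collapses "..", doubled slashes and symlinks; note that
    # os.path.join discards `base` entirely if `decoded` is absolute.
    candidate = os.path.realpath(os.path.join(base, decoded))
    try:
        return os.path.commonpath([base, candidate]) == base
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


print(is_inside("/srv/www", "img/logo.png"))
print(is_inside("/srv/www", "%2e%2e/secret"))
print(is_inside("/srv/www", "/etc/passwd"))
```

The first call is accepted and the other two are rejected: one climbs out through an encoded "..", and the other is an absolute path, which is the first os.path.join gotcha in the list.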

FS manipulation:

 * although individual operations are atomic, shutil.copytree and friends aren't.  I've often seen Python programs confused by partially-copied trees of files.  This isn't even really an atomicity issue; it's often due to a traceback in the middle of a running Python program which leaves the tree half-broken.

 * the documentation really can't emphasize enough how bad it is to call os.path.exists/isfile/isdir and then assume the file continues to exist when it is a contended resource.  It can be handy, but it is _always_ a race condition.
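
A minimal sketch of how both FS items might be sidestepped in modern Python 3, assuming source and destination live on the same filesystem.  The function names and the ".partial-" prefix are made up for illustration; neither function is part of shutil or Twisted.

```python
import os
import shutil
import tempfile


def read_if_present(path, default=None):
    """Open first and handle failure, instead of testing os.path.exists."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:
        # The file was absent at the moment of the open; nothing to race.
        return default


def copytree_all_or_nothing(src, dst):
    """Copy `src` into a staging directory next to `dst`, then rename.

    Assumes the parent of `dst` already exists and that `dst` does not.
    """
    parent = os.path.dirname(os.path.abspath(dst))
    staging = tempfile.mkdtemp(dir=parent, prefix=".partial-")
    try:
        target = os.path.join(staging, "tree")
        shutil.copytree(src, target)
        # A traceback during copytree never reaches this line, so `dst`
        # only ever appears as a complete tree.
        os.rename(target, dst)
    finally:
        # Remove the staging directory and any half-copied contents.
        shutil.rmtree(staging, ignore_errors=True)
```

If the copy fails partway, only the staging directory is affected and it is cleaned up, so readers of dst see either nothing or the whole tree.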

>I promise to make a nice FAQ entry out of it, with proper attribution.

Thanks.  The list here is just a brain dump; I'm not sure it's all appropriate for a FAQ, but I hope some of it is useful.