[Python-Dev] Alternative path suggestion
Stefan Rank
stefan.rank at ofai.at
Thu May 4 15:28:29 CEST 2006
on 04.05.2006 14:57 Nick Coghlan said the following:
> Mike Orr wrote:
>> Intriguing idea, Noam, and excellent thinking. I'd say it's worth a
>> separate PEP. It's too different to fit into PEP 355, and too big to
>> be summarized in the "Open Issues" section. Of course, one PEP will
>> be rejected if the other is approved.
>
> I agree that a competing PEP is probably the best way to track this idea.
<snip>
>>> This means that path objects aren't the string representation of a
>>> path; they are a ''logical'' representation of a path. Remember why a
>>> filesystem path is called a path - because it's a way to get from one
>>> place on the filesystem to another. Paths can be relative, which means
>>> that they don't define from where to start the walk, and can be not
>>> relative, which means that they do. In the tuple representation,
>>> relative paths are simply tuples of strings, and not relative paths
>>> are tuples of strings with a first "root" element.
>
> I suggest storing the first element separately from the rest of the path. The
> reason for suggesting this is that you use 'os.sep' to separate elements in
> the normal path, but *not* to separate the first element from the rest.
I want to add that people might want to manipulate paths that are not
for the currently running OS. Therefore I think the `sep` should be an
attribute of the "root" element.
For the same reason I'd like to add two values to the following list:
> Possible values for the path's root element would then be:
>
> None ==> relative path
(uses os.sep)
+ path.UNIXRELATIVE ==> uses '/'
+ path.WINDOWSRELATIVE ==> uses '\\' unconditionally
> path.ROOT ==> Unix absolute path
> path.DRIVECWD ==> Windows drive relative path
> path.DRIVEROOT ==> Windows drive absolute path
> path.UNCSHARE ==> UNC path
> path.URL ==> URL path
>
> The last four would have attributes (the two Windows ones to get at the drive
> letter, the UNC one to get at the share name, and the URL one to get at the
> text of the URL).
>
> Similarly, I would separate out the extension to a distinct attribute, as it
> too uses a different separator from the normal path elements ('.' most places,
> but '/' on RISC OS, for example)
>
> The string representation would then be:
>
>     def __str__(self):
>         return (str(self.root)
>                 + os.sep.join(self.path)
>                 + os.extsep + self.ext)
>
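A minimal sketch of how a root element could carry its own separator, so that paths for a foreign OS can be built and printed. The class names `Root`, `DriveRoot` and `Path`, the `prefix` attribute, and the module-level constants are hypothetical and only loosely mirror the list above; this is written for current Python 3 and is not the API of any existing module.

```python
import os


class Root:
    """Hypothetical root element that knows its own separator."""

    def __init__(self, prefix, sep):
        self.prefix = prefix  # text emitted before the first path element
        self.sep = sep        # separator used between path elements


# Illustrative roots (assumed names, not an existing API).
RELATIVE = Root("", os.sep)        # relative path, native separator
UNIXRELATIVE = Root("", "/")       # relative path, always '/'
WINDOWSRELATIVE = Root("", "\\")   # relative path, always backslash
ROOT = Root("/", "/")              # Unix absolute path


class DriveRoot(Root):
    """Windows drive absolute root; exposes the drive letter."""

    def __init__(self, drive):
        super().__init__(drive + ":\\", "\\")
        self.drive = drive


class Path:
    """Three-piece path: root, tuple of elements, optional extension."""

    def __init__(self, root, path, ext=None, extsep="."):
        self.root = root
        self.path = tuple(path)
        self.ext = ext
        self.extsep = extsep

    def __str__(self):
        # The separator comes from the root, not from the running OS.
        text = self.root.prefix + self.root.sep.join(self.path)
        if self.ext is not None:
            text += self.extsep + self.ext
        return text
```

With this layout, `str(Path(DriveRoot("C"), ["temp", "notes"], ext="txt"))` produces a Windows-style string even when run on Unix, because nothing in `__str__` consults `os.sep`.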
>>> The advantage of using a logical representation is that you can forget
>>> about the textual representation, which can be really complex.
>
> As noted earlier - text is a great format for path related I/O. It's a lousy
> format for path manipulation.
>
>>> {{{
>>> p.normpath() -> Isn't needed - done by the constructor
>>> p.basename() -> p[-1]
>>> p.splitpath() -> (p[:-1], p[-1])
>>> p.splitunc() -> (p[0], p[1:]) (if isinstance(p[0], path.UNCRoot))
>>> p.splitall() -> Isn't needed
>>> p.parent -> p[:-1]
>>> p.name -> p[-1]
>>> p.drive -> p[0] (if isinstance(p[0], path.Drive))
>>> p.uncshare -> p[0] (if isinstance(p[0], path.UNCRoot))
>>> }}}
>
> These same operations using separate root and path attributes:
>
> p.basename() -> p[-1]
> p.splitpath() -> (p[:-1], p[-1])
> p.splitunc() -> (p.root, p.path)
> p.splitall() -> Isn't needed
> p.parent -> p[:-1]
> p.name -> p[-1]
> p.drive -> p.root.drive (AttributeError if not drive based)
> p.uncshare -> p.root.share (AttributeError if not UNC based)
>
>> That's a big drawback. PEP 355 can choose between string and
>> non-string, but this way is limited to non-string. That raises the
>> minor issue of changing the open() functions etc in the standard
>> library, and the major issue of changing them in third-party
>> libraries.
>
> It's not that big a drama, really. All you need to do is call str() on your
> path objects when you're done manipulating them. The third party libraries
> don't need to know how you created your paths, only what you ended up with.
>
> Alternatively, if the path elements are stored in separate attributes, there's
> nothing stopping the main object from inheriting from str or unicode the way
> the PEP 355 path object does.
>
> Either way, this object would still be far more convenient for manipulating
> paths than a string based representation that has to deal with OS-specific
> issues on every operation, rather than only during creation and conversion to
> a string. The path objects would also serve as an OS-independent
> representation of filesystem paths.
>
> In fact, I'd leave most of the low-level API's working only on strings - the
> only one I'd change to accept path objects directly is open() (which would be
> fairly easy, as that's a factory function now).
>
>>> This means that paths starting with a drive letter alone
>>> (!UnrootedDrive instance, in my module) and paths starting with a
>>> backslash alone (the CURROOT object, in my module) are not relative
>>> and not absolute.
>> I guess that's plausible. We'll need feedback from Windows users.
>
> As suggested above, I think the root element should be stored separately from
> the rest of the path. Then adding a new kind of root element (such as a URL)
> becomes trivial.
>
>> The question is, does forcing people to use .stat() expose an
>> implementation detail that should be hidden, and does it smell of
>> Unixism? Most people think a file *is* a regular file or a directory.
>> The fact that this is encoded in the file's permission bits -- which
>> stat() examines -- is a quirk of Unix.
>
> I wouldn't expose stat() - as you say, it's a Unixism. Instead, I'd provide a
> subclass of Path that used lstat instead of stat for symbolic links.
>
> So if I want symbolic links followed, I use the normal Path class. This class
> would just generally treat symbolic links as if they were the file pointed to
> (except for the whole not recursing into symlinked subdirectories thing).
>
> The SymbolicPath subclass would treat normal files as usual, but *wouldn't*
> follow symbolic links when stat'ting files (instead, it would stat the symlink).
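A small sketch of the stat/lstat distinction behind the two classes, using plain functions on strings rather than the proposed Path and SymbolicPath classes. The function names and the `follow_symlinks` parameter are assumptions made for illustration; `os.stat`, `os.lstat` and the `stat` module helpers are the real standard library calls.

```python
import os
import stat


def is_dir(path, follow_symlinks=True):
    """Return True if path is a directory.

    With follow_symlinks=False the link itself is examined (lstat),
    so a symlink pointing at a directory is not reported as one.
    """
    try:
        st = os.stat(path) if follow_symlinks else os.lstat(path)
    except OSError:
        return False  # missing file or dangling link
    return stat.S_ISDIR(st.st_mode)


def is_link(path):
    """Return True if path itself is a symbolic link."""
    try:
        return stat.S_ISLNK(os.lstat(path).st_mode)
    except OSError:
        return False
```

A Path-like class would call the first form, and a SymbolicPath-like subclass the second, without either exposing stat() to the caller.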
>
>>> == One Method for Finding Files ==
>>>
>>> (They're actually two, but with exactly the same interface). The
>>> original path object has these methods for finding files:
>>>
>>> {{{
>>> def listdir(self, pattern = None): ...
>>> def dirs(self, pattern = None): ...
>>> def files(self, pattern = None): ...
>>> def walk(self, pattern = None): ...
>>> def walkdirs(self, pattern = None): ...
>>> def walkfiles(self, pattern = None): ...
>>> def glob(self, pattern):
>>> }}}
>>>
>>> I suggest one method that replaces all those:
>>> {{{
>>> def glob(self, pattern='*', topdown=True, onlydirs=False, onlyfiles=False): ...
>>> }}}
>
> Swiss army methods are even more evil than wide APIs. And I consider the term
> 'glob' itself to be a Unixism - I've found the technique to be far more
> commonly known as wildcard matching in the Windows world.
>
> The path module has those methods for 7 distinct use cases:
> - list the contents of this directory
> - list the subdirectories of this directory
> - list the files in this directory
> - walk the directory tree rooted at this point, yielding both files and dirs
> - walk the directory tree rooted at this point, yielding only the dirs
> - walk the directory tree rooted at this point, yielding only the files
> - walk this pattern
>
> The first 3 operations are far more common than the last 4, so they need to stay.
>
>     def entries(self, pattern=None):
>         """Return list of all entries in directory"""
>         _path = type(self)
>         # Wrap each name first, so matches() is called on a path object
>         all_entries = [_path(x) for x in os.listdir(str(self))]
>         if pattern is not None:
>             return [x for x in all_entries if x.matches(pattern)]
>         return all_entries
>
>     def subdirs(self, pattern=None):
>         """Return list of all subdirectories in directory"""
>         return [x for x in self.entries(pattern) if x.is_dir()]
>
>     def files(self, pattern=None):
>         """Return list of all files in directory"""
>         return [x for x in self.entries(pattern) if x.is_file()]
>
>     # here are sample implementations of the test methods used above
>     def matches(self, pattern):
>         return fnmatch.fnmatch(str(self), pattern)
>     def is_dir(self):
>         return os.path.isdir(str(self))
>     def is_file(self):
>         return os.path.isfile(str(self))
>
> For the tree traversal operations, there are still multiple use cases:
>
>     def walk(self, topdown=True, onerror=None):
>         """Walk directories and files just as os.walk does"""
>         # Similar to os.walk, only yielding Path objects instead of strings
>         # For each directory, effectively returns:
>         #   yield dirpath, dirpath.subdirs(), dirpath.files()
>
>
>     def walkdirs(self, pattern=None, onerror=None):
>         """Only walk directories matching pattern"""
>         for dirpath, subdirs, files in self.walk(onerror=onerror):
>             yield dirpath
>             if pattern is not None:
>                 # Modify in-place so that walk() responds to the change
>                 subdirs[:] = [x for x in subdirs if x.matches(pattern)]
>
>     def walkfiles(self, pattern=None, onerror=None):
>         """Only walk file names matching pattern"""
>         for dirpath, subdirs, files in self.walk(onerror=onerror):
>             if pattern is not None:
>                 for f in files:
>                     if f.matches(pattern):
>                         yield f
>             else:
>                 for f in files:
>                     yield f
>
>     def walkpattern(self, pattern):
>         """Only walk paths matching glob pattern"""
>         _factory = type(self)
>         for pathname in glob.glob(pattern):
>             yield _factory(pathname)
>
>
>>> pattern is the good old glob pattern, with one additional extension:
>>> "**" matches any number of subdirectories, including 0. This means
>>> that '**' means "all the files in a directory", '**/a' means "all the
>>> files in a directory called a", and '**/a*/**/b*' means "all the files
>>> in a directory whose name starts with 'b' and the name of one of their
>>> parent directories starts with 'a'".
>> I like the separate methods, but OK. I hope it doesn't *really* call
>> glob if the pattern is the default.
>
> Keep the separate methods. Trying to squeeze too many disparate use cases
> through a single API is a bad idea. Directory listing and tree-traversal are
> not the same thing. Path matching and filename matching are not the same thing
> either.
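To illustrate the difference between filename matching and path matching, here is a rough sketch on plain strings using only the standard library. The helper names `walkfiles` and `match_paths` are chosen for this example and are not the methods of any path class.

```python
import fnmatch
import glob
import os


def walkfiles(top, pattern=None):
    """Filename matching: test each file *name* anywhere in the tree."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if pattern is None or fnmatch.fnmatchcase(name, pattern):
                yield os.path.join(dirpath, name)


def match_paths(top, pattern):
    """Path matching: the pattern describes whole paths below top,
    one directory level per separator."""
    return sorted(glob.glob(os.path.join(top, pattern)))
```

For a tree containing a.txt and sub/b.txt, `walkfiles(top, "*.txt")` finds both files, while `match_paths(top, "*.txt")` finds only a.txt and `match_paths(top, "*/*.txt")` finds only sub/b.txt.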
>
>> Or one could, gasp, pass a constant or the 'find' command's
>> abbreviation ("d" directory, "f" file, "s" socket, "b" block
>> special...).
>
> Magic letters in an API are just as bad as magic numbers :)
>
> More importantly, these things don't port well between systems.
>
>>> In my proposal:
>>>
>>> {{{
>>> def copy(self, dst, copystat=False): ...
>>> }}}
>>>
>>> It's just that I think that copyfile, copymode and copystat aren't
>>> usually useful, and there's no reason not to unite copy and copy2.
>> Sounds good.
>
> OK, this is one case where a swiss army method may make sense. Specifically,
> something like:
>
>     def copy_to(self, dest, copyfile=True, copymode=True, copytime=False): ...
>
> Whether or not to copy the file contents, the permission settings and the last
> access and modification time are then all independently selectable.
>
> The different method name also makes the direction of the copying clear (with
> a bare 'copy', it's slightly ambiguous as the 'cp src dest' parallel isn't as
> strong as it is with a function).
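One possible reading of those three flags, sketched as a plain function on strings with the standard library. The mapping of flags to `shutil.copyfile`, `shutil.copymode` and `os.utime` is an assumption about the intended behaviour, not something the quoted proposal specifies.

```python
import os
import shutil


def copy_to(src, dest, copyfile=True, copymode=True, copytime=False):
    """Copy selected aspects of src to dest, each independently.

    Note: with copyfile=False, dest must already exist, since the
    mode and time steps only update an existing file.
    """
    if copyfile:
        shutil.copyfile(src, dest)   # file contents only
    if copymode:
        shutil.copymode(src, dest)   # permission bits only
    if copytime:
        st = os.stat(src)
        # last access and modification times, nanosecond precision
        os.utime(dest, ns=(st.st_atime_ns, st.st_mtime_ns))
```

With the defaults this behaves roughly like shutil.copy; passing copytime=True brings it close to shutil.copy2, minus any extra metadata that copystat would carry over.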
>
>> I was wondering what the fallout would be of normalizing "a/../b" and
>> "a/./b" and "a//b", but it sounds like you're thinking about it.
>
> The latter two are OK, but normalizing the first one can give you the wrong
> answer if 'a' is a symlink (since 'a/../b' is then not necessarily the same as
> 'b').
>
> Better to just leave the '..' in and not treat it as something that can be
> normalised away.
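A short sketch of a normalisation that removes only the safe cases ('.' and empty components) and leaves '..' alone. It assumes POSIX-style text with '/' as the separator; the function name is made up for this example.

```python
import posixpath


def normalize_keep_dotdot(text):
    """Drop empty and '.' components from a '/'-separated path.

    '..' is kept, because collapsing it is only correct when the
    preceding component is not a symlink.
    """
    is_abs = text.startswith("/")
    parts = [p for p in text.split("/") if p not in ("", ".")]
    result = "/".join(parts)
    if is_abs:
        return "/" + result
    return result or "."


# For comparison, the purely textual stdlib normalisation collapses '..':
collapsed = posixpath.normpath("a/../b")
```

Here `normalize_keep_dotdot("a/../b")` stays "a/../b", whereas `posixpath.normpath` turns the same input into "b" without looking at the filesystem.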
>
>>> I removed the methods associated with file extensions. I don't recall
>>> using them, and since they're purely textual and not OS-dependent, I
>>> think that you can always do p[-1].rsplit('.', 1).
>
> Most modern OS's use '.' as the extension separator, true, but os.extsep still
> exists for a reason :)
>
>> .namebase is an obnoxious name though. I wish we could come up with
>> something better.
>
> p.path[-1] :)
>
> Then p.name can just do (p.path[-1] + os.extsep + p.ext) to rebuild the full
> filename including the extension (if p.ext was None, then p.name would be the
> same as p.path[-1])
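A rough sketch of splitting a file name into base and extension and rebuilding it, with the separator passed in rather than hard-coded. The helper names are hypothetical; treating a leading-dot name such as ".bashrc" as having no extension is an assumption that matches what os.path.splitext does.

```python
import os


def split_name(filename, extsep=os.extsep):
    """Split 'base<extsep>ext' at the last separator.

    Returns (filename, None) when there is no extension, including
    names that merely start with the separator.
    """
    base, sep, ext = filename.rpartition(extsep)
    if not sep or not base:
        return filename, None
    return base, ext


def join_name(base, ext, extsep=os.extsep):
    """Rebuild the full file name; ext=None means no extension."""
    return base if ext is None else base + extsep + ext
```

Splitting and joining round-trip, so `join_name(*split_name(name))` gives back the original name for the cases handled here.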
>
>>> I removed expand. There's no need to use normpath, so it's equivalent
>>> to .expanduser().expandvars(), and I think that the explicit form is
>>> better.
>> Expand is useful though, so you don't forget one or the other.
>
> And as you'll usually want to do both, adding about 15 extra characters for no
> good reason seems like a bad idea. . .
>
>>> copytree - I removed it. In shutil it's documented as being mostly a
>>> demonstration, and I'm not sure if it's really useful.
>> Er, not sure I've used it, but it seems useful. Why force people to
>> reinvent the wheel with their own recursive loops that they may get
>> wrong?
>
> Because the handling of exceptional cases is almost always going to be
> application specific. Note that even os.walk provides a callback hook for if
> the call to os.listdir() fails when attempting to descend into a directory.
>
> For copytree, the issues to be considered are significantly worse:
> - what to do if listdir fails in the source tree?
> - what to do if reading a file fails in the source tree?
> - what to do if a directory doesn't exist in the target tree?
> - what to do if a directory already exists in the target tree?
> - what to do if a file already exists in the target tree?
> - what to do if writing a file fails in the target tree?
> - should the file contents/mode/time be copied to the target tree?
> - what to do with symlinks in the source tree?
>
> Now, what might potentially be genuinely useful is paired walk methods that
> allowed the following:
>
>     # Do path.walk over this directory, and also return the corresponding
>     # information for a destination directory (so the dest dir information
>     # probably *won't* match that file system)
>     for src_info, dest_info in src_path.pairedwalk(dest_path):
>         src_dirpath, src_subdirs, src_files = src_info
>         dest_dirpath, dest_subdirs, dest_files = dest_info
>         # Do something useful
>
>     # Ditto for path.walkdirs
>     for src_dirpath, dest_dirpath in src_path.pairedwalkdirs(dest_path):
>         # Do something useful
>
>     # Ditto for path.walkfiles
>     for src_file, dest_file in src_path.pairedwalkfiles(dest_path):
>         src_file.copy_to(dest_file)
>
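A simplified sketch of the paired walk idea on plain strings, built on os.walk. Unlike the quoted outline, the destination side here is only the corresponding directory path, not a full (dirpath, subdirs, files) triple; the function names follow the quoted ones but the signatures are assumptions.

```python
import os


def pairedwalk(src, dest):
    """Walk src and yield each os.walk triple together with the
    directory path it corresponds to under dest.

    The destination tree is never read, so it need not exist.
    """
    for src_dir, subdirs, files in os.walk(src):
        rel = os.path.relpath(src_dir, src)
        dest_dir = dest if rel == os.curdir else os.path.join(dest, rel)
        yield (src_dir, subdirs, files), dest_dir


def pairedwalkfiles(src, dest):
    """Yield (source file, destination file) pairs for the whole tree."""
    for (src_dir, _subdirs, files), dest_dir in pairedwalk(src, dest):
        for name in files:
            yield os.path.join(src_dir, name), os.path.join(dest_dir, name)
```

The caller still decides what to do for each pair (create directories, skip existing files, handle errors), which keeps the application-specific policy out of the traversal itself.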
>> You've got two issues here. One is to go to a tuple base and replace
>> several properties with slicing. The other is all your other proposed
>> changes. Ideally the PEP would be written in a way that these other
>> changes can be propagated back and forth between the PEPs as consensus
>> builds.
>
> The main thing Jason's path object has going for it is that it brings together
> Python's disparate filesystem manipulation API's into one place. Using it is a
> definite improvement over using the standard lib directly.
>
> However, because it uses a string as the internal storage instead of a more
> appropriate format (such as the three-piece structure I suggest of root, path,
> extension), it doesn't do as much as it could to abstract away the hassles of
> os.sep and os.extsep.
>
> By focusing on the idea that strings are for path input and output operations,
> rather than for path manipulation, it should be possible to build something
> even more usable than path.py
>
> If it was done within the next several months and released on PyPI, it might
> even be a contender for 2.6.
>
> Cheers,
> Nick.
>