[Python-Dev] Alternative path suggestion

Thu May 4 15:28:29 CEST 2006

on 04.05.2006 14:57 Nick Coghlan said the following:
> Mike Orr wrote:
>> Intriguing idea, Noam, and excellent thinking.  I'd say it's worth a
>> separate PEP.  It's too different to fit into PEP 355, and too big to
>> be summarized in the "Open Issues" section.  Of course, one PEP will
>> be rejected if the other is approved.
> 
> I agree that a competing PEP is probably the best way to track this idea.

<snip>

>>> This means that path objects aren't the string representation of a
>>> path; they are a ''logical'' representation of a path. Remember why a
>>> filesystem path is called a path - because it's a way to get from one
>>> place on the filesystem to another. Paths can be relative, which means
>>> that they don't define from where to start the walk, and can be not
>>> relative, which means that they do. In the tuple representation,
>>> relative paths are simply tuples of strings, and not relative paths
>>> are tuples of strings with a first "root" element.
> 
> I suggest storing the first element separately from the rest of the path. The 
> reason for suggesting this is that you use 'os.sep' to separate elements in 
> the normal path, but *not* to separate the first element from the rest.

I want to add that people might want to manipulate paths that are not 
for the currently running OS. Therefore I think the `sep` should be an 
attribute of the "root" element.
For the same reason I'd like to add two values to the following list:

> Possible values for the path's root element would then be:
> 
>    None ==> relative path
               (uses os.sep)
+    path.UNIXRELATIVE ==> uses '/'
+    path.WINDOWSRELATIVE ==> uses '\\' unconditionally
>    path.ROOT ==> Unix absolute path
>    path.DRIVECWD ==> Windows drive relative path
>    path.DRIVEROOT ==> Windows drive absolute path
>    path.UNCSHARE  ==> UNC path
>    path.URL  ==> URL path
> 
> The last four would have attributes (the two Windows ones to get at the drive 
> letter, the UNC one to get at the share name, and the URL one to get at the 
> text of the URL).
> 
> Similarly, I would separate out the extension to a distinct attribute, as it 
> too uses a different separator from the normal path elements ('.' most places, 
> but '/' on RISC OS, for example)
> 
> The string representation would then be:
> 
>    def __str__(self):
>        return (str(self.root)
>                + os.sep.join(self.path)
>                + os.extsep + self.ext)
> 
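A minimal sketch of the three-piece structure discussed above, assuming the root element carries the separator so that paths for another OS can be built anywhere. The class names `Root` and `DriveRoot`, the `prefix` attribute, and the constructor signatures are hypothetical, not part of either proposal.

```python
import os


class Root:
    """Hypothetical root element; it owns the separator used after it."""

    def __init__(self, prefix, sep=os.sep):
        self.prefix = prefix  # text that starts the path, e.g. '/' or ''
        self.sep = sep        # separator for the remaining elements

    def __str__(self):
        return self.prefix


# Relative roots have an empty prefix and differ only in their separator.
UNIXRELATIVE = Root('', sep='/')
WINDOWSRELATIVE = Root('', sep='\\')
ROOT = Root('/', sep='/')


class DriveRoot(Root):
    """Windows drive absolute root; only this kind has a .drive attribute."""

    def __init__(self, drive):
        Root.__init__(self, drive + ':\\', sep='\\')
        self.drive = drive


class Path:
    """Root, path elements and extension are stored separately."""

    def __init__(self, root, path, ext=None, extsep=os.extsep):
        self.root = root
        self.path = tuple(path)
        self.ext = ext
        self.extsep = extsep

    @property
    def name(self):
        # Final element plus extension; assumes at least one path element.
        last = self.path[-1]
        if self.ext is None:
            return last
        return last + self.extsep + self.ext

    def __str__(self):
        if not self.path:
            return str(self.root)
        elements = self.path[:-1] + (self.name,)
        return str(self.root) + self.root.sep.join(elements)
```

With this layout the string form is only assembled on conversion, and asking a Unix root for `.drive` raises AttributeError as described above.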
>>> The advantage of using a logical representation is that you can forget
>>> about the textual representation, which can be really complex.
> 
> As noted earlier - text is a great format for path related I/O. It's a lousy 
> format for path manipulation.
> 
>>> {{{
>>> p.normpath()  -> Isn't needed - done by the constructor
>>> p.basename()  -> p[-1]
>>> p.splitpath() -> (p[:-1], p[-1])
>>> p.splitunc()  -> (p[0], p[1:]) (if isinstance(p[0], path.UNCRoot))
>>> p.splitall()  -> Isn't needed
>>> p.parent      -> p[:-1]
>>> p.name        -> p[-1]
>>> p.drive       -> p[0] (if isinstance(p[0], path.Drive))
>>> p.uncshare    -> p[0] (if isinstance(p[0], path.UNCRoot))
>>> }}}
> 
> These same operations using separate root and path attributes:
> 
> p.basename()  -> p[-1]
> p.splitpath() -> (p[:-1], p[-1])
> p.splitunc()  -> (p.root, p.path)
> p.splitall()  -> Isn't needed
> p.parent      -> p[:-1]
> p.name        -> p[-1]
> p.drive       -> p.root.drive  (AttributeError if not drive based)
> p.uncshare    -> p.root.share  (AttributeError if not UNC based)
> 
>> That's a big drawback.  PEP 355 can choose between string and
>> non-string, but this way is limited to non-string.  That raises the
>> minor issue of changing the open() functions etc in the standard
>> library, and the major issue of changing them in third-party
>> libraries.
> 
> It's not that big a drama, really. All you need to do is call str() on your 
> path objects when you're done manipulating them. The third party libraries 
> don't need to know how you created your paths, only what you ended up with.
> 
> Alternatively, if the path elements are stored in separate attributes, there's 
> nothing stopping the main object from inheriting from str or unicode the way 
> the PEP 355 path object does.
> 
> Either way, this object would still be far more convenient for manipulating 
> paths than a string based representation that has to deal with OS-specific 
> issues on every operation, rather than only during creation and conversion to 
> a string. The path objects would also serve as an OS-independent 
> representation of filesystem paths.
> 
> In fact, I'd leave most of the low-level APIs working only on strings - the 
> only one I'd change to accept path objects directly is open() (which would be 
> fairly easy, as that's a factory function now).
> 
>>> This means that paths starting with a drive letter alone
>>> (!UnrootedDrive instance, in my module) and paths starting with a
>>> backslash alone (the CURROOT object, in my module) are not relative
>>> and not absolute.
>> I guess that's plausible.  We'll need feedback from Windows users.
> 
> As suggested above, I think the root element should be stored separately from 
> the rest of the path. Then adding a new kind of root element (such as a URL) 
> becomes trivial.
> 
>> The question is, does forcing people to use .stat() expose an
>> implementation detail that should be hidden, and does it smell of
>> Unixism?  Most people think a file *is* a regular file or a directory.
>>  The fact that this is encoded in the file's permission bits -- which
>> stat() examines -- is a quirk of Unix.
> 
> I wouldn't expose stat() - as you say, it's a Unixism. Instead, I'd provide a 
> subclass of Path that used lstat instead of stat for symbolic links.
> 
> So if I want symbolic links followed, I use the normal Path class. This class 
> would generally treat symbolic links as if they were the file pointed to 
> (except for the whole not recursing into symlinked subdirectories thing).
> 
> The SymbolicPath subclass would treat normal files as usual, but *wouldn't* 
> follow symbolic links when stat'ting files (instead, it would stat the symlink).
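A rough sketch of that split, assuming a string-backed `Path` purely for brevity. The `_stat` hook and the `is_link` method are hypothetical names; only the stat versus lstat distinction comes from the text above.

```python
import os
import stat


class Path:
    """Follows symbolic links: a link to a file looks like the file."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    def _stat(self):
        # os.stat() follows links, so a dangling link raises OSError here.
        return os.stat(self._text)

    def is_file(self):
        return stat.S_ISREG(self._stat().st_mode)

    def is_dir(self):
        return stat.S_ISDIR(self._stat().st_mode)

    def is_link(self):
        # Always False for Path, because the link has already been followed.
        return stat.S_ISLNK(self._stat().st_mode)


class SymbolicPath(Path):
    """Same interface, but examines the link itself."""

    def _stat(self):
        return os.lstat(self._text)
```

The caller never touches stat() directly; choosing the class chooses the link behaviour.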
> 
>>> == One Method for Finding Files ==
>>>
>>> (They're actually two, but with exactly the same interface). The
>>> original path object has these methods for finding files:
>>>
>>> {{{
>>> def listdir(self, pattern = None): ...
>>> def dirs(self, pattern = None): ...
>>> def files(self, pattern = None): ...
>>> def walk(self, pattern = None): ...
>>> def walkdirs(self, pattern = None): ...
>>> def walkfiles(self, pattern = None): ...
>>> def glob(self, pattern):
>>> }}}
>>>
>>> I suggest one method that replaces all those:
>>> {{{
>>> def glob(self, pattern='*', topdown=True, onlydirs=False, onlyfiles=False): ...
>>> }}}
> 
> Swiss army methods are even more evil than wide APIs. And I consider the term 
> 'glob' itself to be a Unixism - I've found the technique to be far more 
> commonly known as wildcard matching in the Windows world.
> 
> The path module has those methods for 7 distinct use cases:
>    - list the contents of this directory
>    - list the subdirectories of this directory
>    - list the files in this directory
>    - walk the directory tree rooted at this point, yielding both files and dirs
>    - walk the directory tree rooted at this point, yielding only the dirs
>    - walk the directory tree rooted at this point, yielding only the files
>    - walk this pattern
> 
> The first 3 operations are far more common than the last 4, so they need to stay.
> 
>    def entries(self, pattern=None):
>        """Return list of all entries in directory"""
>        _path = type(self)
>        # os.listdir() returns bare names as strings, so join them to this
>        # directory and wrap them before calling any path methods
>        dirname = str(self)
>        all_entries = [_path(os.path.join(dirname, x))
>                       for x in os.listdir(dirname)]
>        if pattern is not None:
>            return [x for x in all_entries if x.matches(pattern)]
>        return all_entries
> 
>    def subdirs(self, pattern=None):
>        """Return list of all subdirectories in directory"""
>        return [x for x in self.entries(pattern) if x.is_dir()]
> 
>    def files(self, pattern=None):
>        """Return list of all files in directory"""
>        return [x for x in self.entries(pattern) if x.is_file()]
> 
>    # here are sample implementations of the test methods used above
>    def matches(self, pattern):
>        # match against the final path element only
>        return fnmatch.fnmatch(os.path.basename(str(self)), pattern)
>    def is_dir(self):
>        return os.path.isdir(str(self))
>    def is_file(self):
>        return os.path.isfile(str(self))
> 
> For the tree traversal operations, there are still multiple use cases:
> 
>    def walk(self, topdown=True, onerror=None):
>        """ Walk directories and files just as os.walk does"""
>        # Similar to os.walk, only yielding Path objects instead of strings
>        # For each directory, effectively returns:
>        #    yield dirpath, dirpath.subdirs(), dirpath.files()
> 
> 
>    def walkdirs(self, pattern=None, onerror=None):
>        """Only walk directories matching pattern"""
>        for dirpath, subdirs, files in self.walk(onerror=onerror):
>            yield dirpath
>            if pattern is not None:
>                # Modify in-place so that walk() responds to the change
>                subdirs[:] = [x for x in subdirs if x.matches(pattern)]
> 
>    def walkfiles(self, pattern=None, onerror=None):
>        """Only walk file names matching pattern"""
>        for dirpath, subdirs, files in self.walk(onerror=onerror):
>            if pattern is not None:
>                for f in files:
>                    if f.matches(pattern):
>                        yield f
>            else:
>                for f in files:
>                    yield f
> 
>    def walkpattern(self, pattern=None):
>        """Only walk paths matching glob pattern"""
>        _factory = type(self)
>        for pathname in glob.glob(pattern):
>            yield _factory(pathname)
> 
> 
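The walk() body is only described in comments above. One possible way to fill it in, written as a plain function for brevity: `walk_paths` and its `factory` parameter are hypothetical names, and the write-back step is an assumption about how in-place pruning (as used by walkdirs) could keep working once the names are wrapped.

```python
import os


def walk_paths(top, topdown=True, onerror=None, factory=str):
    """Like os.walk, but yields full paths built with `factory`."""
    for dirpath, dirnames, filenames in os.walk(top, topdown=topdown,
                                                onerror=onerror):
        subdirs = [factory(os.path.join(dirpath, d)) for d in dirnames]
        files = [factory(os.path.join(dirpath, f)) for f in filenames]
        yield factory(dirpath), subdirs, files
        if topdown:
            # The caller may have pruned `subdirs` in place; copy that
            # decision back into the list os.walk itself will descend into.
            dirnames[:] = [os.path.basename(str(d)) for d in subdirs]
```

Without the write-back, editing the yielded list would have no effect, because os.walk only consults its own `dirnames` list.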
>>> pattern is the good old glob pattern, with one additional extension:
>>> "**" matches any number of subdirectories, including 0. This means
>>> that '**' means "all the files in a directory", '**/a' means "all the
>>> files in a directory called a", and '**/a*/**/b*' means "all the files
>>> in a directory whose name starts with 'b' and the name of one of their
>>> parent directories starts with 'a'".
>> I like the separate methods, but OK.  I hope it doesn't *really* call
>> glob if the pattern is the default.
> 
> Keep the separate methods. Trying to squeeze too many disparate use cases 
> through a single API is a bad idea. Directory listing and tree-traversal are 
> not the same thing. Path matching and filename matching are not the same thing 
> either.
> 
>> Or one could, gasp, pass a constant or the 'find' command's
>> abbreviation ("d" directory, "f" file, "s" socket, "b" block
>> special...).
> 
> Magic letters in an API are just as bad as magic numbers :)
> 
> More importantly, these things don't port well between systems.
> 
>>> In my proposal:
>>>
>>> {{{
>>> def copy(self, dst, copystat=False): ...
>>> }}}
>>>
>>> It's just that I think that copyfile, copymode and copystat aren't
>>> usually useful, and there's no reason not to unite copy and copy2.
>> Sounds good.
> 
> OK, this is one case where a swiss army method may make sense. Specifically, 
> something like:
> 
>    def copy_to(self, dest, copyfile=True, copymode=True, copytime=False)
> 
> Whether or not to copy the file contents, the permission settings and the last 
> access and modification time are then all independently selectable.
> 
> The different method name also makes the direction of the copying clear (with 
> a bare 'copy', it's slightly ambiguous as the 'cp src dest' parallel isn't as 
> strong as it is with a function).
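A sketch of how those three switches might map onto the existing standard library calls, written as a plain function on strings. The mapping is an assumption; in particular, using os.utime for the time step is one choice among several.

```python
import os
import shutil


def copy_to(src, dest, copyfile=True, copymode=True, copytime=False):
    """Copy contents, permission bits and timestamps independently."""
    if copyfile:
        shutil.copyfile(src, dest)   # file contents only
    if copymode:
        shutil.copymode(src, dest)   # permission bits only
    if copytime:
        st = os.stat(src)
        # last access and last modification time, in that order
        os.utime(dest, (st.st_atime, st.st_mtime))
```

Note that with copyfile=False the destination must already exist, since the mode and time steps operate on an existing file.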
> 
>> I was wondering what the fallout would be of normalizing "a/../b" and
>> "a/./b" and "a//b", but it sounds like you're thinking about it.
> 
> The latter two are OK, but normalizing the first one can give you the wrong 
> answer if 'a' is a symlink (since 'a/../b' is then not necessarily the same as 
> 'b').
> 
> Better to just leave the '..' in and not treat it as something that can be 
> normalised away.
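The symlink problem can be reproduced with the current os.path functions. A small demonstration, assuming a platform where os.symlink is available; the function name and directory layout are made up for the example.

```python
import os
import tempfile


def demo_dotdot():
    """Compare textual and filesystem handling of 'a/../b'."""
    base = os.path.realpath(tempfile.mkdtemp())
    os.makedirs(os.path.join(base, 'real', 'sub'))
    # 'a' is a symlink pointing at real/sub
    os.symlink(os.path.join(base, 'real', 'sub'), os.path.join(base, 'a'))
    # normpath works on the text alone and simply cancels 'a/..'
    textual = os.path.normpath(os.path.join(base, 'a', '..', 'b'))
    # realpath follows the link first, so '..' is the parent of real/sub
    actual = os.path.realpath(os.path.join(base, 'a', '..', 'b'))
    return base, textual, actual
```

Here the textual result names `b` directly under the base directory, while the filesystem resolves the same path to `real/b`.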
> 
>>> I removed the methods associated with file extensions. I don't recall
>>> using them, and since they're purely textual and not OS-dependent, I
>>> think that you can always do p[-1].rsplit('.', 1).
> 
> Most modern OS's use '.' as the extension separator, true, but os.extsep still 
> exists for a reason :)
> 
>> .namebase is an obnoxious name though.  I wish we could come up with
>> something better.
> 
> p.path[-1] :)
> 
> Then p.name can just do (p.path[-1] + os.extsep + p.ext) to rebuild the full 
> filename including the extension (if p.ext was None, then p.name would be the 
> same as p.path[-1])
> 
>>> I removed expand. There's no need to use normpath, so it's equivalent
>>> to .expanduser().expandvars(), and I think that the explicit form is
>>> better.
>> Expand is useful though, so you don't forget one or the other.
> 
> And as you'll usually want to do both, adding about 15 extra characters for no 
> good reason seems like a bad idea. . .
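For reference, a combined helper along these lines is tiny. This is a sketch on plain strings using the existing os.path functions; the function name simply mirrors the method being discussed, and the normpath step of the original is left out as suggested above.

```python
import os


def expand(text):
    """Expand a leading '~' and then any environment variables."""
    return os.path.expandvars(os.path.expanduser(text))
```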
> 
>>> copytree - I removed it. In shutil it's documented as being mostly a
>>> demonstration, and I'm not sure if it's really useful.
>> Er, not sure I've used it, but it seems useful.  Why force people to
>> reinvent the wheel with their own recursive loops that they may get
>> wrong?
> 
> Because the handling of exceptional cases is almost always going to be 
> application specific. Note that even os.walk provides a callback hook for when 
> the call to os.listdir() fails when attempting to descend into a directory.
> 
> For copytree, the issues to be considered are significantly worse:
>    - what to do if listdir fails in the source tree?
>    - what to do if reading a file fails in the source tree?
>    - what to do if a directory doesn't exist in the target tree?
>    - what to do if a directory already exists in the target tree?
>    - what to do if a file already exists in the target tree?
>    - what to do if writing a file fails in the target tree?
>    - should the file contents/mode/time be copied to the target tree?
>    - what to do with symlinks in the source tree?
> 
> Now, what might potentially be genuinely useful is paired walk methods that 
> allowed the following:
> 
>    # Do path.walk over this directory, and also return the corresponding
>    # information for a destination directory (so the dest dir information
>    # probably *won't* match that file system)
>    for src_info, dest_info in src_path.pairedwalk(dest_path):
>        src_dirpath, src_subdirs, src_files = src_info
>        dest_dirpath, dest_subdirs, dest_files = dest_info
>        # Do something useful
> 
>    # Ditto for path.walkdirs
>    for src_dirpath, dest_dirpath in src_path.pairedwalkdirs(dest_path):
>        # Do something useful
> 
>    # Ditto for path.walkfiles
>    for src_file, dest_file in src_path.pairedwalkfiles(dest_path):
>        src_file.copy_to(dest_file)
> 
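One way the paired walk could be sketched on top of os.walk, using plain strings. `paired_walk` is a hypothetical name, and the destination side is computed purely by name mapping, so it never touches the destination file system.

```python
import os


def paired_walk(src, dest):
    """Walk `src`, pairing each entry with its counterpart under `dest`."""
    for src_dir, subdirs, files in os.walk(src):
        rel = os.path.relpath(src_dir, src)
        # The top directory maps to dest itself, not to dest/.
        dest_dir = dest if rel == os.curdir else os.path.join(dest, rel)
        src_info = (src_dir,
                    [os.path.join(src_dir, d) for d in subdirs],
                    [os.path.join(src_dir, f) for f in files])
        dest_info = (dest_dir,
                     [os.path.join(dest_dir, d) for d in subdirs],
                     [os.path.join(dest_dir, f) for f in files])
        yield src_info, dest_info
```

All the application-specific decisions listed above (existing files, failures, symlinks) stay in the caller's loop body.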
>> You've got two issues here.  One is to go to a tuple base and replace
>> several properties with slicing.  The other is all your other proposed
>> changes.  Ideally the PEP would be written in a way that these other
>> changes can be propagated back and forth between the PEPs as consensus
>> builds.
> 
> The main thing Jason's path object has going for it is that it brings together 
> Python's disparate filesystem manipulation APIs into one place. Using it is a 
> definite improvement over using the standard lib directly.
> 
> However, by choosing a string as the internal storage instead of a more 
> appropriate format (such as the three-piece structure I suggest of root, path, 
> extension), it doesn't do as much as it could to abstract away the hassles of 
> os.sep and os.extsep.
> 
> By focusing on the idea that strings are for path input and output operations, 
> rather than for path manipulation, it should be possible to build something 
> even more usable than path.py
> 
> If it was done within the next several months and released on PyPI, it might 
> even be a contender for 2.6.
> 
> Cheers,
> Nick.
>