[Python-Dev] Alternative path suggestion

Wed May 3 01:02:14 CEST 2006

Hello,

I saw the discussion about including the path type in the standard
library. As it turned out, I recently wrote a program which does quite
a lot of path manipulation. This caused me to think that the proposed
path module:

 * Makes path manipulation significantly easier
 * Can be improved.

So I tried to write my version of it. My basic problem with the
current proposed path module is that it's a bit... messy. It contains
a lot of methods, collected from various modules, and for me it looks
too crowded - there are too many methods and too many details for me
to easily learn.

So I tried to organize it all. I think that the result may make file
and path manipulation much easier.

Here are my ideas. It's a copy of what I posted a few minutes ago in
the wiki - you can view it at
http://wiki.python.org/moin/AlternativePathClass (it looks better
there).

You can find the implementation at
http://wiki.python.org/moin/AlternativePathModule?action=raw
(By the way, is there some "code wiki" available? It can simply be a
public svn repository. I think it will be useful for those things.)

All these are ideas - I would like to hear what you think about them.

= Major Changes =

== a tuple instead of a string ==

The biggest conceptual change is that my path object is a subclass of
''tuple'', not a subclass of str. For example,
{{{
>>> tuple(path('a/b/c'))
('a', 'b', 'c')
>>> tuple(path('/a/b/c'))
(path.ROOT, 'a', 'b', 'c')
}}}

This means that path objects aren't the string representation of a
path; they are a ''logical'' representation of a path. Remember why a
filesystem path is called a path - because it's a way to get from one
place on the filesystem to another. Paths can be relative, which means
that they don't define from where to start the walk, and can be not
relative, which means that they do. In the tuple representation,
relative paths are simply tuples of strings, and not relative paths
are tuples of strings with a first "root" element.

The advantage of using a logical representation is that you can forget
about the textual representation, which can be really complex. You
don't have to call normpath when you're unsure about how a path looks,
you don't have to search for seps and altseps, and... you don't need
to remember a lot of names of functions or methods. To show that, take
a look at those methods from the original path class and their
equivalent in my path class:

{{{
p.normpath()  -> Isn't needed - done by the constructor
p.basename()  -> p[-1]
p.splitpath() -> (p[:-1], p[-1])
p.splitunc()  -> (p[0], p[1:]) (if isinstance(p[0], path.UNCRoot))
p.splitall()  -> Isn't needed
p.parent      -> p[:-1]
p.name        -> p[-1]
p.drive       -> p[0] (if isinstance(p[0], path.Drive))
p.uncshare    -> p[0] (if isinstance(p[0], path.UNCRoot))

and of course:
p.join(q) [or anything like it] -> p + q
}}}
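
To make the idea concrete, here is a minimal sketch of a tuple-based path. It is not the implementation linked above: the names TuplePath and ROOT are stand-ins invented for this illustration, and only POSIX-style '/' separators are handled (no drives or UNC roots).

```python
class _Root:
    """Marker for the POSIX root element (a stand-in for path.ROOT)."""

    def __repr__(self):
        return "ROOT"


ROOT = _Root()


class TuplePath(tuple):
    """Hypothetical tuple-based path; handles '/' separators only."""

    def __new__(cls, value=()):
        if isinstance(value, str):
            # Normalisation happens here: drop empty and '.' components.
            # '..' is kept on purpose, since removing it is not always safe.
            parts = [s for s in value.split("/") if s not in ("", ".")]
            if value.startswith("/"):
                parts.insert(0, ROOT)
            value = parts
        return super().__new__(cls, value)

    @property
    def isrel(self):
        # A relative path has no root element in front.
        return not (self and self[0] is ROOT)

    def __getitem__(self, index):
        result = super().__getitem__(index)
        # Slices stay paths; single elements stay plain names.
        return TuplePath(result) if isinstance(index, slice) else result

    def __add__(self, other):
        other = TuplePath(other)
        if not other.isrel:
            raise ValueError("can only append a relative path")
        return TuplePath(tuple(self) + tuple(other))

    def __str__(self):
        if self and self[0] is ROOT:
            return "/" + "/".join(self[1:])
        return "/".join(self) or "."
```

With this sketch, TuplePath('/a/b/c')[-1] plays the role of basename, TuplePath('/a/b/c')[:-1] plays the role of parent, and p + q plays the role of join.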

The only drawback I can see in using a logical representation is that
giving a path object to functions which expect a path string won't
work. The immediate solution is to simply use str(p) instead of p. The
long-term solution is to make all related functions accept a path
object.

Having a logical representation of a path calls for clarifying some
terms. What's an absolute path? On POSIX, it's very simple: a
path starting with a '/'. But what about Windows? Is "\temp\file" an
absolute path? I claim that it isn't really. The reason is that if you
change the current working directory, its meaning changes: It's now
not "c:\temp\file", but "a:\temp\file". The same goes for
"c:temp\file". So I decided on these two definitions:

 * A ''relative path'' is a path without a root element, so it can be
concatenated to other paths.
 * An ''absolute path'' is a path whose meaning doesn't change when
the current working directory changes.

This means that paths starting with a drive letter alone
(!UnrootedDrive instance, in my module) and paths starting with a
backslash alone (the CURROOT object, in my module) are not relative
and not absolute.
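
The three cases can be told apart from the string form alone. The following is a rough sketch using the standard ntpath.splitdrive; the function name classify is made up for this illustration and is not part of the proposed module. It does not treat a bare UNC share such as \\host\share specially.

```python
import ntpath


def classify(text):
    """Classify a Windows path string as 'relative', 'absolute' or 'neither'."""
    drive, rest = ntpath.splitdrive(text)
    rooted = rest.startswith(("\\", "/"))
    if drive and rooted:
        return "absolute"  # for example c:\temp\file
    if not drive and not rooted:
        return "relative"  # for example temp\file
    return "neither"       # for example \temp\file or c:temp\file
```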

I really think that it's a better way to handle paths. If you want an
example, compare the current implementation of relpathto and my
implementation.

== Easier attributes for stat objects ==

The current path object includes:
 * isdir, isfile, islink, and -
 * atime, mtime, ctime, size.
The first line does file mode checking, and the second simply gives
attributes from the stat object.

I suggest that these should be added to the stat_result object. isdir,
isfile and islink are true if a specific bit in st_mode is set, and
atime, mtime, ctime and size are simply other names for st_atime,
st_mtime, st_ctime and st_size.

It means that instead of using the atime, mtime etc. methods, you will
write {{{ p.stat().atime }}}, {{{ p.stat().size }}}, etc.

This is good, because:
 * If you want to make only one system call, it's very easy to save
the stat object and use it.
 * If you have to deal with symbolic links, you can simply use {{{
p.lstat().mtime }}}. Yes, symbolic links have a modification time. The
alternative is to add three methods with ugly names (latime, lmtime,
lctime) or to have an incomplete interface without a good reason.
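
As an illustration of what such attributes could look like, here is a small sketch that wraps an existing stat result instead of changing stat_result itself. The class name StatView is an assumption made for this example; only os.stat, os.lstat and the stat module are real.

```python
import os
import stat


class StatView:
    """Hypothetical wrapper giving a stat result the suggested short names."""

    def __init__(self, st):
        self._st = st

    @property
    def isdir(self):
        return stat.S_ISDIR(self._st.st_mode)

    @property
    def isfile(self):
        return stat.S_ISREG(self._st.st_mode)

    @property
    def islink(self):
        # Only ever true for results of os.lstat(), which does not follow links.
        return stat.S_ISLNK(self._st.st_mode)

    @property
    def atime(self):
        return self._st.st_atime

    @property
    def mtime(self):
        return self._st.st_mtime

    @property
    def ctime(self):
        return self._st.st_ctime

    @property
    def size(self):
        return self._st.st_size
```

One system call then serves several questions: st = StatView(os.stat(name)) followed by st.isdir, st.size and st.mtime. Passing os.lstat(name) instead gives the same attributes for the link itself.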

I think that isfile, isdir should be kept (along with lisfile,
lisdir), since I think that doing what they do is quite common, and
requires six lines:
{{{
try:
    st = p.stat()
except OSError:
    return False
else:
    return st.isdir
}}}

Still, I think that isdir, isfile and islink should be added to
stat_result objects: They turned out pretty useful in writing some of
the more complex path methods.

== One Method for Finding Files ==

(There are actually two, but with exactly the same interface.) The
original path object has these methods for finding files:

{{{
def listdir(self, pattern = None): ...
def dirs(self, pattern = None): ...
def files(self, pattern = None): ...
def walk(self, pattern = None): ...
def walkdirs(self, pattern = None): ...
def walkfiles(self, pattern = None): ...
def glob(self, pattern): ...
}}}

I suggest one method that replaces all those:
{{{
def glob(self, pattern='*', topdown=True, onlydirs=False, onlyfiles=False): ...
}}}

pattern is the good old glob pattern, with one additional extension:
"**" matches any number of subdirectories, including 0. This means
that '**' means "all the files in a directory", '**/a' means "all the
files in a directory called a", and '**/a*/**/b*' means "all the files
in a directory whose name starts with 'b' and the name of one of their
parent directories starts with 'a'".

onlydirs and onlyfiles filter the results (they can't be combined, of
course). topdown has the same meaning as in os.walk (it isn't
supported by the original path class). So, let's show how these
methods can be replaced:

{{{
p.listdir()   -> p.glob()
p.dirs()      -> p.glob(onlydirs=1)
p.files()     -> p.glob(onlyfiles=1)
p.walk()      -> p.glob('**')
p.walkdirs()  -> p.glob('**', onlydirs=1)
p.walkfiles() -> p.glob('**', onlyfiles=1)
p.glob(patt)  -> p.glob(patt)
}}}
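
To show how a single function can cover all of these, here is a simplified sketch working on plain strings. It is not the code from the wiki: glob_sketch and its helpers are names invented here, topdown is left out, symbolic links to directories are not descended into, and names starting with a dot get no special treatment.

```python
import fnmatch
import os


def glob_sketch(base, pattern="*", onlydirs=False, onlyfiles=False):
    """Yield paths under base matching pattern; '**' spans zero or more dirs."""
    if onlydirs and onlyfiles:
        raise ValueError("onlydirs and onlyfiles cannot be combined")
    seen = set()  # several '**' segments can reach the same path twice
    for found in _expand(base, pattern.split("/"), onlydirs, onlyfiles):
        if found not in seen:
            seen.add(found)
            yield found


def _wanted(full, onlydirs, onlyfiles):
    if onlydirs:
        return os.path.isdir(full)
    if onlyfiles:
        return os.path.isfile(full)
    return True


def _expand(directory, parts, onlydirs, onlyfiles):
    head, rest = parts[0], parts[1:]
    if head == "**" and rest:
        # '**' standing for zero directories: continue with the rest here.
        yield from _expand(directory, rest, onlydirs, onlyfiles)
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        descend = os.path.isdir(full) and not os.path.islink(full)
        if head == "**":
            if not rest and _wanted(full, onlydirs, onlyfiles):
                yield full
            if descend:
                # Keep '**' in front so it can absorb further levels.
                yield from _expand(full, parts, onlydirs, onlyfiles)
        elif fnmatch.fnmatch(name, head):
            if rest:
                if descend:
                    yield from _expand(full, rest, onlydirs, onlyfiles)
            elif _wanted(full, onlydirs, onlyfiles):
                yield full
```

For example, glob_sketch(d) corresponds to listdir, glob_sketch(d, '**', onlyfiles=True) to walkfiles, and glob_sketch(d, '**/*.txt') to walkfiles('*.txt').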

Now, for the promised additional method. The current implementation of
glob doesn't follow symbolic links. In my implementation, there's
"lglob", which does what the current glob does. However, the (default)
glob does follow symbolic links. To avoid infinite recursion, it keeps
the set of filesystem ids on the current path, and checks each dir to
see if it was already encountered. (It does so only if there's '**' in
the pattern, because otherwise a finite number of results is
guaranteed.) Note that it doesn't keep the ids of all the files
traversed, only those on the path from the base node to the current
node. This means that as long as there are no cycles, everything will
go fine - for example, 'a' and 'b' pointing at the same dir will just
cause the same files to be reported twice, once under 'a' and once
under 'b'. One last note: On Windows there are no file ids, but there
are no symbolic links, so everything is fine.
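
The cycle check described above can be sketched as follows. This is an illustration only, not the wiki implementation: the function name is invented, and (st_dev, st_ino) is assumed to be a usable filesystem id, which holds on POSIX systems.

```python
import os


def walk_following_links(directory, _ancestors=frozenset()):
    """Yield every entry below directory, following symlinks to directories."""
    st = os.stat(directory)  # os.stat() follows symbolic links
    key = (st.st_dev, st.st_ino)
    if key in _ancestors:
        return  # already on the path from the base to here: a cycle
    # Only the ids on the current path are kept, not every directory visited.
    ancestors = _ancestors | {key}
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        yield full
        if os.path.isdir(full):  # os.path.isdir() follows symlinks too
            yield from walk_following_links(full, ancestors)
```

Because the id set is passed down and never shared between siblings, two links to the same directory are both walked, while a link back to an ancestor is reported once and not entered.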

Oh, and it returns an iterator, not a list.

== Separation of Calculations and System Calls ==

I like to know when I'm using system calls and when I'm not. It turns
out that using tuples instead of strings makes it possible to define
all operations which do not use system calls as properties or
operators, and all operations which do use system calls as methods.

The only exception currently is .match(). What can I do?

== Reduce the Number of Methods ==

I think that the number of methods should be reduced. The most obvious
example is the copy functions. In the current proposal:

{{{
def copyfile(self, dst): ...
def copymode(self, dst): ...
def copystat(self, dst): ...
def copy(self, dst): ...
def copy2(self, dst): ...
}}}

In my proposal:

{{{
def copy(self, dst, copystat=False): ...
}}}

It's just that I think that copyfile, copymode and copystat aren't
usually useful, and there's no reason not to unite copy and copy2.
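
As a sketch of how small the united function can be, here it is as a plain function on top of shutil (a method on the path class would look much the same). shutil.copy and shutil.copy2 are the real standard-library calls; the wrapper itself is only an illustration.

```python
import shutil


def copy(src, dst, copystat=False):
    """Copy src to dst; with copystat=True also copy times and other metadata."""
    if copystat:
        return shutil.copy2(src, dst)  # data, permission bits and stat info
    return shutil.copy(src, dst)       # data and permission bits only
```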

= Other Changes =

Here is a list of the smaller things I've changed in my proposal.

The current normpath removes '..' with the name before them. I didn't
do that, because it doesn't return an equivalent path if the path
before the '..' is a symbolic link.
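
Here is a small demonstration of the problem, assuming a POSIX system where symbolic links can be created. The function name and the directory layout are made up for the example.

```python
import os
import tempfile


def dotdot_demo():
    """Return (exists via the link, exists after normpath) for link/../b."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "x", "y"))
        os.mkdir(os.path.join(root, "a"))
        open(os.path.join(root, "x", "b"), "w").close()
        # a/link points at x/y, so a/link/.. is x, not a.
        os.symlink(os.path.join(root, "x", "y"), os.path.join(root, "a", "link"))
        via_link = os.path.join(root, "a", "link", "..", "b")
        collapsed = os.path.normpath(via_link)  # textually becomes root/a/b
        return os.path.exists(via_link), os.path.exists(collapsed)
```

The operating system resolves a/link/../b to x/b, which exists, while the collapsed form a/b names a file that was never created.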

I removed the methods associated with file extensions. I don't recall
using them, and since they're purely textual and not OS-dependent, I
think that you can always do p[-1].rsplit('.', 1).

I removed renames. Why not use makedirs, rename, removedirs?

I removed unlink. It's an alias to remove, as far as I know.

I removed expand. There's no need to use normpath, so it's equivalent
to .expanduser().expandvars(), and I think that the explicit form is
better.

removedirs - I added another argument, basedir, which won't be removed
even if it's empty. I also allowed the first directory to be non-empty
(I required that it should be a directory). This version is useful for
me.

readlinkabs - The current path class returns abspath(readlink). This
is meaningless - symbolic links are interpreted relative to the
directory they are in, not relative to the current working directory
of the program. Instead, I wrote readlinkpath, which returns the
correct path object. However, I'm not sure if it's needed - why not
use realpath()?
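
The difference can be sketched with plain strings. The function name readlink_path is a stand-in for this example, not the method from the wiki; os.readlink and os.path.join are the real calls.

```python
import os


def readlink_path(link):
    """Return the link's target, interpreted relative to the link's directory."""
    target = os.readlink(link)
    # os.path.join() returns target unchanged when it is already absolute.
    return os.path.join(os.path.dirname(link), target)
```

By contrast, os.path.abspath(os.readlink(link)) resolves a relative target against the current working directory, which gives a wrong answer whenever the program is not running in the link's directory.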

copytree - I removed it. In shutil it's documented as being mostly a
demonstration, and I'm not sure if it's really useful.

symlink - Instead of a function like copy, with the destination as the
second (actually, the only) argument, I wrote "writelink", which gets
a string and creates a symbolic link with that value. The reason is
that symbolic links can be any string, not necessarily a legal path.

I added mknod and mkfifo, which for some reason weren't there.

I added chdir; I don't see why it shouldn't be defined.

relpathto - I used realpath() instead of abspath(). abspath() may be
incorrect if some of the dirs are symlinks.

I removed relpath. It doesn't seem useful to me, and I think that
writing path.cwd().relpathto(p) is easy enough.

join - I decided that p+q should only work if q is a relative path. In
my first implementation, it returned q, which is consistent with the
current os.path.join(). However, I think that in the spirit of
"explicit is better than implicit", code like
{{{
if q.isrel:
    return p + q
else:
    return q
}}}
is pretty easy and pretty clear. I think that many times you want q to
be relative, so an exception if it isn't would be helpful. I also
think that it's nice that {{{ len(p+q) == len(p) + len(q) }}}.

match - The current implementation matches the base name of the path
against a pattern. My implementation matches a relative path against a
pattern, which is also a relative path (it's of the same form as the
pattern of glob - may include '**')
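
Matching a relative path against such a pattern needs no system calls, so it can be sketched on plain tuples. This is an illustration under the assumption that '**' stands for zero or more path elements; the function below is not taken from the wiki implementation.

```python
import fnmatch


def match(parts, pattern):
    """Return True if a tuple of names matches a tuple of glob patterns."""
    if not pattern:
        return not parts
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # '**' may absorb any number of leading names, including none.
        return any(match(parts[i:], rest) for i in range(len(parts) + 1))
    return (bool(parts)
            and fnmatch.fnmatchcase(parts[0], head)
            and match(parts[1:], rest))
```

For example, match(('src', 'pkg', 'mod.py'), ('**', '*.py')) is true, while match(('a', 'b', 'c'), ('a', '*')) is false because the pattern covers only two elements.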

matchcase - I removed it. If you see a reason for keeping it, tell me.

= Comparison to the Current Path Class =

Here's a comparison of doing things using the current path class and
doing things using my proposed path class.

{{{
# Operations on path strings:
p.cwd()        -> p.cwd()
p.abspath()    -> p.abspath()
p.normcase()   -> p.normcase
Also added p.normcasestr, to normalize path elements.
p.normpath()   -> Unneeded
p.realpath()   -> p.realpath()
p.expanduser() -> p.expanduser()
p.expandvars() -> p.expandvars()
p.basename()   -> p[-1]
p.expand()     -> p.expanduser().expandvars()
p.splitpath()  -> Unneeded
p.stripext()   -> p[-1].rsplit('.', 1)[0]
p.splitunc()   -> Unneeded
p.splitall()   -> Unneeded
p.relpath()    -> path.cwd().relpathto(p)
p.relpathto(dst) -> p.relpathto(dst)

# Properties about the path:
p.parent       -> p[:-1]
p.name         -> p[-1]
p.ext          -> ''.join(p[-1].rsplit('.', 1)[1:])
p.drive        -> p[0] if p and isinstance(p[0], path.Drive) else None
p.namebase     -> p[-1].rsplit('.', 1)[0]
p.uncshare     -> p[0] if p and isinstance(p[0], path.UNCRoot) else None

# Operations that return lists of paths:
p.listdir()    -> p.glob()
p.listdir(patt)-> p.glob(patt)
p.dirs()       -> p.glob(onlydirs=1)
p.dirs(patt)   -> p.glob(patt, onlydirs=1)
p.files()      -> p.glob(onlyfiles=1)
p.files(patt)  -> p.glob(patt, onlyfiles=1)
p.walk()       -> p.glob('**')
p.walk(patt)   -> p.glob('**/patt')
p.walkdirs()   -> p.glob('**', onlydirs=1)
p.walkdirs(patt) -> p.glob('**/patt', onlydirs=1)
p.walkfiles()  -> p.glob('**', onlyfiles=1)
p.walkfiles(patt) -> p.glob('**/patt', onlyfiles=1)
p.match(patt)  -> p[-1:].match(patt)
(The current match matches the base name. Mine matches a relative path)
p.matchcase(patt) -> Removed
p.glob(patt)   -> p.glob(patt)

# Methods for retrieving information about the filesystem
# path:
p.exists()     -> p.exists()
Added p.lexists()
p.isabs()      -> not p.isrel
(That's the meaning of the current isabs().)
Added p.isabs
p.isdir()      -> p.isdir()
Added p.lisdir()
p.isfile()     -> p.isfile()
Added p.lisfile()
p.islink()     -> p.islink()
p.ismount()    -> p.ismount()
p.samefile(other) -> p.samefile(other)
p.getatime()   -> p.stat().atime
p.getmtime()   -> p.stat().mtime
p.getctime()   -> p.stat().ctime
p.getsize()    -> p.stat().size
p.access(mode) -> p.access(mode)
p.stat()       -> p.stat()
p.lstat()      -> p.lstat()
p.statvfs()    -> p.statvfs()
p.pathconf(name) -> p.pathconf(name)

# Filesystem properties for path.
atime, mtime, ctime, size - Removed

# Methods for manipulating information about the filesystem
# path.
utime, chmod, chown, rename - unchanged
p.renames(new)   -> new[:-1].makedirs(); p.rename(new); p[:-1].removedirs()

# Create/delete operations on directories
mkdir, makedirs, rmdir, removedirs - unchanged (added an option to removedirs)

# Modifying operations on files
touch, remove - unchanged
unlink - removed

# Modifying operations on links
p.link(newpath)   -> p.link(newpath)
p.symlink(newlink) -> newlink.writelink(p)
p.readlink()      -> p.readlink()
p.readlinkabs()   -> p.readlinkpath()

# High-level functions from shutil
copyfile, copymode, copystat, copytree - removed
p.copy(dst)   -> p.copy(dst)
p.copy2(dst)  -> p.copy(dst, copystat=1)
move, rmtree - unchanged.

# Special stuff from os
chroot, startfile - unchanged.

}}}

= Open Issues =

Unicode - I have no idea about unicode paths. My current
implementation simply uses str. This should be changed, I guess.

Slash-terminated paths - In my current implementation, paths ending
with a slash are normalized to paths without a slash (this is also the
behaviour of os.path.normpath). However, they aren't really the same:
stat() on paths ending with a slash fails if they aren't directories,
and lstat() treats them as directories even if they are symlinks.
Perhaps a final empty string should be allowed.

= Finally =

Please say what you think, either here or on the wiki. Not every
change that I suggested must be accepted, but I would be happy if they
were considered.

I hope it proves useful.
Noam