[Python-3000] Mini Path object

Tue Nov 7 23:06:42 CET 2006

On 11/7/06, Talin <talin at acm.org> wrote:
> Mike Orr wrote:
> > My latest idea is something like this:
> >
> > #### BEGIN
> > class Path(unicode):
> >     """Pathname-manipulation methods."""
> >     pathlib = os.path              # Subclass can specify (posix|nt|mac)path.
> >     safe_args_only = False    # Glyph can set this to True in a subclass.
> >
>
> I'm a little confused here about the model of how platform-specific and
> application-specific formats are represented. Is it the case that the
> creation function converts the platform-specific path into a generic,
> universal path object, or does it create a platform-specific path object?
>
> In the former case -- where you create a platform-agnostic "universal"
> path object - the question is, how do you then 'render' that into a
> platform-specific string format?
>
> And in the latter case, where the 'path objects' are themselves in
> platform-specific form, how do you cross-convert from one format to another?
>
> One thing I really want to avoid is the situation where the only
> available formats are those that are built-in to the path module.

My Path object is platform-specific.  By default it uses the "native"
path module (os.path).  If you need non-native paths, subclass Path
and set .pathlib to the other path module (e.g., ntpath, posixpath).
To convert paths we can make the constructor smarter; e.g.,

    import ntpath

    class NTPath(Path):
        pathlib = ntpath

    foo = Path("foo")
    my_nt_path = NTPath(foo)

The constructor might recognize that Path.pathlib != NTPath.pathlib
and call foo.to_universal() to get a universal format, which it would
then convert.

Thinking further, why not use a component tuple as a universal format,
with the first element being the root (or '' for a relative path)?
This is easy to convert to/from any platform string, won't be mistaken
for a platform string, and won't give one platform an advantage over
another.  We can make a pseudo Path class for the universal format,
the only restriction being that you can't pass it directly to a
filesystem-access function.  Actually, you can't pass *any* non-native
path to a filesystem-access function anyway.
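
To make the tuple idea concrete, here is a rough sketch rather than a
settled design.  The free functions to_universal() and from_universal()
and their signatures are my assumptions for illustration.  It only
converts relative paths across platforms; how a root like "/" should
map onto an NT drive is left open.

```python
import ntpath
import posixpath

def to_universal(path, pathlib):
    """Split a platform string into (root, comp1, comp2, ...).

    root is '' for a relative path.  Hypothetical helper, not an
    existing API.
    """
    drive, rest = pathlib.splitdrive(path)
    seps = pathlib.sep + (pathlib.altsep or "")
    is_rooted = bool(rest) and rest[0] in seps
    root = drive + pathlib.sep if is_rooted else drive
    # Treat every separator the platform accepts as the main one.
    for s in seps:
        rest = rest.replace(s, pathlib.sep)
    parts = [p for p in rest.split(pathlib.sep) if p]
    return (root,) + tuple(parts)

def from_universal(parts, pathlib):
    """Render a component tuple as a string for the given platform.

    The root is used as-is, so this only round-trips on the same
    platform or for relative paths.
    """
    root, rest = parts[0], parts[1:]
    return root + pathlib.sep.join(rest)

universal = to_universal("foo/bar", posixpath)
print(universal)                          # ('', 'foo', 'bar')
print(from_universal(universal, ntpath))  # foo\bar
```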

Remember that 99% of Python programmers are concerned only with native
paths.  I have never used a non-native path or multiple-platform paths
in an application.  So we need to make the native case easy and clear.
For that reason I'd rather keep the non-native cases and conversion
code in separate classes.

> I don't think you need to follow too closely the syntax of os.path -
> rather, we should concentrate on the semantics, and even more
> importantly, the scope of the existing module. In other words don't try
> to do too much more than os.path did.

I think a layered approach is important.  Use the code in the existing
modules because it's well tested.  Add a native-path object on top of
that.  Put non-native paths and conversions on the side.  Then put a
filesystem-access class (or functions) on top of the native-path
object.  Then high-level functions/methods on top of that.  Then when
we lobby for stdlib inclusion it'll be "one level and everything below
it".  People can see and test that level, and ignore the (possibly
more controversial) levels above it.

Of course, the layers can be collapsed at a later date if desired.
And the os.path/os/shutil functions can be inlined at some point if
it's decided to deprecate them.  But that's too much for now.

I would really like to improve on os.path and os/shutil.  That's a
separate issue from the class structure I've proposed.  My only reason
for making a thin wrapper would be for greater acceptability (lowest
common denominator).  But I'd rather work on a thick wrapper whose
methods correspond to what the application programmer conceptually
wants to do, rather than having to specify every piddly step (rmtree
if a directory exists, remove if a file exists, nothing if nothing
exists).
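
As an example of the kind of thick-wrapper operation I mean, something
like the following would do.  The name purge() and its exact behavior
are only a sketch; it ignores special files and permission problems.

```python
import os
import shutil

def purge(path):
    """Remove whatever is at path; do nothing if nothing is there."""
    if os.path.islink(path) or os.path.isfile(path):
        # Plain files and symlinks (including links to directories).
        os.remove(path)
    elif os.path.isdir(path):
        # A real directory: remove it and everything under it.
        shutil.rmtree(path)
    # Otherwise nothing exists at path, so there is nothing to do.
```

The caller says "make this go away" once, instead of testing for each
case by hand.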

> >> One question to be asked is whether the path should be simplified or
> >> not. There are cases where you *don't* want the path to be simplified,
> >> and other cases where you do. Perhaps a keyword argument?
> >>
> >>     Path( "C:\\Program Files", "../../Gimp", normalize = True )
> >
> > Maybe.  I'm inclined to let a .safe_args_only attribute or a SafePath
> > subclass enforce normalizing, and let the main class do whatever
> > os.path.join() does.
>
> For my own code, I want to simplify 100% of the time. Having to insert
> the extra call to normpath() everywhere clutters up the code fairly
> egregiously.

That is what I'd prefer.  Let the constructor call normpath().  Maybe
we can just say we don't support non-normalized paths, and if you need
that you can write a NonNormalizedPath subclass or use os.path
directly.

My pet peeve is paths beginning with "./" .  I want that out of error
messages and diagnostic messages!
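
A minimal sketch of a constructor that always normalizes, assuming a
multi-arg constructor that joins its arguments.  It is written for
current Python, so str stands in for unicode, and pathlib is pinned to
posixpath only so the results are the same on every platform; the real
default would be os.path.

```python
import posixpath

class Path(str):
    """Sketch only: joins its arguments, then always normalizes."""
    pathlib = posixpath   # assumption: pinned for a reproducible example

    def __new__(cls, first, *rest):
        joined = cls.pathlib.join(first, *rest)
        return super().__new__(cls, cls.pathlib.normpath(joined))

print(Path("./foo/bar.txt"))     # foo/bar.txt
print(Path("a", "b/../c"))       # a/c
print(type(Path("ab") + "c"))    # <class 'str'>
```

Note that '+' is still plain string concatenation and hands back an
ordinary string, not a Path.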

> >>>     Path("ab") + "c"  => Path("abc")
> >> Wouldn't that be:
> >>
> >>     Path( "ab" ) + "c" => Path( "ab", "c" )
> >
> > If we want string compatibility we can't redefine the '+' operator.
> > If we ditch string compatibility we can't pass Paths to functions
> > expecting a string.  We can't have it both ways.  This also applies to
> > character slicing vs component slicing.
>
> If the '+' operator can't be used, then we need to specify how paths are
> joined. Or is it your intent that the only way to concatenate paths be
> via the constructor?
>
> I'm in favor of the verb 'combine' being used to indicate both a joining
> and a simplification of a path.

I like the syntax of a join method, but with a multi-arg constructor
it's not necessary.  PEP 355 hints at problems with the .joinpath()
method but doesn't say what they are.  .combine() is an OK
name, and there's a precedent in the datetime module.  But people are
used to calling it "joining paths" so I'm not sure we should rename it
so easily.  We can't call it .join due to string compatibility.
.joinpath is ugly if we delete "path" from the other method names.
.pjoin comes to mind but people might find it too cryptic and too
similar to .join.  By just using the constructor we avoid the debate
over the method name.

> >>>     .abspath()
> >> I've always thought this was a strange function. To be honest, I'd
> >> rather explicitly pass in the cwd().
> >
> > I use it; it's convenient.  The method name could be improved.
>
> The thing that bugs me about it is that it relies on a hidden variable,
> in this case cwd(). Although that's partly my personal experience
> speaking, since a lot of the work I do is writing library code on game
> consoles such as Xbox360 and PS3: These platforms have a filesystem, but
> no concept of a "current working directory", so if the app wants to have
> something like a cwd, it has to maintain the variable itself.

Again, 99.9% of Python users have a functioning cwd, and .abspath() is
one of those things people expect in any programming language. Nobody
is forcing you to call it on the Xbox.  However, I'm not completely
opposed to dropping .abspath().

> It seems that once you can get people to accept the basic premise of a
> path object, you are home free - the rest is niggling over details. So I
> don't see that this harms the chance of adoption, as long as people are
> allowed to tweak the details of the proposal.

The syntax details do matter.  They won't necessarily make or break a
proposal but they do make people's support for it stronger or weaker.
But flexibility is a good thing.  If we have a well-chosen structure
but make it easy for people to subclass it and rename methods if they
can't stand our choices, they might accept it anyway.  This also means
not hardwiring class relationships.  For instance, Path should create
paths via self.__class__ rather than calling Path literally, and
FSPath should call self.path_class rather than calling Path directly.
This makes it easy for somebody to write a "better" Path class and
have FSPath use it.
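
Roughly, the wiring could look like this.  The parent() and name()
methods and the BetterPath/BetterFSPath classes are made up for the
example; only self.__class__ and path_class are the point.

```python
import posixpath

class Path(str):
    pathlib = posixpath   # assumption: pinned for a reproducible example

    def __new__(cls, first, *rest):
        return super().__new__(cls, cls.pathlib.join(first, *rest))

    def parent(self):
        # self.__class__, not Path, so subclasses get their own type back.
        return self.__class__(self.pathlib.dirname(self))

class FSPath:
    path_class = Path     # subclasses can swap in another Path class

    def __init__(self, *args):
        self.path = self.path_class(*args)

class BetterPath(Path):
    def name(self):
        return self.pathlib.basename(self)

class BetterFSPath(FSPath):
    path_class = BetterPath

p = BetterPath("/usr/bin/python")
print(type(p.parent()).__name__)                   # BetterPath
print(type(BetterFSPath("/tmp", "x").path).__name__)  # BetterPath
```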

> > In this vein, a common utility module with back-end functions would be
> > good.  Then we can solve the difficult problems *once* and have a test
> > suite that proves it, and people would have confidence using any OO
> > classes that are built over them.  We can start by gathering the
> > existing os.*, os.path.*, and shutil.* functions, and then add
> > whatever other functions our various OO classes might need.
> >
> > However, due to the problem of supporting (posix|nt|mac)path, we may
> > need to express this as a class of classmethods rather than a set of
> > functions, so they can be defined relative to a platform library.

Actually, this may not be a problem after all.  The filesystem-access
functions are universal; e.g., there's only one os.remove().  And they
only make sense with native paths.  So we can call os.path.exists()
from our purge() or super_duper_copy() function and be happy.

>     path.component[ a:b ] # returns a list of components
>     path.subpath[ a:b ] # returns a (possibly non-rooted) Path object

Great.  I was wondering how to handle the two cases.

    path.components[a:b]    =>  list of components
    path.components[a]        =>  one component
    path.subpath(a, b)          =>  a path   (relative if a != 0)

Now if

     Path("/usr/bin/python").components  => ["/", "usr", "bin", "python"]

then we have the universal format I described above.  That's two birds
down with one stone.

I think .subpath should be a method though.  I can't think of another
case where slicing returns a non-sequence.  If a defaults to 0 and b
to -1, it'll be easy for people to get the beginning or end of the
path using normal Python subscripts.  Or we can even allow None for
the endpoints to make it super easy.

>     path.component[ a ] # Gets the ath component as a string
>     path.component[ a:a+1 ] # Gets a list containing the ath component
>
> This distinction is not as easily replicated using a function-call
> syntax such as path.component( a, a+1 ). Although in the case of
> subpath, I am not sure what the distinction is.

There is no distinction.  It's not a sequence in either case.  (Or at
least it's not *that* kind of sequence.  It's a subclass of unicode so
it's a sequence of characters.)

> (On the naming of 'component' vs. 'components' - my general naming
> convention is that array names are plurals - so a table of primes is
> called 'primes' not 'prime'.)

So we agree that .components is better, if I understand you.

> > The discussion in Noam's proposal has .add_exts(".tar", ".gz") and
> > .del_exts(N).  Remember that any component can have extension(s), not
> > just the last.  Also, it's up to the user which apparent extensions
> > should be considered extensions.  How many extensions does
> > "python-2.4.5-i386.2006-12-12.orig.tar.gz" have?
>
> A directory name with a '.' in it isn't really an extension, at least
> not as interpreted by most filesystems. For example, if you create a
> folder in MSWindows, and name it "foo.bar" and then look at it in
> windows explorer, it will still say it's a folder; It won't try and
> display it as a "folder of type 'bar'". Similarly, if you are using
> LSCOLORS under posix, and you have a directory with a dot in the name,
> it still shows up in the same color as other dirs.
>
> In any case, I really don't think we need to support any special
> accessors for accessing the part after the '.' in any component but the
> last.
>
> As far as multiple extensions go - the easiest thing would be to simply
> treat the very last part - '.gz' in your example - as the extension, and
> let the user worry about the rest. I only know of one program - GNU tar
> - that attempts to interpret multiple file extensions in this way. (And
> you'll notice that most of the dots in the example are version number
> separators, not file extension separators.)
>
> I'll go further out on a limb here, and say that interpreting the file
> extension really isn't the path library's job, and the only reason why
> this function is here at all is to prevent novice programmers from
> erroneously calling str.rfind( '.' ) on the path string, which will of
> course yield the wrong answer if the filename has no dot in it but a
> directory name does.

People need to add/delete/replace extensions, and they don't want to
use character slicing / .rfind / .endswith / len() / + to do it.  They
expect the library to at least handle extension splitting as well as
os.path does.  Adding a few convenience methods would be unobtrusive
and express what people really want to do:

    p2 = p.add_ext(".tar")
    p2 = p.del_ext()
    p2 = Path("foo.gzip").replace_ext(".bz2")

But what harm is there in making them scalable to multiple extensions?

    .add_exts(*exts)
    .del_exts(N)
    .replace_exts(N, *exts)

You're right that directory extensions are rare.  Maybe we should just
support extensions on the last path component.
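
For what it's worth, the extension methods could be a thin layer over
splitext(), which already looks only at the last component.  This is a
sketch under that assumption, with pathlib pinned to posixpath for the
example; none of it is a finished API.

```python
import posixpath

class Path(str):
    pathlib = posixpath   # assumption: pinned for a reproducible example

    def add_ext(self, ext):
        return self.__class__(self + ext)

    def del_ext(self):
        # splitext() ignores dots in directory names.
        return self.__class__(self.pathlib.splitext(self)[0])

    def replace_ext(self, ext):
        return self.del_ext().add_ext(ext)

    def add_exts(self, *exts):
        return self.__class__(self + "".join(exts))

    def del_exts(self, n):
        p = self
        for _ in range(n):
            p = p.del_ext()
        return p

    def replace_exts(self, n, *exts):
        return self.del_exts(n).add_exts(*exts)

print(Path("foo.gzip").replace_ext(".bz2"))          # foo.bz2
print(Path("foo.tar.gz").replace_exts(2, ".zip"))    # foo.zip
print(Path("a.d/foo").del_ext())                     # a.d/foo
```

The caller still decides how many trailing pieces count as extensions;
the library just strips N of them.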

-- 
Mike Orr <sluggoster at gmail.com>