[Python-3000] Mini Path object

Tue Nov 7 10:15:48 CET 2006

Mike Orr wrote:
> My latest idea is something like this:
> 
> #### BEGIN
> class Path(unicode):
>     """Pathname-manipulation methods."""
>     pathlib = os.path              # Subclass can specify (posix|nt|mac)path.
>     safe_args_only = False    # Glyph can set this to True in a subclass.
>

I'm a little confused here about the model of how platform-specific and 
application-specific formats are represented. Is it the case that the 
creation function converts the platform-specific path into a generic, 
universal path object, or does it create a platform-specific path object?

In the former case, where you create a platform-agnostic "universal" 
path object, the question is: how do you then 'render' that into a 
platform-specific string format?

And in the latter case, where the 'path objects' are themselves in 
platform-specific form, how do you cross-convert from one format to another?

One thing I really want to avoid is the situation where the only 
available formats are those that are built-in to the path module.
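
To make the first model concrete, here is a rough sketch of a "universal" path that is parsed from, and rendered to, pluggable format objects. All of the names here (SeparatorFormat, UniversalPath, RESOURCE) are hypothetical and only for illustration; they are not part of Mike's proposal or of any stdlib module. The sketch also ignores roots and drive letters entirely.

```python
class SeparatorFormat:
    """Hypothetical format object: components joined by one separator."""

    def __init__(self, sep):
        self.sep = sep

    def parse(self, text):
        # Split into components, dropping empty ones (e.g. doubled separators).
        return tuple(part for part in text.split(self.sep) if part)

    def render(self, components):
        return self.sep.join(components)


# Two "built-in" style formats and one application-defined format.
POSIX = SeparatorFormat("/")
NT = SeparatorFormat("\\")
RESOURCE = SeparatorFormat(":")   # assumed app-specific format, not a real standard


class UniversalPath:
    """Hypothetical platform-agnostic path: just a tuple of components."""

    def __init__(self, components):
        self.components = tuple(components)

    @classmethod
    def parse(cls, text, fmt):
        return cls(fmt.parse(text))

    def render(self, fmt):
        return fmt.render(self.components)


path = UniversalPath.parse("textures\\walls\\brick.png", NT)
print(path.render(POSIX))
print(path.render(RESOURCE))
```

Because a format is just an object with parse() and render(), an application can supply its own without the path module knowing about it.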

> On 11/5/06, Talin <talin at acm.org> wrote:
>> Mike Orr wrote:
>>>     Path(  Path("directory"),   "subdirectory", "file")    # Replaces
>>> .joinpath().
>> For the constructor, I would write it as:
>>
>>    Path( *components )
> 
> Yes.  I was just trying to show what the argument objects were.
> 
>> Strings can also be wrapped with an object that indicates that the Path
>> is in a platform- or application-specific format:
>>
>>     # Explicitly indicate that the path string is in Windows NTFS format.
>>     Path( Path.format.NTFS( "C:\\Program Files" ) )
> 
> This (and all your other .format examples) sounds a bit complicated.
> The further we stray from the syntax/semantics of the existing stdlib
> modules, the harder it will be to achieve consensus, and thus the less
> chance we'll get anything into the stdlib.  So I'm for a scalable
> solution, where the lower levels are less controversial and well
> tested, and we can experiment in the higher levels.  Can you make your
> .format proposal into an optional high-level component and we'll see
> how well it works in practice?

First, ignore the syntax of my examples - it's really the semantics that 
I want to get across.

I don't think you need to follow too closely the syntax of os.path - 
rather, we should concentrate on the semantics, and even more 
importantly, the scope of the existing module. In other words don't try 
to do too much more than os.path did.

>> One question to be asked is whether the path should be simplified or
>> not. There are cases where you *don't* want the path to be simplified,
>> and other cases where you do. Perhaps a keyword argument?
>>
>>     Path( "C:\\Program Files", "../../Gimp", normalize = True )
> 
> Maybe.  I'm inclined to let an .only_safe_args attribute or a SafePath
> subclass enforce normalizing, and let the main class do whatever
> os.path.join() does.

For my own code, I want to simplify 100% of the time. Having to insert 
the extra call to normpath() everywhere clutters up the code fairly 
egregiously.

The whole issue of path simplification is really an "end user debugging" 
issue. Since simplified and non-simplified paths are equivalent, the 
only time it really matters whether or not a path is simplified is when 
the end user sees a path, and in my experience this generally only 
happens in error messages such as "File xxx not found" or in batch 
programs that take a list of files as input. Moreover, GUI programs that 
use a file open dialog to choose files generally don't show the full 
path when an error occurs. So really we're limited to the case where you 
have some sort of app that is either script-driven or has a user-edited 
config file (like Make) in which there are fragments of paths being 
joined together.

As the designer of such an app, the question you want to ask yourself 
is, will it be easier for the end user to diagnose the error if we show 
them the simplified result of the concatenation, or the unsimplified, 
concatenated strings? In my case, I almost always prefer to be shown the 
simplified path so that I don't have to mentally simplify all of the 
../../.. bits before looking at that particular location in the 
filesystem to see what the problem is.
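
A small sketch of the difference in an error message, using the existing posixpath functions. The helper name describe_missing and the example directories are made up for illustration.

```python
import posixpath


def describe_missing(base_dir, relative):
    """Return the joined path both as written and after simplification."""
    joined = posixpath.join(base_dir, relative)
    simplified = posixpath.normpath(joined)
    return joined, simplified


joined, simplified = describe_missing("/usr/local/lib", "../../share/doc")
print("File %s not found" % joined)       # shows the raw concatenation
print("File %s not found" % simplified)   # shows the simplified location
```

Note that normpath() works purely on the string, so the two forms can refer to different files when one of the components is a symbolic link.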

>>>     Path("ab") + "c"  => Path("abc")
>> Wouldn't that be:
>>
>>     Path( "ab" ) + "c" => Path( "ab", "c" )
> 
> If we want string compatibility we can't redefine the '+' operator.
> If we ditch string compatibility we can't pass Paths to functions
> expecting a string.  We can't have it both ways.  This also applies to
> character slicing vs component slicing.

If the '+' operator can't be used, then we need to specify how paths are 
joined. Or is it your intent that the only way to concatenate paths be 
via the constructor?

I'm in favor of the verb 'combine' being used to indicate both a joining 
and a simplification of a path.
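
As a sketch only, 'combine' could be a method on a str subclass that joins and then simplifies, leaving '+' as plain string concatenation. The class below is hypothetical and uses posixpath for brevity; a real version would go through whatever platform library the class is configured with.

```python
import posixpath


class Path(str):
    """Hypothetical str-based path with a join-and-simplify method."""

    def combine(self, *others):
        # Join the components, then simplify the result in one step.
        joined = posixpath.join(self, *others)
        return type(self)(posixpath.normpath(joined))


# '+' keeps its string meaning, so string compatibility is preserved.
print(Path("ab") + "c")
# combine() treats its arguments as path components.
print(Path("ab").combine("c"))
print(Path("/usr/local/lib").combine("../../share", "doc"))
```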

>>>     .abspath()
>> I've always thought this was a strange function. To be honest, I'd
>> rather explicitly pass in the cwd().
> 
> I use it; it's convenient.  The method name could be improved.

The thing that bugs me about it is that it relies on a hidden variable, 
in this case cwd(). Although that's partly my personal experience 
speaking, since a lot of the work I do is writing library code on game 
consoles such as Xbox360 and PS3: These platforms have a filesystem, but 
no concept of a "current working directory", so if the app wants to have 
something like a cwd, it has to maintain the variable itself.
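
For illustration, a version that takes the base directory explicitly might look like the sketch below. The function name absolute and the example paths are assumptions; nothing here reads the process-wide current directory.

```python
import posixpath


def absolute(path, base):
    """Resolve 'path' against an explicitly supplied absolute base."""
    if not posixpath.isabs(base):
        raise ValueError("base must be an absolute path: %r" % (base,))
    # join() discards 'base' when 'path' is already absolute.
    return posixpath.normpath(posixpath.join(base, path))


# The application keeps its own notion of a working directory.
app_cwd = "/game/assets"
print(absolute("data/level1.map", app_cwd))
print(absolute("../shared/font.ttf", app_cwd))
```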

>>>     .normcase()
>>>     .normpath()
> 
> ... and other methods.  Your proposals are all in the context of "how
> much do we want to improve os.path's syntax/semantics vs keeping them
> as-is?"  I would certainly like improvement, but improvement works
> against consensus. Some people don't want anything to change beyond a
> minimal class wrapper.  Others want improvements but have differing
> views about what the best "improvements" are.  So how far do you want
> to go, and how does this impact your original question, "Is consensus
> possible?"
> 
> Then, if consensus is not possible, what do we do?  Each go into our
> corners and make our favorite incompatible module?  Or can we come up
> with a master plan containing alternatives, to bring at least some
> unity to the differing modules without cramping anybody's style.

What I'm trying to do is refactor os.path as we go - it seems to me that 
it would be a shame to shovel over the existing functions and 
simply make them object methods, especially (as several people have 
pointed out) since many of them can be combined into more powerful and 
more general methods.

As far as consensus goes: Consensus is not hard to get if you can get it 
piecemeal. People will nitpick at little details of the proposal 
forever, but that's not a huge obstacle to adoption. However, getting 
them to swallow one huge lump of a concept is much harder.

It seems that once you can get people to accept the basic premise of a 
path object, you are home free - the rest is niggling over details. So I 
don't see that this harms the chance of adoption, as long as people are 
allowed to tweak the details of the proposal.

> In this vein, a common utility module with back-end functions would be
> good.  Then we can solve the difficult problems *once* and have a test
> suite that proves it, and people would have confidence using any OO
> classes that are built over them.  We can start by gathering the
> existing os.*, os.path.*, and shutil.* functions, and then add
> whatever other functions our various OO classes might need.
> 
> However, due to the problem of supporting (posix|nt|mac)path, we may
> need to express this as a class of classmethods rather than a set of
> functions, so they can be defined relative to a platform library.
> 
>>>     .realpath()
>> Rename to resolve() or resolvelinks().
> 
> Good idea.
> 
>>>     .expanduser()
>>>     .expandvars()
>>>     .expand()
>> Replace with expand( user=True, vars=True )
> 
> Perhaps.  There was one guy in the discussion about Noam's path module
> who didn't like .expand() at all; he thought it did too many things
> implicitly and was thus too magical.

Well I'd say leave it in then, and see who objects.
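
A minimal sketch of the combined call, written as a plain function over os.path. The keyword names follow the expand( user=True, vars=True ) suggestion above; the environment variable GR_DEMO_ROOT is made up for the demonstration, and unlike some existing path modules this sketch does not normalize the result.

```python
import os


def expand(path, user=True, vars=True):
    """Optionally expand '~' and environment variables in 'path'.

    'vars' shadows the builtin of the same name; it is kept only to
    mirror the keyword suggested in the thread.
    """
    if user:
        path = os.path.expanduser(path)
    if vars:
        path = os.path.expandvars(path)
    return path


os.environ["GR_DEMO_ROOT"] = "/opt/demo"          # demo-only variable
print(expand("$GR_DEMO_ROOT/data", user=False))   # variables expanded
print(expand("$GR_DEMO_ROOT/data", vars=False))   # left untouched
```

Making each expansion an explicit keyword is one way to answer the "too magical" objection: the caller can see, and switch off, each implicit step.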

>>>     .parent
>> If parent was a function, you could pass in the number of levels to go
>> up, i.e. parent( 2 ) to get the grandparent.
> 
> I'd like .ancestor(N) for that.  Parent as a property is nice when
> it's only one or two levels.

Reasonable
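
A sketch of what .ancestor(N) could do, written as a free function over posixpath.dirname(). The exact behaviour at the root is an assumption here: the sketch simply stops at "/" rather than raising.

```python
import posixpath


def ancestor(path, levels=1):
    """Go up 'levels' directories; ancestor(p, 1) is the parent."""
    if levels < 0:
        raise ValueError("levels must be non-negative")
    for _ in range(levels):
        path = posixpath.dirname(path)
    return path


print(ancestor("/a/b/c/d", 1))   # parent
print(ancestor("/a/b/c/d", 2))   # grandparent
```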

>>>     .name                 # Full filename without path
>>>     .namebase        # Filename without extension
>> I find the term 'name' ambiguous. How about:
>>
>>      .filepart
>>      .basepart
>>
>> or:
>>
>>      .filename
>>      .basename
> 
> .name/.namebase isn't great, but nothing else that's been proposed is better.

If all else fails, copy what someone else has done:

http://msdn2.microsoft.com/en-us/library/system.io.path_members(VS.80).aspx

>>>     .drive
>> Do we need to somehow unify the concept of 'drive' and 'unc' part? Maybe
>> '.device' could return the part before the first directory name.
> 
> This gets into the "root object" in Noam's proposal.  I'd say just
> read that and the discussion, and see if it approaches what you want.
> I find this another complicated and potential bog-down point, like
> .format.
> 
> http://wiki.python.org/moin/AlternativePathClass
> http://wiki.python.org/moin/AlternativePathDiscussion
> 
>>>     .splitpath()
>> I'd like to replace this with:
>>
>>     .component( slice_object )
>>
>> where the semantics of 'component' are identical to __getitem__ on an
>> array or tuple. So for example:
>>
>>     Path( "a", "b" ).component( 0 ) => "a"
>>     Path( "a", "b" ).component( 1 ) => "b"
>>     Path( "a", "b" ).component( -1 ) => "b"
>>     Path( "a", "b" ).component( 0:1 ) => Path( "a", "b" )
>>     Path( "a", "b" ).component( 1: ) => Path( "b" )
>>
>> This is essentially the same as the "slice notation" proposal given
>> earlier, except that it explicitly tells the user that we are dealing with
>> path components, not characters.
> 
>     Path("a/b").components[0:1] => Path("a/b")
> 
> Is there a problem with .component returning a Path instead of a list
> of components?

I don't see why we can't have both:

    path.component[ a:b ] # returns a list of components
    path.subpath[ a:b ] # returns a (possibly non-rooted) Path object

Note also that this would mean that ".splitall()" is no longer needed, 
since it can be replaced by "path.component[:]". (More refactoring...woot!)

Another thing I like about the slice syntax is that you can also control 
whether the return value is a single element or a subsequence:

    path.component[ a ] # Gets the a-th component as a string
    path.component[ a:a+1 ] # Gets a list containing the ath component

This distinction is not as easily replicated using a function-call 
syntax such as path.component( a, a+1 ). Although in the case of 
subpath, I am not sure what the distinction is.

(On the naming of 'component' vs. 'components' - my general naming 
convention is that array names are plurals - so a table of primes is 
called 'primes' not 'prime'.)
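
Here is one way the two accessors could be sketched on a str subclass. Everything in it is hypothetical: the class, the helper _SubpathView, and the decision to handle only relative '/'-separated paths (the root is ignored). Note that with Python's usual half-open slices, [0:1] selects only the first component, so both components of a two-part path would be [0:2].

```python
class _SubpathView:
    """Helper so that path.subpath[a:b] can use slice syntax."""

    def __init__(self, parts, path_type):
        self._parts = parts
        self._path_type = path_type

    def __getitem__(self, index):
        selected = self._parts[index]
        if not isinstance(index, slice):
            selected = [selected]          # a single index still yields a Path
        return self._path_type("/".join(selected))


class Path(str):
    """Hypothetical str-based path; relative POSIX-style paths only."""

    @property
    def components(self):
        # A plain list, so indexing and slicing behave exactly like a list.
        return [part for part in self.split("/") if part]

    @property
    def subpath(self):
        return _SubpathView(self.components, type(self))


p = Path("a/b/c")
print(p.components[0])      # one component, as a string
print(p.components[0:1])    # a list holding one component
print(p.components[:])      # all components, replacing splitall()
print(p.subpath[1:])        # a (non-rooted) Path
```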

> In some ways I still like Noam's Path-as-components idea.  It
> eliminates all slicing methods, and '+' does join.  The problem is
> you'd have to explicitly unicode() it when passing it to functions
> that expect a string. I guess the advantage of Path being unicode
> still outweigh the disadvantages.
> 
> Here's one possibility for splitting the absolute/relative part of a path:
> 
>     Path("/a/b").absolute_prefix => "/"
>     relative_start_point = len(Path("/a/b").absolute_prefix)
> 
> It would have to be defined for all platforms.  Or we can have a
> splitroot method:
> 
>     Path("/a/b").splitroot()  =>  [Path("/"), Path("a/b")]
> 
> Not sure which way would be most useful overall.
> 
> 
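
For what it's worth, a splitroot along these lines can be sketched on top of the existing splitdrive() functions, which would also cover the drive/UNC question above. The function below is an illustration only; the parameter name pathmod and the choice to fold the leading separator into the root are assumptions.

```python
import ntpath
import posixpath


def splitroot(path, pathmod=posixpath):
    """Split 'path' into (root, rest) using the given platform module."""
    drive, rest = pathmod.splitdrive(path)
    separators = pathmod.sep + (pathmod.altsep or "")
    stripped = rest.lstrip(separators)
    # Keep one leading separator as part of the root if there was any.
    root = rest[:1] if len(stripped) < len(rest) else ""
    return drive + root, stripped


print(splitroot("/a/b"))
print(splitroot("a/b"))
print(splitroot("C:\\Program Files\\Gimp", ntpath))
print(splitroot("\\\\server\\share\\dir", ntpath))
```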
>>>     .stripext()
>> How about:
>>
>>      path.ext = ''
> 
> The discussion in Noam's proposal has .add_exts(".tar", ".gz") and
> .del_exts(N).  Remember that any component can have extension(s), not
> just the last.  Also, it's up to the user which apparent extensions
> should be considered extensions.  How many extensions does
> "python-2.4.5-i386.2006-12-12.orig.tar.gz" have?

A directory name with a '.' in it isn't really an extension, at least 
not as interpreted by most filesystems. For example, if you create a 
folder in MS Windows, and name it "foo.bar" and then look at it in 
Windows Explorer, it will still say it's a folder; it won't try and 
display it as a "folder of type 'bar'". Similarly, if you are using 
LSCOLORS under posix, and you have a directory with a dot in the name, 
it still shows up in the same color as other dirs.

In any case, I really don't think we need to support any special 
accessors for accessing the part after the '.' in any component but the 
last.

As far as multiple extensions go - the easiest thing would be to simply 
treat the very last part - '.gz' in your example - as the extension, and 
let the user worry about the rest. I only know of one program - GNU tar 
- that attempts to interpret multiple file extensions in this way. (And 
you'll notice that most of the dots in the example are version number 
separators, not file extension separators.)

I'll go further out on a limb here, and say that interpreting the file 
extension really isn't the path library's job, and the only reason why 
this function is here at all is to prevent novice programmers from 
erroneously calling str.rfind( '.' ) on the path string, which will of 
course yield the wrong answer if the filename has no dot in it but a 
directory name does.
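
The pitfall is easy to demonstrate with the existing posixpath.splitext(), which only looks at the last component. The directory name used below is made up; the archive name is the one from Mike's question.

```python
import posixpath

path = "/home/user/my.project/README"

# Naive approach: the last dot is inside the directory name.
naive_ext = path[path.rfind("."):]
print(repr(naive_ext))

# splitext() only considers the final component, so there is no extension.
root, ext = posixpath.splitext(path)
print(repr(ext))

# With several dots, only the very last part counts as the extension.
archive = "python-2.4.5-i386.2006-12-12.orig.tar.gz"
print(posixpath.splitext(archive))
```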

>>>     .splitall()
>> Something sadly lacking in os.path.
> 
> I thought this was what .splitpath() would do.

.splitpath only splits into two pieces, a dirname and a filename.
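
For comparison, here is a sketch of a splitall built by applying the two-way split repeatedly. It is written against posixpath and is only one possible definition; in particular, a path with a trailing slash would produce an empty final component.

```python
import posixpath


def splitall(path):
    """Split 'path' into a list of all of its components."""
    parts = []
    while True:
        head, tail = posixpath.split(path)
        if head == path:        # reached the root of an absolute path
            parts.insert(0, head)
            break
        if tail == path:        # reached the first part of a relative path
            parts.insert(0, tail)
            break
        path = head
        parts.insert(0, tail)
    return parts


print(posixpath.split("/a/b/c"))   # two pieces: dirname and filename
print(splitall("/a/b/c"))          # every component, root included
print(splitall("a/b"))
```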

>>>     .relpathto()
>> Not sure what this does, since there's no argument defined.
> 
> From Orendorff's commentary.
> "The method p1.relpathto(p2) returns a relative path to p2, starting from p1."
> http://www.jorendorff.com/articles/python/path/
> 
> I've always found it confusing to remember which one is 'from' and
> which one is 'to'.
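
One way to sidestep the from/to confusion is to force both arguments to be named. The wrapper below is a hypothetical sketch over posixpath.relpath(); in terms of the quoted description, p1.relpathto(p2) would correspond to relative_path(target=p2, start=p1).

```python
import posixpath


def relative_path(*, target, start):
    """Return a relative path that leads from 'start' to 'target'.

    Both arguments are keyword-only so the call site always says
    which path is which.
    """
    return posixpath.relpath(target, start)


print(relative_path(target="/a/b/c", start="/a/x"))
print(relative_path(target="/a/b", start="/a/b"))
```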