Mailman 3 os.path.commonprefix: Yes that old chestnut. - Python-ideas

os.path.commonprefix: Yes that old chestnut.

Paddy3118

21 Mar 2015

12:41 a.m.

I had to add a comment to an RC entry http://rosettacode.org/wiki/Longest_common_prefix#Python that was using os.path.commonprefix to compute the longest common prefix, which it does regardless of path separators (which is what it should do).

It has been discussed before http://bugs.python.org/issue10395 - it goes back to the nineties it seems - but we are still left with a misplaced function (it should be str.commonprefix), with an awful hack of saying "yes we know it's ballsed up" in the documentation https://docs.python.org/3.5/library/os.path.html?highlight=commonprefix#os.p... . I guess we missed a great opportunity to fix this when we moved to Python 3 too!?

The fix seems clear: deprecate os.path.commonprefix whilst creating a true os.path.commonpath and str.commonprefix. The deprecated function should hang around and the standard libs modified to switch to the new function(s).

I've heard that some religious people put obvious faults in their work as only their god should get so close to perfection - maybe that's why we still have this wart :-)
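For readers who have not hit this wart, a minimal sketch of the character-wise behaviour follows. It calls posixpath directly so the result does not depend on the platform it runs on, and the example paths are made up for illustration. The component-wise function shown for contrast is os.path.commonpath, which was added later, in Python 3.5.

```python
import posixpath

# Hypothetical example paths, chosen so the two functions disagree.
paths = ["/usr/lib/python3", "/usr/local/lib"]

# commonprefix compares character by character and ignores separators,
# so the result need not name a real directory.
char_prefix = posixpath.commonprefix(paths)

# commonpath (Python 3.5+) compares whole path components instead.
component_prefix = posixpath.commonpath(paths)

print(char_prefix)       # /usr/l
print(component_prefix)  # /usr
```

The first result is a perfectly good string prefix, which is why the function arguably belongs with str rather than os.path.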


Antoine Pitrou

21 Mar

7:47 a.m.

On Fri, 20 Mar 2015 21:41:03 -0700 (PDT) Paddy3118 wrote:

...

The fix seems clear: deprecate os.path.commonprefix whilst creating a true os.path.commonpath and str.commonprefix. The deprecated function should hang around and the standard libs modified to switch to the new function(s)

+1 from me.

...

I've heard that some religious people put obvious faults in their work as only their god should get so close to perfection - maybe that's why we still have this wart :-)

If by chance their god is Guido, then at least /he/ can intervene to fix his disciples' mess ;-) Regards Antoine.

Tal Einat

23 Mar

10:52 a.m.

On Sat, Mar 21, 2015 at 1:47 PM, Antoine Pitrou wrote:

...

On Fri, 20 Mar 2015 21:41:03 -0700 (PDT) Paddy3118 wrote:

...
The fix seems clear: deprecate os.path.commonprefix whilst creating a true os.path.commonpath and str.commonprefix. The deprecated function should hang around and the standard libs modified to switch to the new function(s)

+1 from me.

Paul Moore

11:24 a.m.

On 23 March 2015 at 14:52, Tal Einat wrote:

...

On Sat, Mar 21, 2015 at 1:47 PM, Antoine Pitrou wrote:

...
On Fri, 20 Mar 2015 21:41:03 -0700 (PDT) Paddy3118 wrote:

...
The fix seems clear: deprecate os.path.commonprefix whilst creating a true os.path.commonpath and str.commonprefix. The deprecated function should hang around and the standard libs modified to switch to the new function(s)

+1 from me.

+1

+1. Maybe adding a commonprefix operation to pathlib would be a good idea as well - using that would avoid the confusion between the deprecated os.path.commonprefix and os.path.commonpath... Paul

Gregory P. Smith

5:33 p.m.

+1 pathlib would be the appropriate place for the correctly behaving function to appear.

os.path.commonprefix()'s behavior would be better off as a str and bytes method rather than a crazy function in the os.path library but I doubt anyone *really* wants to add more methods to those.

On Mon, Mar 23, 2015 at 8:25 AM Paul Moore wrote:

...

On 23 March 2015 at 14:52, Tal Einat wrote:

...
On Sat, Mar 21, 2015 at 1:47 PM, Antoine Pitrou wrote:

...
On Fri, 20 Mar 2015 21:41:03 -0700 (PDT) Paddy3118 wrote:

...
The fix seems clear: deprecate os.path.commonprefix whilst creating a true os.path.commonpath and str.commonprefix. The deprecated function should hang around and the standard libs modified to switch to the new function(s)

+1 from me.

+1

+1. Maybe adding a commonprefix operation to pathlib would be a good idea as well - using that would avoid the confusion between the deprecated os.path.commonprefix and os.path.commonpath...

Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Antoine Pitrou

6:09 p.m.

On Mon, 23 Mar 2015 21:33:48 +0000 "Gregory P. Smith" wrote:

...

+1 pathlib would be the appropriate place for the correctly behaving function to appear.

Patches welcome :-) Regards Antoine.

...

os.path.commonprefix()'s behavior would be better off as a str and bytes method rather than a crazy function in the os.path library but I doubt anyone *really* wants to add more methods to those.

On Mon, Mar 23, 2015 at 8:25 AM Paul Moore wrote:

...

On 23 March 2015 at 14:52, Tal Einat wrote:

...
On Sat, Mar 21, 2015 at 1:47 PM, Antoine Pitrou wrote:

...
On Fri, 20 Mar 2015 21:41:03 -0700 (PDT) Paddy3118 wrote:

...
The fix seems clear: deprecate os.path.commonprefix whilst creating a true os.path.commonpath and str.commonprefix. The deprecated function should hang around and the standard libs modified to switch to the new function(s)

+1 from me.

+1

+1. Maybe adding a commonprefix operation to pathlib would be a good idea as well - using that would avoid the confusion between the deprecated os.path.commonprefix and os.path.commonpath...

Paul

Paul Moore

6:48 p.m.

On 23 March 2015 at 22:09, Antoine Pitrou wrote:

...

On Mon, 23 Mar 2015 21:33:48 +0000 "Gregory P. Smith" wrote:

...
+1 pathlib would be the appropriate place for the correctly behaving function to appear.

Patches welcome :-)

I'll see what I can do. Basically it's just

    def commonprefix(p1, p2):
        cp = []
        for (pp1, pp2) in zip(p1.parts, p2.parts):
            if pp1 != pp2:
                break
            cp.append(pp1)
        return pathlib.Path(*cp)

(extended to an arbitrary number of args) but getting the corner cases right is a little more tricky.

If there is no common prefix, I guess returning None is correct - it's the nearest equivalent to os.path.commonprefix returning ''.

The type of the return value should be a concrete type - probably type(p1)?

Maybe this should be a method on Path objects - p.commonprefix(p1, p2, ...) although it seems a bit odd that the first argument (self) is treated differently in the signature.

Anyone got any preferences?

Paul

Paul Moore

6:56 p.m.

On 23 March 2015 at 22:48, Paul Moore wrote:

...

I'll see what I can do. Basically it's just

More specifically, as a pathlib.Path method:

    def commonprefix(self, *rest):
        cp = []
        for p0, *ps in zip(self.parts, *[pp.parts for pp in rest]):
            if any(p0 != pi for pi in ps):
                break
            cp.append(p0)
        if not cp:
            return None
        return type(self)(*cp)

Paul
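The method above can be tried out without patching pathlib by writing it as a free function. This is only a sketch: the function name common_prefix, the parameter names, and the example paths are assumptions made for illustration, and PurePosixPath is used so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def common_prefix(first, *rest):
    """Return the leading parts shared by all paths, or None if there are none."""
    shared = []
    # zip stops at the shortest path, so a shorter path caps the prefix length.
    for part, *others in zip(first.parts, *(p.parts for p in rest)):
        if any(part != other for other in others):
            break
        shared.append(part)
    if not shared:
        return None
    # Build the result with the type of the first argument, as in the method.
    return type(first)(*shared)


a = PurePosixPath("/srv/www/site/index.html")
b = PurePosixPath("/srv/www/static/logo.png")
print(common_prefix(a, b))                                        # /srv/www
print(common_prefix(PurePosixPath("x/y"), PurePosixPath("z/w")))  # None
```

Note that two absolute POSIX paths always share at least the root part '/', so None only shows up for relative paths (or, on Windows, paths on different drives).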

random832＠fastmail.us

9:54 p.m.

On Mon, Mar 23, 2015, at 18:48, Paul Moore wrote:

...

The type of the return value should be a concrete type - probably type(p1)?

I'd argue it should be the common supertype, so a PurePosixPath if both are posix and one is pure, a Path or PurePath if one is windows and the other is posix.

Paul Moore

24 Mar

6:05 a.m.

On 24 March 2015 at 01:54, wrote:

...

On Mon, Mar 23, 2015, at 18:48, Paul Moore wrote:

...
The type of the return value should be a concrete type - probably type(p1)?

I'd argue it should be the common supertype, so a PurePosixPath if both are posix and one is pure, a Path or PurePath if one is windows and the other is posix.

That's not really possible in the face of the possibility that an argument could be a user-defined class, possibly not even a pathlib.Path subclass (given duck typing). That's a clear benefit of a Path method, actually - the type of the return value is easy to specify - it's the type of self.

Actually, in many ways, this is really a list (sequence) method - common_prefix - applied to the "parts" property of a Path. It's a shame there isn't a sequence utils module in the stdlib...

One thing my implementation doesn't (yet) handle is case sensitivity. The common prefix of WindowsPath('c:\\FOO\\bar') and WindowsPath('C:\\Foo\\BAR') should be WindowsPath('C:\\Foo'). But not for PosixPath. (And again, when they are mixed, which is silly but possible, what behaviour should apply? "Work like self" is the obvious answer if we have a method).

Paul

random832＠fastmail.us

9:36 a.m.

On Tue, Mar 24, 2015, at 06:05, Paul Moore wrote:

...

One thing my implementation doesn't (yet) handle is case sensitivity. The common prefix of WindowsPath('c:\\FOO\\bar') and WindowsPath('C:\\Foo\\BAR') should be WindowsPath('C:\\Foo'). But not for PosixPath. (And again, when they are mixed, which is silly but possible, what behaviour should apply? "Work like self" is the obvious answer if we have a method).

So, speaking of windows path oddities... what should be done for paths where one has a drive and the other does not? No common prefix? Interpret as the current drive? Do the former for pure paths and the latter for concrete paths? Pass in an option? Keeping in mind that people may write "portable" code but put in no effort to specifically support windows.

...

And again, when they are mixed, which is silly but possible, what behaviour should apply

For concrete paths, using the actual names of the files if they exist would be another option.

Paul Moore

9:43 a.m.

On 24 March 2015 at 13:36, wrote:

...

On Tue, Mar 24, 2015, at 06:05, Paul Moore wrote:

...
One thing my implementation doesn't (yet) handle is case sensitivity. The common prefix of WindowsPath('c:\\FOO\\bar') and WindowsPath('C:\\Foo\\BAR') should be WindowsPath('C:\\Foo'). But not for PosixPath. (And again, when they are mixed, which is silly but possible, what behaviour should apply? "Work like self" is the obvious answer if we have a method).

So, speaking of windows path oddities... what should be done for paths where one has a drive and the other does not? No common prefix? Interpret as the current drive? Do the former for pure paths and the latter for concrete paths? Pass in an option?

It's worth pointing out that all of these edge cases only occur for relative paths. It's quite possible that the only actual use cases would be completely fine if the operation was only defined on absolute paths. My instinct says you do this *purely* based on the common prefix of path.parts. The behaviour is then easy to define, and if it's not precisely what someone wants they can convert the paths to absolute and do the operations with that. I'd have to see real use cases to justify anything else.
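To see what "purely based on path.parts" means for the drive question, here is a small sketch using PureWindowsPath, which can be instantiated on any platform. The two example paths are made up for illustration.

```python
from pathlib import PureWindowsPath

# Hypothetical paths: the same directories, with and without a drive letter.
with_drive = PureWindowsPath(r"C:\work\src")
without_drive = PureWindowsPath(r"\work\src")

# The drive and root are folded into the first element of parts.
print(with_drive.parts)     # ('C:\\', 'work', 'src')
print(without_drive.parts)  # ('\\', 'work', 'src')

# A parts-based comparison therefore stops at the very first element.
shared = []
for x, y in zip(with_drive.parts, without_drive.parts):
    if x != y:
        break
    shared.append(x)
print(shared)  # []
```

Under these semantics the answer to "one has a drive and the other does not" is simply "no common prefix", with no reference to the current drive.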

...

Keeping in mind that people may write "portable" code but put in no effort to specifically support windows.

Precisely :-)

...

...
And again, when they are mixed, which is silly but possible, what behaviour should apply

For concrete paths, using the actual names of the files if they exist would be another option.

You can't have mixed concrete paths, you can only ever instantiate one of PosixPath or WindowsPath on a given system, AIUI. Paul

Andrew Barnert

9:44 a.m.

On Mar 24, 2015, at 3:05 AM, Paul Moore wrote:

...

...
On 24 March 2015 at 01:54, wrote:

...
On Mon, Mar 23, 2015, at 18:48, Paul Moore wrote:

The type of the return value should be a concrete type - probably type(p1)?

I'd argue it should be the common supertype, so a PurePosixPath if both are posix and one is pure, a Path or PurePath if one is windows and the other is posix.

That's not really possible in the face of the possibility that an argument could be a user-defined class, possibly not even a pathlib.Path subclass (given duck typing). That's a clear benefit of a Path method, actually - the type of the return value is easy to specify - it's the type of self.

Actually, in many ways, this is really a list (sequence) method - common_prefix - applied to the "parts" property of a Path. It's a shame there isn't a sequence utils module in the stdlib...

That's a good point. But do you really care that the result is a list (actually, isn't parts a tuple, not a list?), or just that it's some kind of iterable--or, even more generally, something you can make a Path object out of? Because there _is_ an iterable utils module in the stdlib, and I think the implementation is simpler if you think of it that way too:

    def common_prefix(x: Iterable[X], y: Iterable[X]) -> Iterator[X]:
        for a, b in zip(x, y):
            if a != b:
                return
            yield a

(Or, if you prefer, implement it as a chain of zip, takewhile, and map(itemgetter) then yield from the result.)

If you as a user want to turn that back into a tuple, you can, but normally you're just going to want to join them back up into a Path (or a type(p1)) without bothering with that.
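The parenthetical alternative (zip, takewhile, map(itemgetter)) might look roughly like the following. This is one possible reading of that sentence rather than code from the thread, and the type annotations are dropped to keep it short.

```python
from itertools import takewhile
from operator import itemgetter


def common_prefix(x, y):
    """Yield the leading items that two iterables have in common."""
    # Pair the items up and keep pairs only while both members are equal.
    matching = takewhile(lambda pair: pair[0] == pair[1], zip(x, y))
    # Each surviving pair holds two equal items, so yield the first of each.
    yield from map(itemgetter(0), matching)


print(tuple(common_prefix(("a", "b", "c"), ("a", "b", "d"))))  # ('a', 'b')
```

Because it only iterates, it works on path.parts tuples, strings, or one-shot iterators alike.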

...

One thing my implementation doesn't (yet) handle is case sensitivity. The common prefix of WindowsPath('c:\\FOO\\bar') and WindowsPath('C:\\Foo\\BAR') should be WindowsPath('C:\\Foo').

Not 'c:\\FOO'? I'd expect the left one to win--especially if it's a method, so the left one is self.

...

But not for PosixPath. (And again, when they are mixed, which is silly but possible, what behaviour should apply? "Work like self" is the obvious answer if we have a method).

Needless to say, an itertools (or "sequencetools") function that you call on parts does nothing to either help or hinder this problem. But it does seem to lend itself better to approaches where parts holds some new FooPathComponent type, or maybe a str on POSIX but a new CaseInsensitiveStr on Windows.

Paul Moore

9:51 a.m.

On 24 March 2015 at 13:44, Andrew Barnert wrote:

...

...
Actually, in many ways, this is really a list (sequence) method - common_prefix - applied to the "parts" property of a Path. It's a shame there isn't a sequence utils module in the stdlib...

That's a good point. But do you really care that the result is a list (actually, isn't parts a tuple, not a list?), or just that it's some kind of iterable--or, even more generally, something you can make a Path object out of? Because there _is_ an iterable utils module in the stdlib, and I think the implementation is simpler if you think of it that way too:

    def common_prefix(x: Iterable[X], y: Iterable[X]) -> Iterator[X]:
        for a, b in zip(x, y):
            if a != b:
                return
            yield a

(Or, if you prefer, implement it as a chain of zip, takewhile, and map(itemgetter) then yield from the result.)

If you as a user want to turn that back into a tuple, you can, but normally you're just going to want to join them back up into a Path (or a type(p1)) without bothering with that.

I was thinking that there might be a reason it wouldn't work for arbitrary iterators, so you'd need at least a Sequence. But you're right, an itertool is sufficient. Although given the itertools focus on building blocks, it may end up being simply a recipe.

...

...
One thing my implementation doesn't (yet) handle is case sensitivity. The common prefix of WindowsPath('c:\\FOO\\bar') and WindowsPath('C:\\Foo\\BAR') should be WindowsPath('C:\\Foo').

Not 'c:\\FOO'? I'd expect the left one to win--especially if it's a method, so the left one is self.

Technically, WindowsPath('C:\\FOO') and WindowsPath('C:\\Foo') are the same, so I stand by what I said :-) But yeah, the natural implementation would give you the relevant part of self.

...

...
But not for PosixPath. (And again, when they are mixed, which is silly but possible, what behaviour should apply? "Work like self" is the obvious answer if we have a method).

Needless to say, an itertools (or "sequencetools") function that you call on parts does nothing to either help or hinder this problem. But it does seem to lend itself better to approaches where parts holds some new FooPathComponent type, or maybe a str on POSIX but a new CaseInsensitiveStr on Windows.

    (p.__class__(pp) for pp in p.parts)

(See the thread about parts containing strings - technically using Path objects is dodgy, but as long as you don't leak the working values out of your function it's perfectly adequate).

Paul
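A short sketch of why wrapping each part in the path class helps: PureWindowsPath objects compare case-insensitively, while the raw strings in parts do not. The helper name shared_parts and the example paths are assumptions for illustration; the paths deliberately differ in their last component so that the prefix is shorter than the whole path.

```python
from pathlib import PureWindowsPath


def shared_parts(p1, p2, wrap):
    """Collect leading parts of p1 that equal p2's after applying wrap to each."""
    out = []
    for a, b in zip(map(wrap, p1.parts), map(wrap, p2.parts)):
        if a != b:
            break
        out.append(a)
    return out


p1 = PureWindowsPath(r"c:\FOO\bar")
p2 = PureWindowsPath(r"C:\Foo\baz")

# Plain strings: 'c:\\' != 'C:\\', so nothing matches.
naive = shared_parts(p1, p2, str)

# Path-wrapped parts use the flavour's own (case-insensitive) comparison.
wrapped = shared_parts(p1, p2, type(p1))
result = type(p1)(*wrapped)

print(naive)                                   # []
print(result == PureWindowsPath(r"C:\Foo"))    # True
```

The result is assembled from the parts of p1, so it keeps the spelling of the left-hand path, which matches the "work like self" idea.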

Andrew Barnert

10:07 a.m.

On Mar 24, 2015, at 6:51 AM, Paul Moore wrote:

...

On 24 March 2015 at 13:44, Andrew Barnert wrote:

...
...
Actually, in many ways, this is really a list (sequence) method - common_prefix - applied to the "parts" property of a Path. It's a shame there isn't a sequence utils module in the stdlib...

That's a good point. But do you really care that the result is a list (actually, isn't parts a tuple, not a list?), or just that it's some kind of iterable--or, even more generally, something you can make a Path object out of? Because there _is_ an iterable utils module in the stdlib, and I think the implementation is simpler if you think of it that way too:

    def common_prefix(x: Iterable[X], y: Iterable[X]) -> Iterator[X]:
        for a, b in zip(x, y):
            if a != b:
                return
            yield a

(Or, if you prefer, implement it as a chain of zip, takewhile, and map(itemgetter) then yield from the result.)

If you as a user want to turn that back into a tuple, you can, but normally you're just going to want to join them back up into a Path (or a type(p1)) without bothering with that.

I was thinking that there might be a reason it wouldn't work for arbitrary iterators, so you'd need at least a Sequence. But you're right, an itertool is sufficient. Although given the itertools focus on building blocks, it may end up being simply a recipe.

...
...
One thing my implementation doesn't (yet) handle is case sensitivity. The common prefix of WindowsPath('c:\\FOO\\bar') and WindowsPath('C:\\Foo\\BAR') should be WindowsPath('C:\\Foo').

Not 'c:\\FOO'? I'd expect the left one to win--especially if it's a method, so the left one is self.

Technically, WindowsPath('C:\\FOO') and WindowsPath('C:\\Foo') are the same, so I stand by what I said :-) But yeah, the natural implementation would give you the relevant part of self.

...
...
But not for PosixPath. (And again, when they are mixed, which is silly but possible, what behaviour should apply? "Work like self" is the obvious answer if we have a method).

Needless to say, an itertools (or "sequencetools") function that you call on parts does nothing to either help or hinder this problem. But it does seem to lend itself better to approaches where parts holds some new FooPathComponent type, or maybe a str on POSIX but a new CaseInsensitiveStr on Windows.

(p.__class__(pp) for pp in p.parts)

Sure, but then your whole expression looks something like:

    p1.__class__(*more_itertools.common_prefix(
        (p1.__class__(pp) for pp in p1.parts),
        (p2.__class__(pp) for pp in p2.parts)))

Which doesn't read quite as nicely as "just call an itertools function on the parts and construct a Path from them" sounds like it should.

Which implies that you'd probably want at least a recipe in the pathlib docs that referenced the recipe in the itertools docs or something.

And that many people who aren't on Windows just wouldn't bother and would write something non-portable until they got a complaint from a Windows user and found it worth investigating...

(While we're at it: most POSIX OS's can handle both case-sensitive and case-insensitive filesystems, and at least some OS X functions take that into account, although that may not be true at the BSD level, only at the POSIX level. For that matter, doesn't the HFS+ filesystem also consider two paths equal if they have the same NFKD, even if they have different code points? But I guess if I'm remembering right, this would be no more or less broken than any other use of PosixPath on Mac, so it's not worth worrying about here, right?)

...

(See the thread about parts containing strings - technically using Path objects is dodgy, but as long as you don't leak the working values out of your function it's perfectly adequate).

Yeah, I agree that it's safe here even though it isn't safe in general.

Paul Moore

10:16 a.m.

On 24 March 2015 at 14:07, Andrew Barnert wrote:

...

...
...
Needless to say, an itertools (or "sequencetools") function that you call on parts does nothing to either help or hinder this problem. But it does seem to lend itself better to approaches where parts holds some new FooPathComponent type, or maybe a str on POSIX but a new CaseInsensitiveStr on Windows.

(p.__class__(pp) for pp in p.parts)

Sure, but then your whole expression looks something like:

    p1.__class__(*more_itertools.common_prefix(
        (p1.__class__(pp) for pp in p1.parts),
        (p2.__class__(pp) for pp in p2.parts)))

Which doesn't read quite as nicely as "just call an itertools function on the parts and construct a Path from them" sounds like it should.

Agreed, absolutely :-)

...

Which implies that you'd probably want at least a recipe in the pathlib docs that referenced the recipe in the itertools docs or something.

And that many people who aren't on Windows just wouldn't bother and would write something non-portable until they got a complaint from a Windows user and found it worth investigating...

I still think it's worth considering as a path object method, for convenience. I just think the *semantics* should be that of the equivalent list operation, because that's easily understandable. But to an extent, that's the trap that os.path.commonprefix fell into (although to a much worse level...) Hence my concern that we find some real use cases for the operation, so that we can ensure that the list semantics match what people actually want the operation to do.

...

(While we're at it: most POSIX OS's can handle both case-sensitive and case-insensitive filesystems, and at least some OS X functions take that into account, although that may not be true at the BSD level, only at the POSIX level. For that matter, doesn't the HFS+ filesystem also consider two paths equal if they have the same NFKD, even if they have different code points? But I guess if I'm remembering right, this would be no more or less broken than any other use of PosixPath on Mac, so it's not worth worrying about here, right?)

This (and the decision to treat a/b/ and a/b as the same) are decisions that have already been made, for better or worse, by pathlib, and I have no intention of getting sucked into them here. IMO, a common_prefix method on pathlib objects should follow pathlib semantics, and can easily do that by being built on top of existing pathlib operations and uncontroversial list operations. That's the only approach that I think makes sense - and the only remaining question is whether it meets a real-world requirement, or whether it will end up being an oddity that no-one ever uses because it doesn't *quite* do what they want. Paul

Paul Moore

7:56 a.m.

On 23 March 2015 at 21:33, Gregory P. Smith wrote:

...

+1 pathlib would be the appropriate place for the correctly behaving function to appear.

OK, so here's a question. What actual use cases exist for a common_prefix function? The reason I ask is that I'm looking at some of the edge cases, and the obvious behaviour isn't particularly clear to me.

For example, common_prefix('a/b/file.c', 'a/b/file.c'). The common prefix is obviously 'a/b/file.c' - but I can imagine people *actually* wanting the common *directory* containing both files. But taken literally, that's only possible if you check the filesystem, so it would no longer be a PurePath operation.

And what about common_prefix('foo/bar', '../here/foo')? Or common_prefix('bar/baz', 'foo/../bar/baz')? Pathlib avoids collapsing .. because the meaning could change in the face of symlinks. I believe the same applies here. Maybe you need to call resolve() before doing the common prefix operation (but that gives an absolute path).

With the above limitations, would a common_prefix function actually help typical use cases? In my experience, doing list operations on pathobj.parts is often simple enough that I don't need specialised functions like common_prefix...

Getting the edge cases right is fiddly enough for common_prefix that a specialised function is a reasonable idea, but only if the "obvious" behaviour is clear. If there's a lot of conflicting possibilities, maybe a recipe in the docs would be a better option.

Paul
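The '..' edge case is easy to check against parts directly. The sketch below only inspects what pathlib already reports for the example paths from the message; it does not touch the filesystem.

```python
from pathlib import PurePosixPath

plain = PurePosixPath("bar/baz")
dotted = PurePosixPath("foo/../bar/baz")

# Pure paths keep '..' components rather than collapsing them.
print(plain.parts)   # ('bar', 'baz')
print(dotted.parts)  # ('foo', '..', 'bar', 'baz')

# So a parts-based comparison sees 'bar' vs 'foo' first and finds nothing
# in common, even though both may name the same file on disk.
first_parts_differ = plain.parts[0] != dotted.parts[0]
print(first_parts_differ)  # True
```

Whether that outcome is acceptable is exactly the use-case question being asked here.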

random832＠fastmail.us

9:41 a.m.

On Tue, Mar 24, 2015, at 07:56, Paul Moore wrote:

...

For example, common_prefix('a/b/file.c', 'a/b/file.c'). The common prefix is obviously 'a/b/file.c' - but I can imagine people *actually* wanting the common *directory* containing both files. But taken literally, that's only possible if you check the filesystem, so it would no longer be a PurePath operation.

Or you could _always_ reject the last component. That is, even if "file.c" is a directory, return the directory it is in rather than itself. Does parts differentiate between "a/b/c" and "a/b/c/"? There are certainly contexts where real filesystems differentiate between them (symlinks, for example).

Paul Moore

9:45 a.m.

On 24 March 2015 at 13:41, wrote:

...

On Tue, Mar 24, 2015, at 07:56, Paul Moore wrote:

...
For example, common_prefix('a/b/file.c', 'a/b/file.c'). The common prefix is obviously 'a/b/file.c' - but I can imagine people *actually* wanting the common *directory* containing both files. But taken literally, that's only possible if you check the filesystem, so it would no longer be a PurePath operation.

Or you could _always_ reject the last component. That is, even if "file.c" is a directory, return the directory it is in rather than itself.

I'm pretty sure there would be cases where that's not right. For example, commonprefix('a/b', 'a/b/file.c') would be 'a'. That seems wrong.
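The "always reject the last component" rule can be sketched as follows, to make the counter-example concrete. The helper name common_parent_parts is an assumption for illustration; it returns a tuple of parts rather than a path object to keep the focus on the comparison.

```python
from pathlib import PurePosixPath


def common_parent_parts(*paths):
    """Common leading parts after dropping each path's final component."""
    shared = []
    # p.parts[:-1] treats the last component as a file name and ignores it.
    for first, *rest in zip(*(p.parts[:-1] for p in paths)):
        if any(first != other for other in rest):
            break
        shared.append(first)
    return tuple(shared)


same_file = PurePosixPath("a/b/file.c")
print(common_parent_parts(same_file, same_file))  # ('a', 'b')

# The case from the message: 'b' is dropped from the first path,
# so only 'a' survives even though a/b contains a/b/file.c.
print(common_parent_parts(PurePosixPath("a/b"), PurePosixPath("a/b/file.c")))  # ('a',)
```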

...

Does parts differentiate between "a/b/c" and "a/b/c/"? There are certainly contexts where real filesystems differentiate between them (symlinks, for example).

No, it doesn't:

    >>> pathlib.Path('a/b/').parts
    ('a', 'b')
    >>> pathlib.Path('a/b').parts
    ('a', 'b')

Paul

Andrew Barnert

9:55 a.m.

On Mar 24, 2015, at 4:56 AM, Paul Moore wrote:

...

...
On 23 March 2015 at 21:33, Gregory P. Smith wrote:

+1 pathlib would be the appropriate place for the correctly behaving function to appear.

OK, so here's a question. What actual use cases exist for a common_prefix function? The reason I ask is that I'm looking at some of the edge cases, and the obvious behaviour isn't particularly clear to me.

For example, common_prefix('a/b/file.c', 'a/b/file.c'). The common prefix is obviously 'a/b/file.c' - but I can imagine people *actually* wanting the common *directory* containing both files. But taken literally, that's only possible if you check the filesystem, so it would no longer be a PurePath operation.

The traditional way to handle this is that the basename (the part after the last '/') is assumed to be a file (if you don't want that, include the trailing slash). POSIX even defines the technical term "path prefix" to mean everything up to the last slash, so something called a "common path prefix" sounds like it should be the common prefix of the path prefixes, right? Except that not every command and function in POSIX works this way, requiring you to memorize or look up the man page to see what someone chose as "obvious" back in the 1970s....

At any rate, we probably don't need to figure this out from first principles; I'm pretty sure some subset of {Java, Boost, Cocoa, .NET, JUCE, one overwhelmingly popular CPAN library, etc.} have already come up with an answer, and if most of them agree, we probably want to follow suit (even if it seems silly).

...

And what about common_prefix('foo/bar', '../here/foo')? Or common_prefix('bar/baz', 'foo/../bar/baz')? Pathlib avoids collapsing .. because the meaning could change in the face of symlinks. I believe the same applies here. Maybe you need to call resolve() before doing the common prefix operation (but that gives an absolute path).

With the above limitations, would a common_prefix function actually help typical use cases? In my experience, doing list operations on pathobj.parts is often simple enough that I don't need specialised functions like common_prefix...

Getting the edge cases right is fiddly enough for common_prefix that a specialised function is a reasonable idea, but only if the "obvious" behaviour is clear. If there's a lot of conflicting possibilities, maybe a recipe in the docs would be a better option.

Paul

Andrew Barnert

10:44 a.m.

On Mar 24, 2015, at 6:55 AM, Andrew Barnert wrote:

...

...
On Mar 24, 2015, at 4:56 AM, Paul Moore wrote:

...
On 23 March 2015 at 21:33, Gregory P. Smith wrote:

+1 pathlib would be the appropriate place for the correctly behaving function to appear.

OK, so here's a question. What actual use cases exist for a common_prefix function? The reason I ask is that I'm looking at some of the edge cases, and the obvious behaviour isn't particularly clear to me.

For example, common_prefix('a/b/file.c', 'a/b/file.c'). The common prefix is obviously 'a/b/file.c' - but I can imagine people *actually* wanting the common *directory* containing both files. But taken literally, that's only possible if you check the filesystem, so it would no longer be a PurePath operation.

The traditional way to handle this is that the basename (the part after the last '/') is assumed to be a file (if you don't want that, include the trailing slash). POSIX even defines the technical term "path prefix" to mean everything up to the last slash, so something called a "common path prefix" sounds like it should be the common prefix of the path prefixes, right? Except that not every command and function in POSIX works this way, requiring you to memorize or look up the man page to see what someone chose as "obvious" back in the 1970s....

Sorry, forgot to fill in the cite. See 3.2.69 at http://pubs.opengroup.org/stage7tc1/basedefs/V1_chap03.html for the 2008 definition of "path prefix".

...

At any rate, we probably don't need to figure this out from first principles; I'm pretty sure some subset of {Java, Boost, Cocoa, .NET, JUCE, one overwhelmingly popular CPAN library, etc.} have already come up with an answer, and if most of them agree, we probably want to follow suit (even if it seems silly).

From a quick search, it looks like many other languages don't define a common prefix path method/function, but many do define a generic iterable common-prefix function (or a first-mismatch function and random-access iterables so you can easily build one trivially, as in C++). This implies that the obvious solution in most languages will include the basename, not skip it. And it looks like http://rosettacode.org/wiki/Find_common_directory_path agrees with that.

From my other reply, looking over the functions used in some of the rosettacode examples, it looks like the generic iterable function, when it exists, and the language makes it feasible, often handles an arbitrary number of arguments, not just two. Which makes sense, now that I think about it. So:

    def common_prefix(*iterables):
        for first, *rest in zip(*iterables):
            if any(first != part for part in rest):
                return
            yield first

And of course to wrap it up:

    def common_path_prefix(*paths):
        return paths[0].__class__(
            *(common_prefix(*((path.__class__(p) for p in path.parts)
                              for path in paths)))

Except I'll bet I got the parens wrong somewhere; it's probably clearer as:

    def common_path_prefix(*paths):
        parts = (map(path.__class__, path.parts) for path in paths)
        prefix = common_prefix(*parts)
        return paths[0].__class__(*prefix)
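Put together as a self-contained script, the clearer version runs as shown below. The function bodies follow the message; the example paths are made up, and PurePosixPath is used so the output is platform-independent. One thing worth noticing is that, unlike the earlier method that returned None, this version returns an empty path ('.') when nothing is shared.

```python
from pathlib import PurePosixPath


def common_prefix(*iterables):
    """Yield leading items shared by every iterable."""
    for first, *rest in zip(*iterables):
        if any(first != part for part in rest):
            return
        yield first


def common_path_prefix(*paths):
    """Common leading components of the given paths, as type(paths[0])."""
    # Wrap each part in its path class so comparisons follow pathlib's rules
    # (for example, case-insensitive on Windows flavours).
    parts = (map(path.__class__, path.parts) for path in paths)
    prefix = common_prefix(*parts)
    return paths[0].__class__(*prefix)


# Hypothetical example paths.
result = common_path_prefix(
    PurePosixPath("/usr/lib/python3/os.py"),
    PurePosixPath("/usr/lib/python3/re.py"),
    PurePosixPath("/usr/local/bin"),
)
print(result)  # /usr

# No shared leading component: the result is the empty path, printed as '.'.
print(common_path_prefix(PurePosixPath("a/x"), PurePosixPath("b/x")))  # .
```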

...

...
And what about common_prefix('foo/bar', '../here/foo')? Or common_prefix('bar/baz', 'foo/../bar/baz')? Pathlib avoids collapsing .. because the meaning could change in the face of symlinks. I believe the same applies here. Maybe you need to call resolve() before doing the common prefix operation (but that gives an absolute path).

With the above limitations, would a common_prefix function actually help typical use cases? In my experience, doing list operations on pathobj.parts is often simple enough that I don't need specialised functions like common_prefix...

Getting the edge cases right is fiddly enough for common_prefix that a specialised function is a reasonable idea, but only if the "obvious" behaviour is clear. If there's a lot of conflicting possibilities, maybe a recipe in the docs would be a better option.
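To make the edge cases above concrete, here is a small sketch of the "list operations on parts" approach. The helper name `shared_parts` is hypothetical, and PurePosixPath is used so the result does not depend on the platform.

```python
from itertools import takewhile
from pathlib import PurePosixPath


def shared_parts(a, b):
    """Return the leading parts that two pure paths have in common."""
    pairs = takewhile(lambda pair: pair[0] == pair[1], zip(a.parts, b.parts))
    return [left for left, _ in pairs]


# pathlib keeps '..' components rather than collapsing them.
print(PurePosixPath("foo/../bar/baz").parts)

# So these two spellings of (possibly) the same location share nothing.
print(shared_parts(PurePosixPath("bar/baz"), PurePosixPath("foo/../bar/baz")))

# Identical paths share every part, including the file name.
print(shared_parts(PurePosixPath("a/b/file.c"), PurePosixPath("a/b/file.c")))
```

This is purely lexical: it never touches the filesystem, so it cannot tell whether the last shared part is a file or a directory.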

Paul _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/


Paul Moore

11:15 a.m.

On 24 March 2015 at 14:44, Andrew Barnert wrote:

...

From my other reply, looking over the functions used in some of the rosettacode examples, it looks like the generic iterable function, when it exists, and the language makes it feasible, often handles an arbitrary number of arguments, not just two. Which makes sense, now that I think about it. So:

    def common_prefix(*iterables):
        for first, *rest in zip(*iterables):
            if any(first != part for part in rest):
                return
            yield first

Thanks for the research. This is probably something that could be included as a recipe in the itertools documentation. (I doubt it would be viewed as a common enough requirement to be added to the module itself...) Do you have any objection to me submitting a doc patch with this code? (It's similar to what I came up with, but a lot cleaner). Paul

Andrew Barnert

11:32 a.m.

On Mar 24, 2015, at 8:15 AM, Paul Moore wrote:

...

...
On 24 March 2015 at 14:44, Andrew Barnert wrote: From my other reply, looking over the functions used in some of the rosettacode examples, it looks like the generic iterable function, when it exists, and the language makes it feasible, often handles an arbitrary number of arguments, not just two. Which makes sense, now that I think about it. So:

    def common_prefix(*iterables):
        for first, *rest in zip(*iterables):
            if any(first != part for part in rest):
                return
            yield first

Thanks for the research. This is probably something that could be included as a recipe in the itertools documentation. (I doubt it would be viewed as a common enough requirement to be added to the module itself...) Do you have any objection to me submitting a doc patch with this code? (It's similar to what I came up with, but a lot cleaner).

I agree that it belongs as a recipe in itertools (and of course it can go into the third-party more-itertools library on PyPI for people who want to just import and use it, as with the other useful recipes).

But you probably want to test it first, and make sure I thought through the stupid edge cases like no iterables or empty iterables, since that's just off the top of my head, and while tired and recovering from a cold to boot. :)

Also, while I think this generic function is useful on its own merits, it doesn't solve the problem that started this thread--as pointed out by (I forget who), just using it on the string parts gives the wrong answer for Windows paths. Should there also be a recipe in the pathlib docs referencing the itertools recipe? (Is it even reasonable for a recipe in one module's docs to rely on a recipe in another's?) Or maybe, as Tal (I think) suggested, a better solution is to avoid the problem and use parents instead of parts, meaning you don't even need a common_prefix recipe, but a last_initial_match (although that's just last(common_prefix(*args)), so maybe that doesn't really matter.)
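A rough sketch of the two points in that last paragraph: raw string parts mis-compare on Windows paths, while walking the parents (outermost first) and keeping the last match does not. The helper names `ancestors` and `last_common_ancestor` are hypothetical, and the sample paths are invented.

```python
from pathlib import PureWindowsPath


def common_prefix(*iterables):
    """Yield the leading items that compare equal across all iterables."""
    for first, *rest in zip(*iterables):
        if any(first != part for part in rest):
            return
        yield first


def ancestors(path):
    """Return path's parents from the outermost inward, then path itself."""
    return [*reversed(path.parents), path]


def last_common_ancestor(*paths):
    """Return the deepest path shared by all paths, or None if none is."""
    last = None
    for last in common_prefix(*map(ancestors, paths)):
        pass
    return last


a = PureWindowsPath("C:/Users/Foo/x.txt")
b = PureWindowsPath("c:/users/foo/y.txt")

# Plain string parts are compared case-sensitively, so nothing matches.
print(list(common_prefix(a.parts, b.parts)))

# Path objects compare by PureWindowsPath rules, so the directory is found.
print(last_common_ancestor(a, b))
```

Because the items being compared are path objects rather than strings, no re-joining step is needed at the end; the last matching ancestor already is the answer.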

Paul Moore

11:45 a.m.

On 24 March 2015 at 15:32, Andrew Barnert wrote:

...

But you probably want to test it first, and make sure I thought through the stupid edge cases like no iterables or empty iterables, since that's just off the top of my head, and while tired and recovering from a cold to boot. :)

lol Understood. It's still cleaner than my one :-)

...

Also, while I think this generic function is useful on its own merits, it doesn't solve the problem that started this thread--as pointed out by (I forget who), just using it on the string parts gives the wrong answer for Windows paths. Should there also be a recipe in the pathlib docs referencing the itertools recipe? (Is it even reasonable for a recipe in one module's docs to rely on a recipe in another's?) Or maybe, as Tal (I think) suggested, a better solution is to avoid the problem and use parents instead of parts, meaning you don't even need a common_prefix recipe, but a last_initial_match (although that's just last(common_prefix(*args)), so maybe that doesn't really matter.)

I still think there's a case for a pathlib.PurePath.common_prefix() method, if we get consensus on the behaviour, so if we did that, I'd probably grab the itertools recipe and use it as a private function in pathlib. It feels mildly clumsy to have a private implementation of an itertools recipe in a different stdlib module, but I don't think it's a big enough deal to justify pushing for the function to be actually added to itertools (which would involve rewriting in C, for a start). Paul

Serhiy Storchaka

2:01 p.m.

On 24.03.15 13:56, Paul Moore wrote:

...

On 23 March 2015 at 21:33, Gregory P. Smith wrote:

...
+1 pathlib would be the appropriate place for the correctly behaving function to appear.

OK, so here's a question. What actual use cases exist for a common_prefix function? The reason I ask is that I'm looking at some of the edge cases, and the obvious behaviour isn't particularly clear to me.

For example, common_prefix('a/b/file.c', 'a/b/file.c'). The common prefix is obviously 'a/b/file.c' - but I can imagine people *actually* wanting the common *directory* containing both files. But taken literally, that's only possible if you check the filesystem, so it would no longer be a PurePath operation.

And what about common_prefix('foo/bar', '../here/foo')? Or common_prefix('bar/baz', 'foo/../bar/baz')? Pathlib avoids collapsing .. because the meaning could change in the face of symlinks. I believe the same applies here. Maybe you need to call resolve() before doing the common prefix operation (but that gives an absolute path).

With the above limitations, would a common_prefix function actually help typical use cases? In my experience, doing list operations on pathobj.parts is often simple enough that I don't need specialised functions like common_prefix...

Getting the edge cases right is fiddly enough for common_prefix that a specialised function is a reasonable idea, but only if the "obvious" behaviour is clear. If there's a lot of conflicting possibilities, maybe a recipe in the docs would be a better option.

You are asking the same questions that were asked about os.path.commonpath(). You can look at discussions on the tracker, in Python-Dev and in Python-Ideas.

http://bugs.python.org/issue10395
http://mail.python.org/pipermail/python-dev/2000-July/005897.html
http://mail.python.org/pipermail/python-dev/2000-August/008385.html
http://comments.gmane.org/gmane.comp.python.ideas/17719
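For readers coming to this thread later: os.path.commonpath() is available from Python 3.5 onward. The sketch below contrasts it with commonprefix, assuming Python 3.5 or newer; it uses the posixpath module directly so that the behaviour does not depend on the platform running it.

```python
import posixpath

paths = ["/usr/lib", "/usr/local/lib"]

# commonprefix works character by character and ignores separators.
char_prefix = posixpath.commonprefix(paths)   # '/usr/l'

# commonpath works on whole path components.
path_prefix = posixpath.commonpath(paths)     # '/usr'

# Identical paths: the full path, basename included, is returned.
same = posixpath.commonpath(["a/b/file.c", "a/b/file.c"])

# '..' is compared literally, not collapsed, so nothing is shared here.
dotted = posixpath.commonpath(["bar/baz", "foo/../bar/baz"])

print(repr(char_prefix), repr(path_prefix), repr(same), repr(dotted))
```

Like the pathlib-based sketches, commonpath is purely lexical: it does not consult the filesystem or resolve symlinks.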

Serhiy Storchaka

2:06 p.m.

On 21.03.15 06:41, Paddy3118 wrote:

...

I had to add a comment to an RC entry http://rosettacode.org/wiki/Longest_common_prefix#Python that was using os.path.commonprefix to compute the longest common prefix which it does regardless of path separators, (which is what it should do).

It has been discussed before http://bugs.python.org/issue10395 - it goes back to the nineties it seems, but we are still left with a misplaced function, (it should be str.commonprefix), with an awful hack of saying "yes we know it's ballsed up" in the documentation https://docs.python.org/3.5/library/os.path.html?highlight=commonprefix#os.p....

I guess we missed a great opportunity to fix this when we moved to Python 3 too!?

The fix seems clear: deprecate os.path.commonprefix whilst creating a true os.path.commonpath and str.commonprefix. The deprecated function should hang around and the standard libs modified to switch to the new function(s).

I've heard that some religious people put obvious faults in their work as only their god should get so close to perfection - maybe that's why we still have this wart :-)

OK, if there are no objections and comments, I'll commit the recent patch for issue10395 (with only the doc change as Raymond suggested).

Paul Moore

2:14 p.m.

On 24 March 2015 at 18:06, Serhiy Storchaka wrote:

...

OK, if there are no objections and comments, I'll commit the recent patch for issue10395 (with only the doc change as Raymond suggested).

+1 from me. Paul

26 comments

8 participants

participants (8)

Andrew Barnert
Antoine Pitrou
Gregory P. Smith
Paddy3118
Paul Moore
random832＠fastmail.us
Serhiy Storchaka
Tal Einat
