Updated PEP 428 (pathlib)

Hello,

I've updated PEP 428 following the previous discussion. Highlights:

- the operator for combining paths is now `/`
- the method for combining paths is now named `joinpath`
- new as_uri() method to represent a path as a `file` URI

http://www.python.org/dev/peps/pep-0428/

Regards

Antoine.
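
Concretely, the highlighted operations look like this (a quick sketch against the proposed API; note the PEP was still in flux at this point):

```python
from pathlib import PurePosixPath

p = PurePosixPath('/usr') / 'bin' / 'python'         # the `/` operator
q = PurePosixPath('/usr').joinpath('bin', 'python')  # equivalent method form
print(p, p == q, p.as_uri())  # → /usr/bin/python True file:///usr/bin/python
```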

Hello,
Hi,
I've updated PEP 428 following the previous discussion.
I really look forward to PEP 428 acceptance :-) Just a couple remarks:
I find the 'p.basename' name confusing: following POSIX conventions, 'basename' should be 'pathlib.tar.gz'. I don't have another 'name' to propose for the stripped name, though.
match() matches the path against a glob pattern:
I think it could be interesting to add an optional argument to mitigate glob-based DoS. Whether this should be made default is left as an exercise to the reader :-) cf

Hi! On Sun, Mar 03, 2013 at 09:46:16AM +0100, Charles-François Natali <cf.natali@gmail.com> wrote:
Yes, and ntpath.py/posixpath.py follow this terminology.
I don't have another 'name' to propose for the stripped name, though.
ntpath.py/posixpath.py name these parts "root" and "extension". "Root", of course, has its own share of different connotations. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Sun, 3 Mar 2013 09:46:16 +0100 Charles-François Natali <cf.natali@gmail.com> wrote:
Yes. We could call it "root" (http://en.wikipedia.org/wiki/Root_%28linguistics%29) but in this context it would be confusing. Also, it's not exactly the root since as you point there can still be a remaining suffix. There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O). Yes, I think we could add an `allow_recursive` argument. Is there any other DoS issue? Regards Antoine.

Indeed, "root" would be even more confusing.
There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
Also, it doesn't sound familiar (at least to me). How about "rootname", or "stripped_name" (the last one is a little too long)?
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O).
Yes, I meant glob() (fnmatch() implementations can also be subject to DoS through stack exhaustion, but Python's implementation is based on regex).
Yes, I think we could add a `allow_recursive` argument. Is there any other DoS issue?
If by recursive you mean the '**' pattern (cross-directory match), then I'm afraid that's not enough. For example, a pattern like '*/../*/../*/../*/../*' would have the same problem:

"""
$ mkdir -p /tmp/foo/a /tmp/foo/b
$ ~/python/cpython/python -c "from pathlib import *; p = Path('/tmp/foo'); print(list(p.glob('*/../*/../*/../*')))"
[PosixPath('/tmp/foo/a/../a/../a/../a'), PosixPath('/tmp/foo/a/../a/../a/../b'),
 PosixPath('/tmp/foo/a/../a/../b/../a'), PosixPath('/tmp/foo/a/../a/../b/../b'),
 PosixPath('/tmp/foo/a/../b/../a/../a'), PosixPath('/tmp/foo/a/../b/../a/../b'),
 PosixPath('/tmp/foo/a/../b/../b/../a'), PosixPath('/tmp/foo/a/../b/../b/../b'),
 PosixPath('/tmp/foo/b/../a/../a/../a'), PosixPath('/tmp/foo/b/../a/../a/../b'),
 PosixPath('/tmp/foo/b/../a/../b/../a'), PosixPath('/tmp/foo/b/../a/../b/../b'),
 PosixPath('/tmp/foo/b/../b/../a/../a'), PosixPath('/tmp/foo/b/../b/../a/../b'),
 PosixPath('/tmp/foo/b/../b/../b/../a'), PosixPath('/tmp/foo/b/../b/../b/../b')]
"""

cf
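
To make the growth concrete, here is a small sketch using the stdlib glob module (rather than the pathlib branch) on the same layout; every extra '*/../' segment doubles the number of results:

```python
import glob
import os
import tempfile

# Recreate the layout from the transcript above: one directory
# holding two subdirectories, 'a' and 'b'.
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, 'a'))
os.mkdir(os.path.join(base, 'b'))

counts = []
for n in range(1, 5):
    pattern = os.path.join(base, '*/../' * n + '*')
    counts.append(len(glob.glob(pattern)))

print(counts)  # → [4, 8, 16, 32] -- doubles with each '*/../' segment
```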

Hello,

1. Ad:

>>> PurePosixPath('/usr/bin/python').relative('/etc')
Traceback (most recent call last):
  ...
ValueError: ...

Shouldn't this particular operation return "PurePosixPath('/etc/../usr/bin/python')"?

2. 03.03.2013 15:12, Charles-François Natali wrote:
Maybe simply "stripped"? Or "stemname"? Or "unsuffixed"?... Anyway "basename" is IMHO a bad idea because in os.path it is already used for something completely different. Cheers. *j

On 05.03.13 09:23, Antoine Pitrou wrote:
posixpath.relpath('/usr/bin/python', '/etc') returns '../usr/bin/python'. Perhaps pathlib should have an option to provide such compatible behavior. P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.
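
The os.path behaviour being referred to can be checked directly:

```python
import posixpath

# relpath walks up from the start directory ('/etc') before descending,
# hence the leading '..'.
print(posixpath.relpath('/usr/bin/python', '/etc'))  # → ../usr/bin/python
```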

Le Tue, 05 Mar 2013 16:58:57 +0200, Serhiy Storchaka <storchaka@gmail.com> a écrit :
I don't think so, since the behaviour is broken in the first place.
P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.
Not in the "pep428" branch. Regards Antoine.

Le Sun, 3 Mar 2013 15:12:09 +0100, Charles-François Natali <cf.natali@gmail.com> a écrit :
"rootname" is confusing because of filesystem roots, and the second is too long (not to mention it's not obvious what has been stripped). I really prefer "basename" or, if people are hostile, "stem".
Mmmh, I don't know how to guard against that. Perhaps by disallowing ".." in glob patterns? But the problem could still appear with symlinks. To be honest I don't think allowing untrusted users to specify a glob pattern is a very good idea. On the other hand, for the common use cases such as configuration files, the user should be trustable. Regards Antoine.

On Tue, Mar 5, 2013, at 15:29, Andrew Barnert wrote:
Yes, but it requires you to pass in the extension. So what about:

p = '/foo/pathlib.tar.gz'
p.basename() == 'pathlib.tar.gz'
p.basename('.gz') == 'pathlib.tar'
p.basename('.tar.gz') == 'pathlib'
p.basename('.a')   exception? 'pathlib.tar.gz'?
p.basename(True) == 'pathlib.tar'
p.basename(1) == 'pathlib.tar'
p.basename(2) == 'pathlib'

?
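
A plain-function sketch of those proposed semantics (hypothetical throughout: this `basename` is a stand-in written for illustration, not the PEP's method, and it resolves the open '.a' question by leaving the name unchanged, as basename(1) does):

```python
import posixpath

def basename(path, strip=None):
    """Hypothetical basename() supporting the variants listed above."""
    name = posixpath.basename(path)
    if strip is None:
        return name
    if isinstance(strip, str):
        # Strip a named suffix, like basename(1) with a suffix argument;
        # a non-matching suffix (the '.a' case) leaves the name unchanged.
        return name[:-len(strip)] if name.endswith(strip) else name
    # True behaves like 1; an integer strips that many suffixes.
    count = 1 if strip is True else strip
    for _ in range(count):
        name = posixpath.splitext(name)[0]
    return name

p = '/foo/pathlib.tar.gz'
print(basename(p), basename(p, '.gz'), basename(p, '.tar.gz'), basename(p, 2))
# → pathlib.tar.gz pathlib.tar pathlib pathlib
```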

On Mar 6, 2013, at 7:45, random832@fastmail.us wrote:
I was just pointing out that the name already does double duty, and therefore it's a few decades too late for people to complain about confusion, not suggesting that we emulate it. But now that I see your examples, it's not a bad idea. Anyone familiar with unix will immediately understand what basename('.gz') is doing and appreciate the shortcut, and the alternate forms you provided aren't surprising.

The problem is that I think people will still want a stripext/stem/root/whatever function that removes the extension without also removing the dirname. In fact, I've got a script here where one of my coworkers has written $(dirname $z)/$(basename $z .zip) to get around it. And I don't think p.basename(True, False) is a viable answer. Besides, what names would you give the two flags? If there were good names for those, there would be good names for the two separate functions, right?

Personally, I don't have a problem with the stem or stripext suggestions (although to me, the former implies stripping all extensions, the latter just one), but I guess they haven't gotten much traction.

On Wed, Mar 6, 2013, at 13:28, Andrew Barnert wrote:
Well, whatever you do is competing with p[:-len('.gz')] (and in shell, your co-worker could have done ${z%.zip} in ksh/bash/POSIX) - unless you have an example of an OS where the 'extension' component doesn't simply append at the end (well, I suppose they're not case-sensitive on windows - and how much weird long/short filename stuff does pathlib do on windows?), I'm not sure how useful a function for this is.

Although, I was actually half-tempted to suggest p.basename - p.extension and propose subtraction as a general "remove matching suffix" operation on strings (or even without that, it could be defined on a class that .basename returns).

I'm not sure what the use case is for doing this without already knowing the extension, to be honest. This isn't an operation the basename(1) tool supports, either [well, you could do $(basename $x .${x##*.}), but it doesn't support it by itself]

-- Random832
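
The subtraction idea can be sketched as a tiny str subclass (entirely hypothetical; `Name` and its `-` operator are illustrations, not anything in the PEP):

```python
class Name(str):
    """Hypothetical str subclass where `-` strips a matching suffix."""

    def __sub__(self, suffix):
        # Only remove the suffix if it actually matches; otherwise
        # return the name unchanged.
        if suffix and self.endswith(suffix):
            return Name(self[:-len(suffix)])
        return self

print(Name('pathlib.tar.gz') - '.gz')   # → pathlib.tar
print(Name('pathlib.tar.gz') - '.zip')  # → pathlib.tar.gz (no match, unchanged)
```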

From: "random832@fastmail.us" <random832@fastmail.us> Sent: Wednesday, March 6, 2013 11:15 AM
Well, whatever you do is competing with p[:-len('.gz')]
Only the "remove extension by name" is competing with that. The "remove any and all extensions" or "remove 1 extension" or "remove n extensions" cases—which I think are more common—are not. Of course they _are_ competing with various uses of partition/rpartition/split/rsplit. But you could make the same argument for everything in a path library—p.dirname() is competing with p.rpartition(pathsep)[0], and p / 'foo' is competing with p + pathsep + 'foo', and so on.

As I understand it, the reason to have a path API is to provide a "one way to do it" that's obvious, readable, and hard to get wrong. And I think p.stripext('.gz') or p.stripext(2) or p.stripext() are better than p[:-len('.gz')] or p.rsplit('.', 2)[0] or p.partition('.')[0] in that regard. Except for the name, which nobody's come up with a good suggestion for, and the fact that we probably don't want to cram all three versions of the operation into one function, or even necessarily support all three at all.
Funnily enough, according to a comment, this was intended to handle all-caps .ZIP on Windows under cygwin. I don't know if it actually works (or, for that matter, if %.zip would have worked), as I don't have a Windows box with cygwin installed handy. But that's beside the point. The point is that people want a method that removes extensions (whether by name or otherwise) that doesn't also remove dirnames. And if you force them to come up with such a method themselves, they're not necessarily going to come up with a good one. I'm sure if we had your extended basename but no stripext, someone would write p.dirname()/p.basename('.zip').

On 2013-03-06 21:18, Andrew Barnert wrote:
For some reason, p.basename('.zip') feels 'wrong' to me, but p.basename() feels OK. Perhaps it's because it's not immediately obvious what the optional argument is for (or, at least, that's how it seems to me!). On the other hand, p.stripext('.zip') feels OK because it suggests that you're stripping off the '.zip' extension (compare with str.strip), but would that mean that you'd expect p.stripext() also to strip off the extension, whatever it was?

On Mar 6, 2013, at 14:14, MRAB <python@mrabarnett.plus.com> wrote:
Given the earlier responses on this thread, I think it's safe to say that plenty of people, like Paul Moore, would either expect it, or complain about its nonexistence. Anyway, I think getting side-tracked on random's suggestion (which, by the way, is almost entirely my fault) has detracted from the main point. Even if we _did_ have an extended basename that can strip both dirname and extension, with every option you could possibly desire, people would _still_ want a method that strips the extension only. And calling that method basename (or root) is ambiguous and misleading to people who expect that name to mean pulling off directories rather than extensions. My original point was that Unix created this confusion by overloading the meaning of basename decades ago. The question is whether that's sufficient justification to use the name in Python. I don't think it is (again, I'd prefer stem or stripext), but it was at least worth asking.

On Sun, Mar 3, 2013 at 9:12 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
I don't know about stack exhaustion, but Python's regular expression implementation is agonizingly slow in the worst case, and fnmatch inherits this.
fnmatch.fnmatch('a'*50, '*a*'*50) # weird how the pattern/string order is reversed from re.match
That will take about 200 years to complete with CPython. Maybe a little less, if you're running a particularly fast computer. ;) Is that the sort of DoS issue you are looking for? -- Devin
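
A scaled-down, runnable sketch of the effect, using re directly (fnmatch works by translating the glob pattern into a regex of this shape); raising the repeat counts makes the running time explode combinatorially:

```python
import re
import time

# Scaled-down version of the same worst case, fed straight to re.
# The trailing 'b' never matches, so the backtracking engine must try
# every way of splitting the subject among the '.*' groups before
# giving up -- the work grows combinatorially with the repeat counts.
pattern = re.compile('.*a' * 10 + 'b')

start = time.perf_counter()
result = pattern.match('a' * 20)
elapsed = time.perf_counter() - start

print(result, '%.3f s' % elapsed)  # result is None; time blows up as the counts grow
```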

On Wed, Mar 6, 2013 at 2:21 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
Compile the glob pattern to an NFA and simulate the NFA efficiently (using Thompson's algorithm or even backtracking with memoization, rather than plain backtracking). For example, see: http://swtch.com/~rsc/regexp/regexp1.html -- Devin

Well, I'm sure it would be much better, but that would be a rather large piece of code, which would probably belong to a new regex engine, no?
Yes, I know about your regex implementation, but it looks like it's not going in anytime soon. Do you think you could come up with a reasonable - i.e. self-contained - patch for fnmatch?

On Wed, Mar 6, 2013 at 5:03 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
It's a weekend worth of code at most. A simple regex engine is trivial, and glob is even simpler than that. (For example, we can use the glob pattern itself as the NFA during simulation). -- Devin

On Wed, Mar 6, 2013 at 3:20 PM, MRAB <python@mrabarnett.plus.com> wrote:
I wrote an alternative regex implementation (it's on PyPI). It's a lot more resistant to catastrophic backtracking.
How resistant? If it's possible to have catastrophic backtracking even if groups and backreferences aren't involved, then it wouldn't solve the fnmatch DOS problem. -- Devin

On Wed, Mar 6, 2013 at 5:40 PM, MRAB <python@mrabarnett.plus.com> wrote:
If there aren't capture groups, then you can use a DFA.
You can use a variant of DFA even if there are capture groups, it just takes more effort.
The re and regex modules use NFA because of the various other features required.
I take it you mean that both the re and regex modules use backtracking search. I was asking whether or not it can reach the exponential time worst-case on regexps without capture groups. If the answer is no, as you seem to be implying, then how does it prevent a DOS attack? -- Devin

On 2013-03-06 22:49, Devin Jeanpierre wrote:
Yes.
You _can_ have catastrophic backtracking without capture groups. You've already seen an example in ".*a.*". It gets worse when you can have repeated repeats, for example "(?:.*)*". The difference with fnmatch is that you don't care _where_ various parts match (there are no capture groups), only _whether_ it matches, and then only whether or not _all_ of it matches.

On Wed, Mar 6, 2013 at 6:25 PM, MRAB <python@mrabarnett.plus.com> wrote:
We seem to be talking past each other. I already know all this. I am asking you to justify your claim that if glob was based on regex, instead of re, it would be free of DOS attacks. Because of your confusion, I expect you didn't really mean to claim that. I inferred it because when you were asked for an approach that would solve DOS attacks against glob, you replied by saying that you wrote a regex module that is more resistant to such things. I apologize if I misunderstood. -- Devin

On 07/03/2013 03:23, Devin Jeanpierre wrote:
I didn't say that it should be based on regex. What I meant was that it didn't seem that difficult compared to the regex module. That module is more resistant to catastrophic backtracking and some of its tricks could be used for the much simpler fnmatch to make a new implementation of _that_ more resistant to the problem. I'm currently thinking about the details.

On 2013-03-06 17:51, MRAB wrote:
Here's a very simple, all-Python, implementation I've just cooked up:

def fnmatch(name, pattern):
    positions = [(0, 0)]
    while positions:
        name_pos, pattern_pos = positions.pop()
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
        elif pattern[pattern_pos] == '*':
            if pattern_pos == len(pattern) - 1:
                return True
            positions.append((name_pos, pattern_pos + 1))
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos))
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos + 1))
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        positions.append((name_pos + 1, close_pos + 1))
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            positions.append((name_pos + 1, pattern_pos + 1))
    return False

On Wed, Mar 6, 2013 at 9:26 PM, MRAB <python@mrabarnett.plus.com> wrote:
Because positions is never culled of duplicate states, this suffers the exact same problem. In this case, fnmatch('a'*50, '*a*'*50) returns in 6500 years instead of 200. If you want to solve it, you should either affix it with a memoization cache, or use Thompson's algorithm instead of backtracking search. Since this isn't recursive, memoization is a bit annoying, though, so instead I modified it below to use Thompson's algorithm on NFA (with no error checking though):

def fnmatch(name, pattern):
    positions = {0}
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = new_positions
    return len(pattern) in positions

Backseatingly yours,

-- Devin

This looks really promising. Could one of you open an issue on the tracker and attach a patch?

Note that there's a problem with the current implementation:

AssertionError: False is not true : expected 'abc' to match pattern '???*'

On Thu, Mar 7, 2013 at 3:23 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
This looks really promising. Could one of you open an issue on the tracker and attach a patch?
The hard work is in making this a C extension module, which I don't know much about. I could do it this weekend if there's any chance at all it'd be accepted.
Note that there's a problem with a current implementation: AssertionError: False is not true : expected 'abc' to match pattern '???*'
My bad. I always make that mistake. :(

---

def eps_closure(pattern, poses):
    for pos in poses:
        while pos < len(pattern) and pattern[pos] == '*':
            yield pos
            pos += 1
        yield pos

def fnmatch(name, pattern):
    positions = set(eps_closure(pattern, {0}))
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = set(eps_closure(pattern, new_positions))
    return len(pattern) in positions

-- Devin

On 07/03/2013 02:26, MRAB wrote:
[snip]

And here's a new implementation:

def fnmatch(name, pattern):
    saved_name_pos, saved_pattern_pos = 1, -1
    name_pos, pattern_pos = 0, 0
    while True:
        retry = False
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
            retry = True
        elif pattern[pattern_pos] == '*':
            saved_name_pos, saved_pattern_pos = name_pos + 1, pattern_pos
            pattern_pos += 1
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                name_pos += 1
                pattern_pos += 1
            else:
                retry = True
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        name_pos += 1
                        pattern_pos = close_pos + 1
                    else:
                        retry = True
                else:
                    retry = True
            else:
                retry = True
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            name_pos += 1
            pattern_pos += 1
        else:
            retry = True
        if retry:
            if saved_pattern_pos < 0:
                return False
            name_pos, pattern_pos = saved_name_pos, saved_pattern_pos

Hello,
Hi,
I've updated PEP 428 following the previous discussion.
I really look forward to PEP 428 acceptance :-) Just a couple remarks:
I find the 'p.basename' name confusing: following POSIX conventions, 'basename' should be 'pathlib.tar.gz'. I don't have another 'name' to propose for the stripped name, though.
match() matches the path against a glob pattern:
I think it could be interesting to add an optional argument to mitigate glob-based DoS. Whether this should be made default is left as an exercise to the reader :-) cf

Hi! On Sun, Mar 03, 2013 at 09:46:16AM +0100, Charles-Fran?ois Natali <cf.natali@gmail.com> wrote:
Yes, and ntpath.py/posixpath.py follow this terminology.
I don't have another 'name' to propose for the stripped name, though.
ntpath.py/posixpath.py name these parts "root" and "extension". "Root", of course, has its own share of different connotations. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Sun, 3 Mar 2013 09:46:16 +0100 Charles-François Natali <cf.natali@gmail.com> wrote:
Yes. We could call it "root" (http://en.wikipedia.org/wiki/Root_%28linguistics%29) but in this context it would be confusing. Also, it's not exactly the root since as you point there can still be a remaining suffix. There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O). Yes, I think we could add a `allow_recursive` argument. Is there any other DoS issue? Regards Antoie.

Indeed, "root" would be even more confusing.
There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
Also, it doesn't sound familiar (at least to me). How about "rootname", or "stripped_name" (the last one is a little too long)?
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O).
Yes, I meant glob() (fnmatch() implementations can also be subject to DoS through stack exhaustion, but Python's implementation is based on regex).
Yes, I think we could add a `allow_recursive` argument. Is there any other DoS issue?
If by recursive you mean the '**' pattern (cross-directory match), then I'm afraid that's not enough. For example, a pattern like '*/../*/../*/../*/../*' would have the same problem: """ $ mkdir -p /tmp/foo/a /tmp/foo/b $ ~/python/cpython/python -c "from pathlib import *; p = Path('/tmp/foo'); print(list(p.glob('*/../*/../*/../*')))" [PosixPath('/tmp/foo/a/../a/../a/../a'), PosixPath('/tmp/foo/a/../a/../a/../b'), PosixPath('/tmp/foo/a/../a/../b/../a'), PosixPath('/tmp/foo/a/../a/../b/../b'), PosixPath('/tmp/foo/a/../b/../a/../a'), PosixPath('/tmp/foo/a/../b/../a/../b'), PosixPath('/tmp/foo/a/../b/../b/../a'), PosixPath('/tmp/foo/a/../b/../b/../b'), PosixPath('/tmp/foo/b/../a/../a/../a'), PosixPath('/tmp/foo/b/../a/../a/../b'), PosixPath('/tmp/foo/b/../a/../b/../a'), PosixPath('/tmp/foo/b/../a/../b/../b'), PosixPath('/tmp/foo/b/../b/../a/../a'), PosixPath('/tmp/foo/b/../b/../a/../b'), PosixPath('/tmp/foo/b/../b/../b/../a'), PosixPath('/tmp/foo/b/../b/../b/../b')] """ cf

Hello, 1. Ad: >>> PurePosixPath('/usr/bin/python').relative('/etc') Traceback (most recent call last): ... ValueError: ... Shouldn't this particular operation return "PurePosixPath('/etc/../usr/bin/python')"? 2. 03.03.2013 15:12, Charles-François Natali wrote:
Maybe simply "stripped"? Or "stemname"? Or "unsuffixed"?... Anyway "basename" is IMHO a bad idea because in os.path it is already used for something completely different. Cheers. *j

On 05.03.13 09:23, Antoine Pitrou wrote:
posixpath.relpath('/usr/bin/python', '/etc') returns '../usr/bin/python'. Perhaps pathlib should have an option to provide such compatible behavior. P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.

Le Tue, 05 Mar 2013 16:58:57 +0200, Serhiy Storchaka <storchaka@gmail.com> a écrit :
I don't think so, since the behaviour is broken in the first place.
P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.
Not in the "pep428" branch. Regards Antoine.

Le Sun, 3 Mar 2013 15:12:09 +0100, Charles-François Natali <cf.natali@gmail.com> a écrit :
"rootname" is confusing because of filesystem roots, and the second is too long (not to mention it's not obvious what has been stripped). I really prefer "basename" or, if people are hostile, "stem".
Mmmh, I don't know how to guard against that. Perhaps by disallowing ".." in glob patterns? But the problem could still appear with symlinks. To be honest I don't think allowing untrusted users to specify a glob pattern is a very good idea. On the other hand, for the common use cases such as configuration files, the user should be trustable. Regards Antoine.

On Tue, Mar 5, 2013, at 15:29, Andrew Barnert wrote:
Yes, but it requires you to pass in the extension. So what about p = '/foo/pathlib.tar.gz' p.basename() == 'pathlib.tar.gz' p.basename('.gz') == 'pathlib.tar' p.basename('.tar.gz') == 'pathlib' p.basename('.a') exception? 'pathlib.tar.gz'? p.basename(True) == 'pathlib.tar' p.basename(1) == 'pathlib.tar' p.basename(2) == 'pathlib' ?

On Mar 6, 2013, at 7:45, random832@fastmail.us wrote:
I was just pointing out that the name already does double duty, and therefore it's a few decades too late for people to complain about confusion, not suggesting that we emulate it. But now that I see your examples, it's not a bad idea. Anyone familiar with unix will immediately understand what basename('.gz') is doing and appreciate the shortcut, and the alternate forms you provided aren't surprising. The problem is that I think people will still want a stripext/stem/root/whatever function that removes the extension without also removing the dirname. In fact, I've got a script here where one of my coworkers has written $(dirname $z)/$(basename $z .zip) to get around it. And I don't think p.basename(True, False) is a viable answer. Besides, what names would you give the two flags? If there were good names for those, there would be good names for the two separate functions, right? Personally, I don't have a problem with the stem or stripext suggestions (although to me, the former implies stripping all extensions, the latter just one), but I guess they haven't gotten much traction.

On Wed, Mar 6, 2013, at 13:28, Andrew Barnert wrote:
Well, whatever you do is competing with p[:-len('.gz')] (and in shell, your co-worker could have done ${z%.zip} in ksh/bash/POSIX) - unless you have an example of an OS where the 'extension' component doesn't simply append at the end (well, I suppose they're not case-sensitive on windows - and how much weird long/short filename stuff does pathlib do on windows?), I'm not sure how useful a function for this is. Although, I was actually half-tempted to suggest p.basename - p.extension and propose subtraction as a general "remove matching suffix" operation on strings (or even without that, it could be defined on a class that .basename returns) I'm not sure what the use case is for doing this without already knowing the extension, to be honest. This isn't an operation the basename(1) tool supports, either [well, you could do $(basename $x .${x##*.}), but it doesn't support it by itself] -- Random832

From: "random832@fastmail.us" <random832@fastmail.us> Sent: Wednesday, March 6, 2013 11:15 AM
Well, whatever you do is competing with p[:-len('.gz')]
Only the "remove extension by name" is competing with that. The "remove any and all extensions" or "remove 1 extension" or "remove n extensions" cases—which I think are more common—are not. Of course they _are_ competing with various uses of partition/rpartition/split/rsplit. But you could make the same argument for everything in a path library—p.dirname() is competing with p.rpartition(pathsep)[0], and 'p / 'foo' is competing with p + pathsep + 'foo', and so on. As I understand it, the reason to have a path API is to provide a "one way to do it" that's obvious, readable, and hard to get wrong. And I think p.stripext('.gz') or p.stripext(2) or p.stripext() are better than p[:-len('.gz')] or p.rsplit('.', 2)[0] or p.partition('.')[0] in that regard. Except for the name, which nobody's come up with a good suggestion for, and the fact that we probably don't want to cram all three versions of the operation into one function, or even necessarily support all three at all.
Funnily enough, according to a comment, this was intended to handle all-caps .ZIP on Windows under cygwin. I don't know if it actually works (or, for that matter, if %.zip would have worked), as I don't have a Windows box with cygwin installed handy. But that's beside the point. The point is that people want a method that removes extensions (whether by name or otherwise) that doesn't also remove dirnames. And if you force them to come up with such a method themselves, they're not necessarily going to come up with a good one. I'm sure if we had your extended basename but no stripext, someone would write p.dirname()/p.basename('.zip').

On 2013-03-06 21:18, Andrew Barnert wrote:
For some reason, p.basename('.zip') feels 'wrong' to me, but p.basename() feels OK. Perhaps it's because it's not immediately obvious what the optional argument is for (or, at least, that's how it seems to me!). On the other hand, p.stripext('.zip') feels OK because it suggests that you're stripping off the '.zip' extension (compare with str.strip), but would that mean that you'd expect p.stripext() also to strip off the extension, whatever it was?

On Mar 6, 2013, at 14:14, MRAB <python@mrabarnett.plus.com> wrote:
Given the earlier responses on this thread, I think it's safe to say that plenty of people, like Paul Moore, would either expect it, or complain about its nonexistence. Anyway, I think getting side-tracked on random's suggestion (which, by the way, is almost entirely my fault) has detracted from the main point. Even if we _did_ have an extended basename that can strip both dirname and extension, with every option you could possible desire, people would _still_ want a method that strips the exception only. And calling that method basename (or root) is ambiguous and misleading to people who expect that name to mean pulling off directories rather than extensions. My original point was that Unix created this confusion by overloading the meaning of basename decades ago. The question is whether that's sufficient justification to use the name in python. I don't think it is (again, I'd prefer stem or splitext), but it was at least worth asking.

On Sun, Mar 3, 2013 at 9:12 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
I don't know about stack exhaustion, but Python's regular expression implementation is agonizingly slow in the worst case, and fnmatch inherits this.
fnmatch.fnmatch('a'*50, '*a*'*50) # weird how the pattern/string order is reversed from re.match
That will take about 200 years to complete with CPython. Maybe a little less, if you're running a particularly fast computer. ;) Is that the sort of DoS issue you are looking for? -- Devin

On Wed, Mar 6, 2013 at 2:21 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
Compile the glob pattern to an NFA and simulate the NFA efficiently (using Thompson's algorithm or even backtracking with memoization, rather than plain backtracking). For example, see: http://swtch.com/~rsc/regexp/regexp1.html -- Devin

Well, I'm sure it would be much better, but that would be a rather large piece of code, which would probably belong to a new regex engine, no?
Yes, I know about your regex implementation, but it looks like it's not going in anytime soon. Do you think you could come up with a reasonable - i.e. self-contained - patch for fnmatch?

On Wed, Mar 6, 2013 at 5:03 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
It's a weekend worth of code at most. A simple regex engine is trivial, and glob is even simpler than that. (For example, we can use the glob pattern itself as the NFA during simulation). -- Devin

On Wed, Mar 6, 2013 at 3:20 PM, MRAB <python@mrabarnett.plus.com> wrote:
I wrote an alternative regex implementation (it's on PyPI). It's a lot more resistant to catastrophic backtracking.
How resistant? If it's possible to have catastrophic backtracking even if groups and backreferences aren't involved, then it wouldn't solve the fnmatch DOS problem. -- Devin

On Wed, Mar 6, 2013 at 5:40 PM, MRAB <python@mrabarnett.plus.com> wrote:
If there aren't capture groups, then you can use a DFA.
You can use a variant of DFA even if there are capture groups, it just takes more effort.
The re and regex modules use NFA because of the various other features required.
I take it you mean that both the re and regex modules use backtracking search. I was asking whether or not it can reach the exponential time worst-case on regexps without capture groups. If the answer is no, as you seem to be implying, then how does it prevent a DOS attack? -- Devin

On 2013-03-06 22:49, Devin Jeanpierre wrote:
Yes.
You _can_ have catastrophic backtracking without capture groups. You've already seen an example in ".*a.*". It gets worse when you can have repeated repeats, for example "(?:.*)*". The difference with fnmatch is that you don't care _where_ various parts match (there are no capture groups), only _whether_ it matches, and then only whether or not _all_ of it matches.
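To make the blowup concrete without waiting 200 years, here is a hypothetical step-counting backtracker (an illustration only, not any implementation discussed in this thread; it supports just '*' and literal characters). On a near-miss input, where the pattern demands one more 'a' than the name contains, every way of splitting the name among the stars is tried and fails, so the attempt count explodes with the pattern size even though no capture groups are involved:

```python
def match_steps(name, pattern):
    """Naive backtracking glob matcher that counts match attempts.
    Supports only '*' and literal characters (illustration only)."""
    steps = 0

    def m(i, j):
        nonlocal steps
        steps += 1
        if j == len(pattern):
            return i == len(name)
        if pattern[j] == '*':
            # Try letting '*' consume 0, 1, 2, ... of the remaining characters.
            return any(m(k, j + 1) for k in range(i, len(name) + 1))
        return i < len(name) and name[i] == pattern[j] and m(i + 1, j + 1)

    return m(0, 0), steps

# Near-miss inputs: one more literal 'a' required than the name provides.
for n in (3, 5, 7):
    ok, steps = match_steps('a' * n, '*a*' * (n + 1))
    print(n, ok, steps)
```

The printed step counts grow far faster than the input size, which is the same effect the thread's 'a'*50 / '*a*'*50 example triggers inside re.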

On Wed, Mar 6, 2013 at 6:25 PM, MRAB <python@mrabarnett.plus.com> wrote:
We seem to be talking past each other. I already know all this. I am asking you to justify your claim that if glob was based on regex, instead of re, it would be free of DOS attacks. Because of your confusion, I expect you didn't really mean to claim that. I inferred it because when you were asked for an approach that would solve DOS attacks against glob, you replied by saying that you wrote a regex module that is more resistant to such things. I apologize if I misunderstood. -- Devin

On 07/03/2013 03:23, Devin Jeanpierre wrote:
I didn't say that it should be based on regex. What I meant was that it didn't seem that difficult compared to the regex module. That module is more resistant to catastrophic backtracking and some of its tricks could be used for the much simpler fnmatch to make a new implementation of _that_ more resistant to the problem. I'm currently thinking about the details.

On 2013-03-06 17:51, MRAB wrote:
Here's a very simple, all-Python, implementation I've just cooked up:

def fnmatch(name, pattern):
    positions = [(0, 0)]
    while positions:
        name_pos, pattern_pos = positions.pop()
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
        elif pattern[pattern_pos] == '*':
            if pattern_pos == len(pattern) - 1:
                return True
            positions.append((name_pos, pattern_pos + 1))
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos))
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos + 1))
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        positions.append((name_pos + 1, close_pos + 1))
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            positions.append((name_pos + 1, pattern_pos + 1))
    return False

On Wed, Mar 6, 2013 at 9:26 PM, MRAB <python@mrabarnett.plus.com> wrote:
Because positions is never culled of duplicate states, this suffers the exact same problem. In this case, fnmatch('a'*50, '*a*'*50) returns in 6500 years instead of 200. If you want to solve it, you should either affix it with a memoization cache, or use Thompson's algorithm instead of backtracking search. Since this isn't recursive, memoization is a bit annoying, though, so instead I modified it below to use Thompson's algorithm on the NFA (with no error checking, though):

def fnmatch(name, pattern):
    positions = {0}
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = new_positions
    return len(pattern) in positions

Backseatingly yours,
-- Devin

This looks really promising. Could one of you open an issue on the tracker and attach a patch? Note that there's a problem with the current implementation:

AssertionError: False is not true : expected 'abc' to match pattern '???*'

On Thu, Mar 7, 2013 at 3:23 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
This looks really promising. Could one of you open an issue on the tracker and attach a patch?
The hard work is in making this a C extension module, which I don't know much about. I could do it this weekend if there's any chance at all it'd be accepted.
Note that there's a problem with a current implementation: AssertionError: False is not true : expected 'abc' to match pattern '???*'
My bad. I always make that mistake. :(

---

def eps_closure(pattern, poses):
    for pos in poses:
        while pos < len(pattern) and pattern[pos] == '*':
            yield pos
            pos += 1
        yield pos

def fnmatch(name, pattern):
    positions = set(eps_closure(pattern, {0}))
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = set(eps_closure(pattern, new_positions))
    return len(pattern) in positions

-- Devin

On 07/03/2013 02:26, MRAB wrote:
[snip] And here's a new implementation:

def fnmatch(name, pattern):
    saved_name_pos, saved_pattern_pos = 1, -1
    name_pos, pattern_pos = 0, 0
    while True:
        retry = False
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
            retry = True
        elif pattern[pattern_pos] == '*':
            saved_name_pos, saved_pattern_pos = name_pos + 1, pattern_pos
            pattern_pos += 1
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                name_pos += 1
                pattern_pos += 1
            else:
                retry = True
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        name_pos += 1
                        pattern_pos = close_pos + 1
                    else:
                        retry = True
                else:
                    retry = True
            else:
                retry = True
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            name_pos += 1
            pattern_pos += 1
        else:
            retry = True
        if retry:
            if saved_pattern_pos < 0:
                return False
            name_pos, pattern_pos = saved_name_pos, saved_pattern_pos
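The memoization route Devin mentions earlier in the thread is also compact. Here is a hypothetical sketch (not one of the implementations posted above; it supports only '*' and '?', omitting '[...]' character classes for brevity): caching the visited (name_pos, pattern_pos) states caps the work at O(len(name) * len(pattern)), so the thread's pathological inputs return instantly:

```python
from functools import lru_cache

def fnmatch_memo(name, pattern):
    """Backtracking glob matcher made polynomial by caching visited states.
    Sketch only: handles '*' and '?' but not '[...]' character classes."""
    @lru_cache(maxsize=None)
    def m(i, j):
        if j == len(pattern):
            return i == len(name)
        c = pattern[j]
        if c == '*':
            # Either '*' matches nothing, or it absorbs one more character.
            return m(i, j + 1) or (i < len(name) and m(i + 1, j))
        if c == '?':
            return i < len(name) and m(i + 1, j + 1)
        return i < len(name) and name[i] == c and m(i + 1, j + 1)

    return m(0, 0)

# The thread's worst case, plus the '???*' case that tripped the first NFA draft:
print(fnmatch_memo('a' * 50, '*a*' * 50))   # True, instantly
print(fnmatch_memo('a' * 50, '*a*' * 51))   # False, instantly
print(fnmatch_memo('abc', '???*'))          # True
```

Without the lru_cache decorator this is exactly the exponential backtracker being discussed; the cache is the entire fix.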
participants (11)
- Andrew Barnert
- Antoine Pitrou
- Charles-François Natali
- Devin Jeanpierre
- Ethan Furman
- Greg Ewing
- Jan Kaliszewski
- MRAB
- Oleg Broytman
- random832@fastmail.us
- Serhiy Storchaka