Updated PEP 428 (pathlib)

Hello,

I've updated PEP 428 following the previous discussion. Highlights:

- the operator for combining paths is now `/`
- the method for combining paths is now named `joinpath`
- new as_uri() method to represent a path as a `file` URI

http://www.python.org/dev/peps/pep-0428/

Regards

Antoine.
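
Concretely, the highlighted operations look like this (a quick sketch against the proposed API; note the PEP was still in flux at this point):

```python
from pathlib import PurePosixPath

p = PurePosixPath('/usr') / 'bin' / 'python'         # the `/` operator
q = PurePosixPath('/usr').joinpath('bin', 'python')  # equivalent method form
print(p, p == q, p.as_uri())  # → /usr/bin/python True file:///usr/bin/python
```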

Hello,
Hi,
I've updated PEP 428 following the previous discussion.
I really look forward to PEP 428 acceptance :-) Just a couple remarks:
I find the 'p.basename' name confusing: following POSIX conventions, 'basename' should be 'pathlib.tar.gz'. I don't have another 'name' to propose for the stripped name, though.
match() matches the path against a glob pattern:
I think it could be interesting to add an optional argument to mitigate glob-based DoS. Whether this should be made default is left as an exercise to the reader :-) cf

Hi! On Sun, Mar 03, 2013 at 09:46:16AM +0100, Charles-François Natali <cf.natali@gmail.com> wrote:
Yes, and ntpath.py/posixpath.py follow this terminology.
I don't have another 'name' to propose for the stripped name, though.
ntpath.py/posixpath.py name these parts "root" and "extension". "Root", of course, has its own share of different connotations. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Sun, 3 Mar 2013 09:46:16 +0100 Charles-François Natali <cf.natali@gmail.com> wrote:
Yes. We could call it "root" (http://en.wikipedia.org/wiki/Root_%28linguistics%29) but in this context it would be confusing. Also, it's not exactly the root since as you point there can still be a remaining suffix. There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O). Yes, I think we could add an `allow_recursive` argument. Is there any other DoS issue? Regards Antoine.

Indeed, "root" would be even more confusing.
There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
Also, it doesn't sound familiar (at least to me). How about "rootname", or "stripped_name" (the last one is a little too long)?
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O).
Yes, I meant glob() (fnmatch() implementations can also be subject to DoS through stack exhaustion, but Python's implementation is based on regex).
Yes, I think we could add a `allow_recursive` argument. Is there any other DoS issue?
If by recursive you mean the '**' pattern (cross-directory match), then I'm afraid that's not enough. For example, a pattern like '*/../*/../*/../*/../*' would have the same problem:

"""
$ mkdir -p /tmp/foo/a /tmp/foo/b
$ ~/python/cpython/python -c "from pathlib import *; p = Path('/tmp/foo'); print(list(p.glob('*/../*/../*/../*')))"
[PosixPath('/tmp/foo/a/../a/../a/../a'), PosixPath('/tmp/foo/a/../a/../a/../b'),
 PosixPath('/tmp/foo/a/../a/../b/../a'), PosixPath('/tmp/foo/a/../a/../b/../b'),
 PosixPath('/tmp/foo/a/../b/../a/../a'), PosixPath('/tmp/foo/a/../b/../a/../b'),
 PosixPath('/tmp/foo/a/../b/../b/../a'), PosixPath('/tmp/foo/a/../b/../b/../b'),
 PosixPath('/tmp/foo/b/../a/../a/../a'), PosixPath('/tmp/foo/b/../a/../a/../b'),
 PosixPath('/tmp/foo/b/../a/../b/../a'), PosixPath('/tmp/foo/b/../a/../b/../b'),
 PosixPath('/tmp/foo/b/../b/../a/../a'), PosixPath('/tmp/foo/b/../b/../a/../b'),
 PosixPath('/tmp/foo/b/../b/../b/../a'), PosixPath('/tmp/foo/b/../b/../b/../b')]
"""

cf
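
To make the growth concrete, here is a small sketch using the stdlib glob module (rather than the pathlib branch) on the same layout; every extra '*/../' segment doubles the number of results:

```python
import glob
import os
import tempfile

# Recreate the layout from the transcript above: one directory
# holding two subdirectories, 'a' and 'b'.
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, 'a'))
os.mkdir(os.path.join(base, 'b'))

counts = []
for n in range(1, 5):
    pattern = os.path.join(base, '*/../' * n + '*')
    counts.append(len(glob.glob(pattern)))

print(counts)  # → [4, 8, 16, 32] -- doubles with each '*/../' segment
```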

Hello,

1. Ad:

>>> PurePosixPath('/usr/bin/python').relative('/etc')
Traceback (most recent call last):
  ...
ValueError: ...

Shouldn't this particular operation return "PurePosixPath('/etc/../usr/bin/python')"?

2. 03.03.2013 15:12, Charles-François Natali wrote:
Maybe simply "stripped"? Or "stemname"? Or "unsuffixed"?... Anyway "basename" is IMHO a bad idea because in os.path it is already used for something completely different. Cheers. *j

On 05.03.13 09:23, Antoine Pitrou wrote:
posixpath.relpath('/usr/bin/python', '/etc') returns '../usr/bin/python'. Perhaps pathlib should have an option to provide such compatible behavior. P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.
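
The os.path behaviour being referred to can be checked directly:

```python
import posixpath

# relpath walks up from the start directory ('/etc') before descending,
# hence the leading '..'.
print(posixpath.relpath('/usr/bin/python', '/etc'))  # → ../usr/bin/python
```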

Le Tue, 05 Mar 2013 16:58:57 +0200, Serhiy Storchaka <storchaka@gmail.com> a écrit :
I don't think so, since the behaviour is broken in the first place.
P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.
Not in the "pep428" branch. Regards Antoine.

Le Sun, 3 Mar 2013 15:12:09 +0100, Charles-François Natali <cf.natali@gmail.com> a écrit :
"rootname" is confusing because of filesystem roots, and the second is too long (not to mention it's not obvious what has been stripped). I really prefer "basename" or, if people are hostile, "stem".
Mmmh, I don't know how to guard against that. Perhaps by disallowing ".." in glob patterns? But the problem could still appear with symlinks. To be honest I don't think allowing untrusted users to specify a glob pattern is a very good idea. On the other hand, for the common use cases such as configuration files, the user should be trustable. Regards Antoine.

On Tue, Mar 5, 2013, at 15:29, Andrew Barnert wrote:
Yes, but it requires you to pass in the extension. So what about:

p = '/foo/pathlib.tar.gz'
p.basename() == 'pathlib.tar.gz'
p.basename('.gz') == 'pathlib.tar'
p.basename('.tar.gz') == 'pathlib'
p.basename('.a')   exception? 'pathlib.tar.gz'?
p.basename(True) == 'pathlib.tar'
p.basename(1) == 'pathlib.tar'
p.basename(2) == 'pathlib'

?
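
A plain-function sketch of those proposed semantics (hypothetical throughout: this `basename` is a stand-in written for illustration, not the PEP's method, and it resolves the open '.a' question by leaving the name unchanged, as basename(1) does):

```python
import posixpath

def basename(path, strip=None):
    """Hypothetical basename() supporting the variants listed above."""
    name = posixpath.basename(path)
    if strip is None:
        return name
    if isinstance(strip, str):
        # Strip a named suffix, like basename(1) with a suffix argument;
        # a non-matching suffix (the '.a' case) leaves the name unchanged.
        return name[:-len(strip)] if name.endswith(strip) else name
    # True behaves like 1; an integer strips that many suffixes.
    count = 1 if strip is True else strip
    for _ in range(count):
        name = posixpath.splitext(name)[0]
    return name

p = '/foo/pathlib.tar.gz'
print(basename(p), basename(p, '.gz'), basename(p, '.tar.gz'), basename(p, 2))
# → pathlib.tar.gz pathlib.tar pathlib pathlib
```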

On Mar 6, 2013, at 7:45, random832@fastmail.us wrote:
I was just pointing out that the name already does double duty, and therefore it's a few decades too late for people to complain about confusion, not suggesting that we emulate it. But now that I see your examples, it's not a bad idea. Anyone familiar with unix will immediately understand what basename('.gz') is doing and appreciate the shortcut, and the alternate forms you provided aren't surprising.

The problem is that I think people will still want a stripext/stem/root/whatever function that removes the extension without also removing the dirname. In fact, I've got a script here where one of my coworkers has written $(dirname $z)/$(basename $z .zip) to get around it. And I don't think p.basename(True, False) is a viable answer. Besides, what names would you give the two flags? If there were good names for those, there would be good names for the two separate functions, right?

Personally, I don't have a problem with the stem or stripext suggestions (although to me, the former implies stripping all extensions, the latter just one), but I guess they haven't gotten much traction.

On Wed, Mar 6, 2013, at 13:28, Andrew Barnert wrote:
Well, whatever you do is competing with p[:-len('.gz')] (and in shell, your co-worker could have done ${z%.zip} in ksh/bash/POSIX) - unless you have an example of an OS where the 'extension' component doesn't simply append at the end (well, I suppose they're not case-sensitive on windows - and how much weird long/short filename stuff does pathlib do on windows?), I'm not sure how useful a function for this is.

Although, I was actually half-tempted to suggest p.basename - p.extension and propose subtraction as a general "remove matching suffix" operation on strings (or even without that, it could be defined on a class that .basename returns).

I'm not sure what the use case is for doing this without already knowing the extension, to be honest. This isn't an operation the basename(1) tool supports, either [well, you could do $(basename $x .${x##*.}), but it doesn't support it by itself]

-- Random832
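
The subtraction idea can be sketched as a tiny str subclass (entirely hypothetical; `Name` and its `-` operator are illustrations, not anything in the PEP):

```python
class Name(str):
    """Hypothetical str subclass where `-` strips a matching suffix."""

    def __sub__(self, suffix):
        # Only remove the suffix if it actually matches; otherwise
        # return the name unchanged.
        if suffix and self.endswith(suffix):
            return Name(self[:-len(suffix)])
        return self

print(Name('pathlib.tar.gz') - '.gz')   # → pathlib.tar
print(Name('pathlib.tar.gz') - '.zip')  # → pathlib.tar.gz (no match, unchanged)
```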

From: "random832@fastmail.us" <random832@fastmail.us> Sent: Wednesday, March 6, 2013 11:15 AM
Well, whatever you do is competing with p[:-len('.gz')]
Only the "remove extension by name" is competing with that. The "remove any and all extensions" or "remove 1 extension" or "remove n extensions" cases—which I think are more common—are not. Of course they _are_ competing with various uses of partition/rpartition/split/rsplit. But you could make the same argument for everything in a path library—p.dirname() is competing with p.rpartition(pathsep)[0], and p / 'foo' is competing with p + pathsep + 'foo', and so on.

As I understand it, the reason to have a path API is to provide a "one way to do it" that's obvious, readable, and hard to get wrong. And I think p.stripext('.gz') or p.stripext(2) or p.stripext() are better than p[:-len('.gz')] or p.rsplit('.', 2)[0] or p.partition('.')[0] in that regard. Except for the name, which nobody's come up with a good suggestion for, and the fact that we probably don't want to cram all three versions of the operation into one function, or even necessarily support all three at all.
Funnily enough, according to a comment, this was intended to handle all-caps .ZIP on Windows under cygwin. I don't know if it actually works (or, for that matter, if %.zip would have worked), as I don't have a Windows box with cygwin installed handy. But that's beside the point. The point is that people want a method that removes extensions (whether by name or otherwise) that doesn't also remove dirnames. And if you force them to come up with such a method themselves, they're not necessarily going to come up with a good one. I'm sure if we had your extended basename but no stripext, someone would write p.dirname()/p.basename('.zip').

On 2013-03-06 21:18, Andrew Barnert wrote:
For some reason, p.basename('.zip') feels 'wrong' to me, but p.basename() feels OK. Perhaps it's because it's not immediately obvious what the optional argument is for (or, at least, that's how it seems to me!). On the other hand, p.stripext('.zip') feels OK because it suggests that you're stripping off the '.zip' extension (compare with str.strip), but would that mean that you'd expect p.stripext() also to strip off the extension, whatever it was?

On Mar 6, 2013, at 14:14, MRAB <python@mrabarnett.plus.com> wrote:
Given the earlier responses on this thread, I think it's safe to say that plenty of people, like Paul Moore, would either expect it, or complain about its nonexistence. Anyway, I think getting side-tracked on random's suggestion (which, by the way, is almost entirely my fault) has detracted from the main point. Even if we _did_ have an extended basename that can strip both dirname and extension, with every option you could possibly desire, people would _still_ want a method that strips the extension only. And calling that method basename (or root) is ambiguous and misleading to people who expect that name to mean pulling off directories rather than extensions. My original point was that Unix created this confusion by overloading the meaning of basename decades ago. The question is whether that's sufficient justification to use the name in Python. I don't think it is (again, I'd prefer stem or stripext), but it was at least worth asking.

On Sun, Mar 3, 2013 at 9:12 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
I don't know about stack exhaustion, but Python's regular expression implementation is agonizingly slow in the worst case, and fnmatch inherits this.
fnmatch.fnmatch('a'*50, '*a*'*50) # weird how the pattern/string order is reversed from re.match
That will take about 200 years to complete with CPython. Maybe a little less, if you're running a particularly fast computer. ;) Is that the sort of DoS issue you are looking for? -- Devin
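
A scaled-down, runnable sketch of the effect, using re directly (fnmatch works by translating the glob pattern into a regex of this shape); raising the repeat counts makes the running time explode combinatorially:

```python
import re
import time

# Scaled-down version of the same worst case, fed straight to re.
# The trailing 'b' never matches, so the backtracking engine must try
# every way of splitting the subject among the '.*' groups before
# giving up -- the work grows combinatorially with the repeat counts.
pattern = re.compile('.*a' * 10 + 'b')

start = time.perf_counter()
result = pattern.match('a' * 20)
elapsed = time.perf_counter() - start

print(result, '%.3f s' % elapsed)  # result is None; time blows up as the counts grow
```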

On Wed, Mar 6, 2013 at 2:21 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
Compile the glob pattern to an NFA and simulate the NFA efficiently (using Thompson's algorithm or even backtracking with memoization, rather than plain backtracking). For example, see: http://swtch.com/~rsc/regexp/regexp1.html -- Devin

Well, I'm sure it would be much better, but that would be a rather large piece of code, which would probably belong to a new regex engine, no?
Yes, I know about your regex implementation, but it looks like it's not going in anytime soon. Do you think you could come up with a reasonable - i.e. self-contained - patch for fnmatch?

On Wed, Mar 6, 2013 at 5:03 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
It's a weekend worth of code at most. A simple regex engine is trivial, and glob is even simpler than that. (For example, we can use the glob pattern itself as the NFA during simulation). -- Devin

On Wed, Mar 6, 2013 at 3:20 PM, MRAB <python@mrabarnett.plus.com> wrote:
I wrote an alternative regex implementation (it's on PyPI). It's a lot more resistant to catastrophic backtracking.
How resistant? If it's possible to have catastrophic backtracking even if groups and backreferences aren't involved, then it wouldn't solve the fnmatch DOS problem. -- Devin

On Wed, Mar 6, 2013 at 5:40 PM, MRAB <python@mrabarnett.plus.com> wrote:
If there aren't capture groups, then you can use a DFA.
You can use a variant of DFA even if there are capture groups, it just takes more effort.
The re and regex modules use NFA because of the various other features required.
I take it you mean that both the re and regex modules use backtracking search. I was asking whether or not it can reach the exponential time worst-case on regexps without capture groups. If the answer is no, as you seem to be implying, then how does it prevent a DOS attack? -- Devin

On 2013-03-06 22:49, Devin Jeanpierre wrote:
Yes.
You _can_ have catastrophic backtracking without capture groups. You've already seen an example in ".*a.*". It gets worse when you can have repeated repeats, for example "(?:.*)*". The difference with fnmatch is that you don't care _where_ various parts match (there are no capture groups), only _whether_ it matches, and then only whether or not _all_ of it matches.

On Wed, Mar 6, 2013 at 6:25 PM, MRAB <python@mrabarnett.plus.com> wrote:
We seem to be talking past each other. I already know all this. I am asking you to justify your claim that if glob was based on regex, instead of re, it would be free of DOS attacks. Because of your confusion, I expect you didn't really mean to claim that. I inferred it because when you were asked for an approach that would solve DOS attacks against glob, you replied by saying that you wrote a regex module that is more resistant to such things. I apologize if I misunderstood. -- Devin

On 07/03/2013 03:23, Devin Jeanpierre wrote:
I didn't say that it should be based on regex. What I meant was that it didn't seem that difficult compared to the regex module. That module is more resistant to catastrophic backtracking and some of its tricks could be used for the much simpler fnmatch to make a new implementation of _that_ more resistant to the problem. I'm currently thinking about the details.

On 2013-03-06 17:51, MRAB wrote:
Here's a very simple, all-Python, implementation I've just cooked up:

def fnmatch(name, pattern):
    positions = [(0, 0)]
    while positions:
        name_pos, pattern_pos = positions.pop()
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
        elif pattern[pattern_pos] == '*':
            if pattern_pos == len(pattern) - 1:
                return True
            positions.append((name_pos, pattern_pos + 1))
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos))
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos + 1))
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        positions.append((name_pos + 1, close_pos + 1))
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            positions.append((name_pos + 1, pattern_pos + 1))
    return False

On Wed, Mar 6, 2013 at 9:26 PM, MRAB <python@mrabarnett.plus.com> wrote:
Because positions is never culled of duplicate states, this suffers the exact same problem. In this case, fnmatch('a'*50, '*a*'*50) returns in 6500 years instead of 200. If you want to solve it, you should either affix it with a memoization cache, or use Thompson's algorithm instead of backtracking search. Since this isn't recursive, memoization is a bit annoying, though, so instead I modified it below to use Thompson's algorithm on NFA (with no error checking though):

def fnmatch(name, pattern):
    positions = {0}
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = new_positions
    return len(pattern) in positions

Backseatingly yours,

-- Devin

This looks really promising. Could one of you open an issue on the tracker and attach a patch?

Note that there's a problem with the current implementation:

AssertionError: False is not true : expected 'abc' to match pattern '???*'

On Thu, Mar 7, 2013 at 3:23 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
This looks really promising. Could one of you open an issue on the tracker and attach a patch?
The hard work is in making this a C extension module, which I don't know much about. I could do it this weekend if there's any chance at all it'd be accepted.
Note that there's a problem with a current implementation: AssertionError: False is not true : expected 'abc' to match pattern '???*'
My bad. I always make that mistake. :(

---

def eps_closure(pattern, poses):
    for pos in poses:
        while pos < len(pattern) and pattern[pos] == '*':
            yield pos
            pos += 1
        yield pos

def fnmatch(name, pattern):
    positions = set(eps_closure(pattern, {0}))
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = set(eps_closure(pattern, new_positions))
    return len(pattern) in positions

-- Devin

On 07/03/2013 02:26, MRAB wrote:
[snip]

And here's a new implementation:

def fnmatch(name, pattern):
    saved_name_pos, saved_pattern_pos = 1, -1
    name_pos, pattern_pos = 0, 0
    while True:
        retry = False
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
            retry = True
        elif pattern[pattern_pos] == '*':
            saved_name_pos, saved_pattern_pos = name_pos + 1, pattern_pos
            pattern_pos += 1
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                name_pos += 1
                pattern_pos += 1
            else:
                retry = True
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        name_pos += 1
                        pattern_pos = close_pos + 1
                    else:
                        retry = True
                else:
                    retry = True
            else:
                retry = True
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            name_pos += 1
            pattern_pos += 1
        else:
            retry = True
        if retry:
            if saved_pattern_pos < 0:
                return False
            name_pos, pattern_pos = saved_name_pos, saved_pattern_pos

Hello,
Hi,
I've updated PEP 428 following the previous discussion.
I really look forward to PEP 428 acceptance :-) Just a couple remarks:
I find the 'p.basename' name confusing: following POSIX conventions, 'basename' should be 'pathlib.tar.gz'. I don't have another 'name' to propose for the stripped name, though.
match() matches the path against a glob pattern:
I think it could be interesting to add an optional argument to mitigate glob-based DoS. Whether this should be made default is left as an exercise to the reader :-) cf

Hi! On Sun, Mar 03, 2013 at 09:46:16AM +0100, Charles-Fran?ois Natali <cf.natali@gmail.com> wrote:
Yes, and ntpath.py/posixpath.py follow this terminology.
I don't have another 'name' to propose for the stripped name, though.
ntpath.py/posixpath.py name these parts "root" and "extension". "Root", of course, has its own share of different connotations. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

On Sun, 3 Mar 2013 09:46:16 +0100 Charles-François Natali <cf.natali@gmail.com> wrote:
Yes. We could call it "root" (http://en.wikipedia.org/wiki/Root_%28linguistics%29) but in this context it would be confusing. Also, it's not exactly the root since as you point there can still be a remaining suffix. There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O). Yes, I think we could add a `allow_recursive` argument. Is there any other DoS issue? Regards Antoie.

Indeed, "root" would be even more confusing.
There's "stem", too (http://en.wikipedia.org/wiki/Word_stem). With the same provision about not being the actual stem.
Also, it doesn't sound familiar (at least to me). How about "rootname", or "stripped_name" (the last one is a little too long)?
You mean for glob() (match() is just a regex-like matcher, it doesn't do any I/O).
Yes, I meant glob() (fnmatch() implementations can also be subject to DoS through stack exhaustion, but Python's implementation is based on regex).
Yes, I think we could add a `allow_recursive` argument. Is there any other DoS issue?
If by recursive you mean the '**' pattern (cross-directory match), then I'm afraid that's not enough. For example, a pattern like '*/../*/../*/../*/../*' would have the same problem: """ $ mkdir -p /tmp/foo/a /tmp/foo/b $ ~/python/cpython/python -c "from pathlib import *; p = Path('/tmp/foo'); print(list(p.glob('*/../*/../*/../*')))" [PosixPath('/tmp/foo/a/../a/../a/../a'), PosixPath('/tmp/foo/a/../a/../a/../b'), PosixPath('/tmp/foo/a/../a/../b/../a'), PosixPath('/tmp/foo/a/../a/../b/../b'), PosixPath('/tmp/foo/a/../b/../a/../a'), PosixPath('/tmp/foo/a/../b/../a/../b'), PosixPath('/tmp/foo/a/../b/../b/../a'), PosixPath('/tmp/foo/a/../b/../b/../b'), PosixPath('/tmp/foo/b/../a/../a/../a'), PosixPath('/tmp/foo/b/../a/../a/../b'), PosixPath('/tmp/foo/b/../a/../b/../a'), PosixPath('/tmp/foo/b/../a/../b/../b'), PosixPath('/tmp/foo/b/../b/../a/../a'), PosixPath('/tmp/foo/b/../b/../a/../b'), PosixPath('/tmp/foo/b/../b/../b/../a'), PosixPath('/tmp/foo/b/../b/../b/../b')] """ cf

Hello, 1. Ad: >>> PurePosixPath('/usr/bin/python').relative('/etc') Traceback (most recent call last): ... ValueError: ... Shouldn't this particular operation return "PurePosixPath('/etc/../usr/bin/python')"? 2. 03.03.2013 15:12, Charles-François Natali wrote:
Maybe simply "stripped"? Or "stemname"? Or "unsuffixed"?... Anyway "basename" is IMHO a bad idea because in os.path it is already used for something completely different. Cheers. *j

On 05.03.13 09:23, Antoine Pitrou wrote:
posixpath.relpath('/usr/bin/python', '/etc') returns '../usr/bin/python'. Perhaps pathlib should have an option to provide such compatible behavior. P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.

Le Tue, 05 Mar 2013 16:58:57 +0200, Serhiy Storchaka <storchaka@gmail.com> a écrit :
I don't think so, since the behaviour is broken in the first place.
P.S. Pathlib implementation has relative_to() method. relative() method exists too but looks as unrelated.
Not in the "pep428" branch. Regards Antoine.

Le Sun, 3 Mar 2013 15:12:09 +0100, Charles-François Natali <cf.natali@gmail.com> a écrit :
"rootname" is confusing because of filesystem roots, and the second is too long (not to mention it's not obvious what has been stripped). I really prefer "basename" or, if people are hostile, "stem".
Mmmh, I don't know how to guard against that. Perhaps by disallowing ".." in glob patterns? But the problem could still appear with symlinks. To be honest I don't think allowing untrusted users to specify a glob pattern is a very good idea. On the other hand, for the common use cases such as configuration files, the user should be trustable. Regards Antoine.

On Tue, Mar 5, 2013, at 15:29, Andrew Barnert wrote:
Yes, but it requires you to pass in the extension. So what about p = '/foo/pathlib.tar.gz' p.basename() == 'pathlib.tar.gz' p.basename('.gz') == 'pathlib.tar' p.basename('.tar.gz') == 'pathlib' p.basename('.a') exception? 'pathlib.tar.gz'? p.basename(True) == 'pathlib.tar' p.basename(1) == 'pathlib.tar' p.basename(2) == 'pathlib' ?

On Mar 6, 2013, at 7:45, random832@fastmail.us wrote:
I was just pointing out that the name already does double duty, and therefore it's a few decades too late for people to complain about confusion, not suggesting that we emulate it. But now that I see your examples, it's not a bad idea. Anyone familiar with unix will immediately understand what basename('.gz') is doing and appreciate the shortcut, and the alternate forms you provided aren't surprising. The problem is that I think people will still want a stripext/stem/root/whatever function that removes the extension without also removing the dirname. In fact, I've got a script here where one of my coworkers has written $(dirname $z)/$(basename $z .zip) to get around it. And I don't think p.basename(True, False) is a viable answer. Besides, what names would you give the two flags? If there were good names for those, there would be good names for the two separate functions, right? Personally, I don't have a problem with the stem or stripext suggestions (although to me, the former implies stripping all extensions, the latter just one), but I guess they haven't gotten much traction.

On Wed, Mar 6, 2013, at 13:28, Andrew Barnert wrote:
Well, whatever you do is competing with p[:-len('.gz')] (and in shell, your co-worker could have done ${z%.zip} in ksh/bash/POSIX) - unless you have an example of an OS where the 'extension' component doesn't simply append at the end (well, I suppose they're not case-sensitive on windows - and how much weird long/short filename stuff does pathlib do on windows?), I'm not sure how useful a function for this is. Although, I was actually half-tempted to suggest p.basename - p.extension and propose subtraction as a general "remove matching suffix" operation on strings (or even without that, it could be defined on a class that .basename returns) I'm not sure what the use case is for doing this without already knowing the extension, to be honest. This isn't an operation the basename(1) tool supports, either [well, you could do $(basename $x .${x##*.}), but it doesn't support it by itself] -- Random832

From: "random832@fastmail.us" <random832@fastmail.us> Sent: Wednesday, March 6, 2013 11:15 AM
Well, whatever you do is competing with p[:-len('.gz')]
Only the "remove extension by name" is competing with that. The "remove any and all extensions" or "remove 1 extension" or "remove n extensions" cases—which I think are more common—are not. Of course they _are_ competing with various uses of partition/rpartition/split/rsplit. But you could make the same argument for everything in a path library—p.dirname() is competing with p.rpartition(pathsep)[0], and 'p / 'foo' is competing with p + pathsep + 'foo', and so on. As I understand it, the reason to have a path API is to provide a "one way to do it" that's obvious, readable, and hard to get wrong. And I think p.stripext('.gz') or p.stripext(2) or p.stripext() are better than p[:-len('.gz')] or p.rsplit('.', 2)[0] or p.partition('.')[0] in that regard. Except for the name, which nobody's come up with a good suggestion for, and the fact that we probably don't want to cram all three versions of the operation into one function, or even necessarily support all three at all.
Funnily enough, according to a comment, this was intended to handle all-caps .ZIP on Windows under cygwin. I don't know if it actually works (or, for that matter, if %.zip would have worked), as I don't have a Windows box with cygwin installed handy. But that's beside the point. The point is that people want a method that removes extensions (whether by name or otherwise) that doesn't also remove dirnames. And if you force them to come up with such a method themselves, they're not necessarily going to come up with a good one. I'm sure if we had your extended basename but no stripext, someone would write p.dirname()/p.basename('.zip').

On 2013-03-06 21:18, Andrew Barnert wrote:
For some reason, p.basename('.zip') feels 'wrong' to me, but p.basename() feels OK. Perhaps it's because it's not immediately obvious what the optional argument is for (or, at least, that's how it seems to me!). On the other hand, p.stripext('.zip') feels OK because it suggests that you're stripping off the '.zip' extension (compare with str.strip), but would that mean that you'd expect p.stripext() also to strip off the extension, whatever it was?

On Mar 6, 2013, at 14:14, MRAB <python@mrabarnett.plus.com> wrote:
Given the earlier responses on this thread, I think it's safe to say that plenty of people, like Paul Moore, would either expect it, or complain about its nonexistence. Anyway, I think getting side-tracked on random's suggestion (which, by the way, is almost entirely my fault) has detracted from the main point. Even if we _did_ have an extended basename that can strip both dirname and extension, with every option you could possible desire, people would _still_ want a method that strips the exception only. And calling that method basename (or root) is ambiguous and misleading to people who expect that name to mean pulling off directories rather than extensions. My original point was that Unix created this confusion by overloading the meaning of basename decades ago. The question is whether that's sufficient justification to use the name in python. I don't think it is (again, I'd prefer stem or splitext), but it was at least worth asking.

On Sun, Mar 3, 2013 at 9:12 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
I don't know about stack exhaustion, but Python's regular expression implementation is agonizingly slow in the worst case, and fnmatch inherits this.
fnmatch.fnmatch('a'*50, '*a*'*50) # weird how the pattern/string order is reversed from re.match
That will take about 200 years to complete with CPython. Maybe a little less, if you're running a particularly fast computer. ;) Is that the sort of DoS issue you are looking for? -- Devin

On Wed, Mar 6, 2013 at 2:21 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
Compile the glob pattern to an NFA and simulate the NFA efficiently (using Thompson's algorithm or even backtracking with memoization, rather than plain backtracking). For example, see: http://swtch.com/~rsc/regexp/regexp1.html -- Devin

Well, I'm sure it would be much better, but that would be a rather large piece of code, which would probably belong to a new regex engine, no?
Yes, I know about your regex implementation, but it looks like it's not going in anytime soon. Do you think you could come up with a reasonable - i.e. self-contained - patch for fnmatch?

On Wed, Mar 6, 2013 at 5:03 PM, Charles-François Natali <cf.natali@gmail.com> wrote:
It's a weekend worth of code at most. A simple regex engine is trivial, and glob is even simpler than that. (For example, we can use the glob pattern itself as the NFA during simulation). -- Devin

On Wed, Mar 6, 2013 at 3:20 PM, MRAB <python@mrabarnett.plus.com> wrote:
I wrote an alternative regex implementation (it's on PyPI). It's a lot more resistant to catastrophic backtracking.
How resistant? If it's possible to have catastrophic backtracking even if groups and backreferences aren't involved, then it wouldn't solve the fnmatch DOS problem. -- Devin

On Wed, Mar 6, 2013 at 5:40 PM, MRAB <python@mrabarnett.plus.com> wrote:
If there aren't capture groups, then you can use a DFA.
You can use a variant of DFA even if there are capture groups, it just takes more effort.
The re and regex modules use NFA because of the various other features required.
I take it you mean that both the re and regex modules use backtracking search. I was asking whether or not it can reach the exponential time worst-case on regexps without capture groups. If the answer is no, as you seem to be implying, then how does it prevent a DOS attack? -- Devin

On 2013-03-06 22:49, Devin Jeanpierre wrote:
Yes.
You _can_ have catastrophic backtracking without capture groups. You've already seen an example in ".*a.*". It gets worse when you can have repeated repeats, for example "(?:.*)*". The difference with fnmatch is that you don't care _where_ various parts match (there are no capture groups), only _whether_ it matches, and then only whether or not _all_ of it matches.
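To make the blowup concrete without waiting 200 years, here is a hypothetical step-counting backtracker (an illustration only, not any implementation discussed in this thread; it supports just '*' and literal characters). On a near-miss input, where the pattern demands one more 'a' than the name contains, every way of splitting the name among the stars is tried and fails, so the attempt count explodes with the pattern size even though no capture groups are involved:

```python
def match_steps(name, pattern):
    """Naive backtracking glob matcher that counts match attempts.
    Supports only '*' and literal characters (illustration only)."""
    steps = 0

    def m(i, j):
        nonlocal steps
        steps += 1
        if j == len(pattern):
            return i == len(name)
        if pattern[j] == '*':
            # Try letting '*' consume 0, 1, 2, ... of the remaining characters.
            return any(m(k, j + 1) for k in range(i, len(name) + 1))
        return i < len(name) and name[i] == pattern[j] and m(i + 1, j + 1)

    return m(0, 0), steps

# Near-miss inputs: one more literal 'a' required than the name provides.
for n in (3, 5, 7):
    ok, steps = match_steps('a' * n, '*a*' * (n + 1))
    print(n, ok, steps)
```

The printed step counts grow far faster than the input size, which is the same effect the thread's 'a'*50 / '*a*'*50 example triggers inside re.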

On Wed, Mar 6, 2013 at 6:25 PM, MRAB <python@mrabarnett.plus.com> wrote:
We seem to be talking past each other. I already know all this. I am asking you to justify your claim that if glob was based on regex, instead of re, it would be free of DOS attacks. Because of your confusion, I expect you didn't really mean to claim that. I inferred it because when you were asked for an approach that would solve DOS attacks against glob, you replied by saying that you wrote a regex module that is more resistant to such things. I apologize if I misunderstood. -- Devin

On 07/03/2013 03:23, Devin Jeanpierre wrote:
I didn't say that it should be based on regex. What I meant was that it didn't seem that difficult compared to the regex module. That module is more resistant to catastrophic backtracking and some of its tricks could be used for the much simpler fnmatch to make a new implementation of _that_ more resistant to the problem. I'm currently thinking about the details.

On 2013-03-06 17:51, MRAB wrote:
Here's a very simple, all-Python, implementation I've just cooked up:

def fnmatch(name, pattern):
    positions = [(0, 0)]
    while positions:
        name_pos, pattern_pos = positions.pop()
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
        elif pattern[pattern_pos] == '*':
            if pattern_pos == len(pattern) - 1:
                return True
            positions.append((name_pos, pattern_pos + 1))
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos))
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                positions.append((name_pos + 1, pattern_pos + 1))
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        positions.append((name_pos + 1, close_pos + 1))
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            positions.append((name_pos + 1, pattern_pos + 1))
    return False

On Wed, Mar 6, 2013 at 9:26 PM, MRAB <python@mrabarnett.plus.com> wrote:
Because positions is never culled of duplicate states, this suffers the exact same problem. In this case, fnmatch('a'*50, '*a*'*50) returns in 6500 years instead of 200. If you want to solve it, you should either affix it with a memoization cache, or use Thompson's algorithm instead of backtracking search. Since this isn't recursive, memoization is a bit annoying, though, so instead I modified it below to use Thompson's algorithm on the NFA (with no error checking, though):

def fnmatch(name, pattern):
    positions = {0}
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = new_positions
    return len(pattern) in positions

Backseatingly yours,
-- Devin

This looks really promising. Could one of you open an issue on the tracker and attach a patch? Note that there's a problem with the current implementation:

AssertionError: False is not true : expected 'abc' to match pattern '???*'

On Thu, Mar 7, 2013 at 3:23 AM, Charles-François Natali <cf.natali@gmail.com> wrote:
This looks really promising. Could one of you open an issue on the tracker and attach a patch?
The hard work is in making this a C extension module, which I don't know much about. I could do it this weekend if there's any chance at all it'd be accepted.
Note that there's a problem with a current implementation: AssertionError: False is not true : expected 'abc' to match pattern '???*'
My bad. I always make that mistake. :(

---

def eps_closure(pattern, poses):
    for pos in poses:
        while pos < len(pattern) and pattern[pos] == '*':
            yield pos
            pos += 1
        yield pos

def fnmatch(name, pattern):
    positions = set(eps_closure(pattern, {0}))
    for char in name:
        new_positions = set()
        for pattern_pos in positions:
            if pattern_pos >= len(pattern):
                continue
            pattern_char = pattern[pattern_pos]
            if pattern_char == '*':
                if pattern_pos == len(pattern) - 1:
                    return True
                new_positions.update([pattern_pos, pattern_pos + 1])
            elif pattern_char == '?':
                new_positions.add(pattern_pos + 1)
            elif pattern_char == '[':
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.index(']', pattern_pos)
                if (char in pattern[pattern_pos : close_pos]) != negative:
                    new_positions.add(close_pos + 1)
            elif char == pattern_char:
                new_positions.add(pattern_pos + 1)
        positions = set(eps_closure(pattern, new_positions))
    return len(pattern) in positions

-- Devin

On 07/03/2013 02:26, MRAB wrote:
[snip] And here's a new implementation:

def fnmatch(name, pattern):
    saved_name_pos, saved_pattern_pos = 1, -1
    name_pos, pattern_pos = 0, 0
    while True:
        retry = False
        if pattern_pos >= len(pattern):
            if name_pos >= len(name):
                return True
            retry = True
        elif pattern[pattern_pos] == '*':
            saved_name_pos, saved_pattern_pos = name_pos + 1, pattern_pos
            pattern_pos += 1
        elif pattern[pattern_pos] == '?':
            if name_pos < len(name):
                name_pos += 1
                pattern_pos += 1
            else:
                retry = True
        elif pattern[pattern_pos] == '[':
            if name_pos < len(name):
                negative = pattern[pattern_pos + 1] == "!"
                pattern_pos += 2 if negative else 1
                close_pos = pattern.find(']', pattern_pos)
                if close_pos >= 0:
                    if (name[name_pos] in pattern[pattern_pos : close_pos]) != negative:
                        name_pos += 1
                        pattern_pos = close_pos + 1
                    else:
                        retry = True
                else:
                    retry = True
            else:
                retry = True
        elif name_pos < len(name) and name[name_pos] == pattern[pattern_pos]:
            name_pos += 1
            pattern_pos += 1
        else:
            retry = True
        if retry:
            if saved_pattern_pos < 0:
                return False
            name_pos, pattern_pos = saved_name_pos, saved_pattern_pos
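The memoization route Devin mentions earlier in the thread is also compact. Here is a hypothetical sketch (not one of the implementations posted above; it supports only '*' and '?', omitting '[...]' character classes for brevity): caching the visited (name_pos, pattern_pos) states caps the work at O(len(name) * len(pattern)), so the thread's pathological inputs return instantly:

```python
from functools import lru_cache

def fnmatch_memo(name, pattern):
    """Backtracking glob matcher made polynomial by caching visited states.
    Sketch only: handles '*' and '?' but not '[...]' character classes."""
    @lru_cache(maxsize=None)
    def m(i, j):
        if j == len(pattern):
            return i == len(name)
        c = pattern[j]
        if c == '*':
            # Either '*' matches nothing, or it absorbs one more character.
            return m(i, j + 1) or (i < len(name) and m(i + 1, j))
        if c == '?':
            return i < len(name) and m(i + 1, j + 1)
        return i < len(name) and name[i] == c and m(i + 1, j + 1)

    return m(0, 0)

# The thread's worst case, plus the '???*' case that tripped the first NFA draft:
print(fnmatch_memo('a' * 50, '*a*' * 50))   # True, instantly
print(fnmatch_memo('a' * 50, '*a*' * 51))   # False, instantly
print(fnmatch_memo('abc', '???*'))          # True
```

Without the lru_cache decorator this is exactly the exponential backtracker being discussed; the cache is the entire fix.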
participants (11)
- Andrew Barnert
- Antoine Pitrou
- Charles-François Natali
- Devin Jeanpierre
- Ethan Furman
- Greg Ewing
- Jan Kaliszewski
- MRAB
- Oleg Broytman
- random832@fastmail.us
- Serhiy Storchaka