
On Tue, 15 Feb 2022 at 11:47, Steven D'Aprano <steve@pearwood.info> wrote:
Another 20% will write ".*?X", with scant understanding that it may extend beyond _just_ "the next" X in some cases.
But this surprises me. Do you have an example?
Nongreedy means it'll prefer the next X, but it has to be open to checking others.
>>> re.search("a.*?X[^X]*?Y", "zzzabbbXcccXdddYzzz")
<re.Match object; span=(3, 16), match='abbbXcccXdddY'>
The X between bbb and ccc won't result in a match, so the .*? has to capture more.
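For comparison, here is a small sketch that runs the same string through both spellings. The second pattern, using "[^X]*X" in place of ".*?X", is my own variation for illustration and is not taken from the thread; it shows that the negated class refuses to move past the first X, whereas the non-greedy form backtracks and extends.

```python
import re

text = "zzzabbbXcccXdddYzzz"

# Non-greedy: .*? tries the first X, cannot reach Y without crossing
# another X, and so backtracks and extends past the first X.
lazy = re.search(r"a.*?X[^X]*?Y", text)

# Negated class (illustrative variation): [^X]* can never cross an X,
# so only the first X is ever tried and the overall match fails.
strict = re.search(r"a[^X]*X[^X]*?Y", text)

print(lazy.group())  # abbbXcccXdddY
print(strict)        # None
```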
That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
Doesn't that only work if X is literally a single character?
Yes, but if X is actually "spam", then you can probably do other assertions to guarantee the right match. It gets pretty clunky though.
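One possible reading of "other assertions" is a negative lookahead wrapped around each consumed character. This is only a sketch under that assumption; the pattern below is an illustration, not something proposed in the thread, and it does show how clunky the multi-character case gets.

```python
import re

string = "This is some spam and extra spam."

# (?!spam). consumes one character only if "spam" does not start at that
# position, so the repeated group can never run past the first "spam".
pattern = re.compile(r"(?:(?!spam).)*spam")

match = pattern.search(string)
print(match.span(), match.group())  # (0, 17) This is some spam
```

Note that "." does not match a newline by default, so multi-line input would need re.DOTALL or a different inner expression.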
>>> import re
>>> string = "This is some spam and extra spam."
>>> re.search('[^spam]*spam', string)
<re.Match object; span=(11, 17), match='e spam'>
Whereas this seems to do what I expected:
>>> re.search('.*?spam', string)
<re.Match object; span=(0, 17), match='This is some spam'>
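The difference between the two results comes from "[^spam]" being a character class: it excludes the individual characters s, p, a and m rather than the word "spam". The sketch below illustrates this by reordering the letters inside the class; the reordered pattern is introduced here purely for demonstration.

```python
import re

string = "This is some spam and extra spam."

# Order inside [...] is irrelevant, so these two patterns are equivalent.
by_word = re.search(r"[^spam]*spam", string)
shuffled = re.search(r"[^maps]*spam", string)

# The run before "spam" cannot include the 'm' that ends "some",
# so the leftmost possible match starts at the 'e' that follows it.
print(by_word.group() == shuffled.group())  # True
print(repr(by_word.group()))                # 'e spam'
```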
Yes, and that's fine as long as all you care about is whether it matches. For a simple example like this, it's fine. But this is far from efficient in more complex cases, since it could potentially have to check much deeper into the string.

I'm not familiar with SNOBOL, but one thing I am familiar with is C's scanf (or sscanf etc), which is a parallel to printf, the basis for Python's percent formatting. REXX has a PARSE command; different syntax, similar limitations. Either way, it's a zero-backtracking parser that can do a wide variety of left-to-right scanning, which is a useful limitation for a number of cases, and actually only gets in the way very rarely. Here's an example of how it might work in Python:

    prefix, spam, eggs = sscanf(
        "There's lots of spam and 12 eggs here, that's lots of eggs",
        "%s lots of %s and %d eggs")

The rules are pretty simple: divide up the format into tokens - either a percent marker or a literal string like " lots of " - and match them in sequence. A "%s" can match any string, but only up to the next thing that matches the next token; so the initial "%s" cannot possibly match the "lots of" near the end of the string - the parser won't even consider it. Toss in a few variants like "%[a-z]" which can match any sequence of the characters a-z, "%4s" which must match precisely four characters, and "%*s" which matches without returning the value, and you can do a lot of parsing without ever worrying about exponential parsing time.

The REXX equivalent would be:

    data = "There's lots of spam and 12 eggs here, that's lots of eggs"
    parse var data prefix " lots of " spam " and " eggs " eggs"

but it's been a very very long time since I did any advanced REXX parsing, so I can't remember all the details of what it's capable of.

ChrisA
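The token-by-token rules described above can be sketched in a few lines of Python. To be clear, this sscanf is hypothetical: no such function exists in the standard library, and this version handles only "%s" and "%d" (no "%[a-z]", "%4s", "%*s" or "%%"), rejects two adjacent markers, and ignores any trailing input, all of which are simplifying assumptions made for the illustration.

```python
import re

def sscanf(text, fmt):
    """Hypothetical left-to-right scanner; supports only %s and %d."""
    # Split the format into markers and literal strings, dropping empties.
    tokens = [t for t in re.split(r"(%[sd])", fmt) if t]
    pos = 0
    values = []
    for index, token in enumerate(tokens):
        if token == "%d":
            # Consume a run of ASCII digits starting at the current position.
            end = pos
            while end < len(text) and text[end] in "0123456789":
                end += 1
            if end == pos:
                raise ValueError(f"expected digits at offset {pos}")
            values.append(int(text[pos:end]))
            pos = end
        elif token == "%s":
            following = tokens[index + 1] if index + 1 < len(tokens) else None
            if following is None:
                end = len(text)  # last token: take the rest of the input
            elif following in ("%s", "%d"):
                raise ValueError("adjacent markers are not supported here")
            else:
                # Stop at the FIRST occurrence of the next literal; later
                # occurrences are never considered, so nothing backtracks.
                end = text.find(following, pos)
                if end == -1:
                    raise ValueError(f"literal {following!r} not found")
            values.append(text[pos:end])
            pos = end
        else:
            # A literal must appear exactly at the current position.
            if not text.startswith(token, pos):
                raise ValueError(f"expected {token!r} at offset {pos}")
            pos += len(token)
    return values

prefix, spam, eggs = sscanf(
    "There's lots of spam and 12 eggs here, that's lots of eggs",
    "%s lots of %s and %d eggs")
print(prefix, spam, eggs)  # There's spam 12
```

Each token advances the position at most once and never revisits earlier input, which is why the running time stays linear in the length of the text for a fixed format.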