
[Steven D'Aprano <steve@pearwood.info>]
After this thread, I no longer trust that "easy" regexes will do what they "obviously" look like they should do :-(
I'm not trying to be funny or snarky. I *thought* I had a reasonable understanding of regexes, and now I have learned that I don't, and that the regexes I've been writing don't do what I thought they did, and presumably the only reason they haven't blown up in my face (either performance-wise, or the wrong output) is blind luck.
Reading Friedl's book is a cure for the confusion, but not for the angst ;-)

I believe the single most practical addition in recent decades has been the introduction of "possessive quantifiers". This is a variant of the "greedy" quantifiers that does what most people at the start _believe_ greedy quantifiers do: one-and-done. After its initial match, backtracking into it fails. So, e.g., \s++ matches the longest string of whitespace available at that point, period.

Why "++"? Regexps ;-) It's essentially gibberish syntax that previously didn't have a sensible meaning. For example,
    >>> regex.search("^x+[a-z]{4}k", "xxxxxk")
    <regex.Match object; span=(0, 6), match='xxxxxk'>
is what we're used to if we're paying attention: sucking up as many x's as possible fails to match (there's nothing for [a-z]{4} to match except the trailing "k"). But we keep backtracking into it, trying to match one less "x" at a time, until [a-z]{4} finally matches the rightmost 4 x's. But make it possessive and the match as a whole fails right away:
    >>> regex.search("^x++[a-z]{4}k", "xxxxxk")
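For anyone who wants to run the two searches side by side, here is a minimal sketch. It assumes the third-party regex module is installed; the fallback to the stdlib re is my own assumption and only works on Python 3.11 or later, where re also accepts the "++" syntax. The name "engine" is just a local alias for illustration.

```python
# Sketch: greedy vs. possessive quantifier on the same input.
try:
    import regex as engine  # third-party module: pip install regex
except ImportError:
    import re as engine  # assumption: Python 3.11+, where re accepts x++

text = "xxxxxk"

# Greedy: x+ first grabs all five x's, then gives them back one at a
# time until [a-z]{4} can match the rightmost four x's.
greedy = engine.search(r"^x+[a-z]{4}k", text)

# Possessive: x++ grabs all five x's and never gives any back, so
# [a-z]{4} has only "k" left to look at and the whole match fails.
possessive = engine.search(r"^x++[a-z]{4}k", text)

print(greedy)
print(possessive)
```

The second search printing None is the whole point: there is no match object because the failure is immediate.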
++ refuses to give back any of what it matched the first time.

At this point there are probably more regexp engines that support this feature than don't. Python's re does not, but the regex extension does. Cutting unwanted chances for backtracking greatly cuts the chance of stumbling into timing disasters.

Where does that leave Python? Pretty much aging itself into obsolescence. Regexps keep "evolving", it appears Fredrik lost interest in keeping up long before he died, and nobody else has stepped up. regex _has_ kept up, but isn't in the core. So "install regex" is ever more the best advice.

Note that just slamming possessive quantifiers into CPython's engine isn't a good approach for more than just the obvious reasons: possessive quantifiers are themselves just syntax sugar (or chili peppers) for one instance of a more general new feature, "atomic groups". That's another feature that's all but a de facto industry standard now, which Python's re doesn't support (but regex does). Putting in just part of that is half-assed.
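To make the "syntax sugar" claim concrete, here is a small sketch showing that x++ behaves like the atomic group (?>x+), and that atomic groups also cover cases a possessive quantifier can't spell, such as an alternation. As before, it assumes the regex module is installed; the fallback to the stdlib re is my assumption and needs Python 3.11 or later. The patterns and the "engine" alias are illustrative only.

```python
# Sketch: possessive quantifiers as a special case of atomic groups.
try:
    import regex as engine  # third-party module: pip install regex
except ImportError:
    import re as engine  # assumption: Python 3.11+, where re accepts (?>...)

text = "xxxxxk"

# These two patterns behave the same way: once the x's are consumed,
# nothing is given back, so [a-z]{4} cannot match and the search fails.
possessive = engine.search(r"^x++[a-z]{4}k", text)
atomic = engine.search(r"^(?>x+)[a-z]{4}k", text)

# When no backtracking is needed, the atomic group matches normally.
no_backtrack_needed = engine.search(r"^(?>x+)k", text)

# Atomic groups are more general: they can wrap any subpattern.
# Here the group commits to the first alternative that matches ("a"),
# so "ab" is never tried and the following "c" cannot match.
plain_alt = engine.search(r"^(?:a|ab)c", "abc")
atomic_alt = engine.search(r"^(?>a|ab)c", "abc")

print(possessive, atomic)
print(no_backtrack_needed)
print(plain_alt, atomic_alt)
```

The alternation case is the one to stare at: the non-atomic version backtracks into the group and finds "abc", while the atomic version refuses to.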
Now I have *three* problems :-(
You're quite welcome ;-)