
On Mon, Feb 14, 2022 at 05:13:38PM -0600, Tim Peters wrote:
> An interesting lesson nobody wants to learn: the original major string-processing language, SNOBOL, had powerful pattern matching but no regexps. Griswold's more modern successor language, Icon, found no reason to change that.
I've been interested in the existence of SNOBOL string scanning for a long time, but I know very little about it. How does it differ from regexes, and why have programming languages pretty much standardised on regexes rather than other forms of string matching?
> Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial. 75% of regexp users will write ".*X", with scant understanding that it may match waaaay more than they intended.
Indeed, I've been bitten by that many times :-)
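For anyone who hasn't been bitten yet, here is a minimal sketch of the overshoot. The string "aXbXc" is just a toy input made up for illustration:

```python
import re

text = "aXbXc"  # toy input, assumed for illustration

# Greedy ".*" grabs as much as it can, then backs off only as far as
# the LAST 'X', not the next one.
greedy = re.match(".*X", text)
print(greedy.group())  # aXbX
```

The intent was "a" plus the first X, but the match runs through the second X.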
> Another 20% will write ".*?X", with scant understanding that it may extend beyond _just_ "the next" X in some cases.
But this surprises me. Do you have an example?
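The only way I can make it happen myself is when more pattern follows the X, so this is a guess at the kind of case meant (my own toy pattern and string, not an example from the quoted message):

```python
import re

text = "aXbX1"  # toy input, assumed for illustration

# Non-greedy ".*?" stops at the first 'X' only if the REST of the pattern
# matches there. "\d" fails after the first 'X' (next char is 'b'), so the
# engine backtracks and lets ".*?" swallow that 'X' and carry on.
lazy = re.match(r".*?X\d", text)
print(lazy.group())  # aXbX1

# The negated class cannot step over an 'X', so the same anchored match fails.
strict = re.match(r"[^X]*X\d", text)
print(strict)  # None
```

With nothing after the X, ".*?X" and "[^X]*X" seem to agree on this input; the difference only shows up once a later part of the pattern forces backtracking.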
> That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
Doesn't that only work if X is literally a single character?
>>> import re
>>> string = "This is some spam and extra spam."
>>> re.search('[^spam]*spam', string)
<re.Match object; span=(11, 17), match='e spam'>
Whereas this seems to do what I expected:
>>> re.search('.*?spam', string)
<re.Match object; span=(0, 17), match='This is some spam'>
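For a multi-character X, the closest regex analogue of "[^X]*X" that I know of is a negative lookahead checked at every position; plain str.partition does the same job with no regex at all. A sketch of both, reusing the same string (the variable names are mine):

```python
import re

string = "This is some spam and extra spam."

# Consume one character at a time, but only at positions where "spam"
# does not start; then require "spam" itself.
m = re.search(r"(?:(?!spam).)*spam", string)
print(m.span(), repr(m.group()))  # (0, 17) 'This is some spam'

# Non-regex alternative: split at the first occurrence of "spam".
head, sep, tail = string.partition("spam")
up_to_first = head + sep
print(repr(up_to_first))  # 'This is some spam'
```

Note that "." does not match a newline unless re.DOTALL is passed, so the lookahead version as written only works within a single line.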
-- Steve