Matching strings with regexps

Tim Peters tim_one at email.msn.com
Sat Sep 11 20:41:12 EDT 1999


The introduction of minimal-match quantifiers has created a new rash of bad
ways to try to match strings with regular expressions.  "bad" == they don't
match what they're intended to match, take a depressingly long time to
match, and/or take a depressingly long time to fail to match when looking at
non-strings.

Following are the only ways to match strings you'll be happy with, because
they're the best ways.  "best" == they match what they're supposed to match,
they match strings quickly without internal backtracking, and they detect
non-strings quickly without internal backtracking either.

They're all alternative-free instances of what Jeffrey Friedl calls
"unrolling" in his book "Mastering Regular Expressions", which see for a
thorough explanation:

    normal* (?: special normal* )*

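The regexps below are intended to be compiled via re.compile, passing only
re.VERBOSE.  Wrap them in a triple-quoted r-string first; delimiting with '''
rather than """ keeps the pattern's own double-quotes from colliding with the
closing delimiter.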
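
As a minimal sketch of those mechanics, here is the first regexp from the list below laid out over the normal*/special skeleton and compiled with only re.VERBOSE.  The names UNROLLED, NAIVE and text are made up for illustration, and the minimal-match pattern is just one example of the kind of attempt that goes wrong, not one quoted from anywhere.

```python
import re

# Triple single quotes delimit the r-string so the double quotes inside
# the pattern cannot end it early.
UNROLLED = re.compile(r'''
    "                     # opening quote
    [^"\\\n]*             # normal*
    (?: \\[\000-\377]     # special: backslash plus the escaped character
        [^"\\\n]*         # normal*
    )*
    "                     # closing quote
''', re.VERBOSE)

# A minimal-match attempt, shown only for contrast (hypothetical example).
NAIVE = re.compile(r'".*?"')

text = r'"say \"hi\"" and more'
print(UNROLLED.match(text).group())   # the whole quoted string
print(NAIVE.match(text).group())      # stops at the first escaped quote
```

The unrolled form returns the complete string, escaped quotes included, while the minimal-match form gives up at the first double-quote it sees, escaped or not.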

give-a-man-a-meal-and-you've-fed-him-for-a-day-
    give-a-man-a-regexp-and-you-may-as-well-kill-him-ly y'rs  - tim


Double-quoted, all the usual backslash escapes allowed, including
backslash-newline to span lines (these are Python/C strings):

    " [^"\\\n]* (?: \\[\000-\377] [^"\\\n]* )* "
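
A small sketch of how this one behaves around line breaks; C_STRING and the sample strings are illustrative names only.  One caveat that postdates the post: with a Python 3 str pattern, [\000-\377] covers only code points 0 through 255, so a backslash followed by a character outside that range will not match.

```python
import re

C_STRING = re.compile(r'''
    " [^"\\\n]* (?: \\[\000-\377] [^"\\\n]* )* "
''', re.VERBOSE)

# A backslash-newline pair inside the quotes is accepted ...
continued = '"first half \\\nsecond half" trailing'
# ... but a bare newline inside the quotes is not.
broken = '"first half \nsecond half" trailing'

print(C_STRING.match(continued) is not None)
print(C_STRING.match(broken) is not None)
```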

Double-quoted, can't span newlines, but the other usual backslash escapes
allowed:

    " [^"\\\n]* (?: \\. [^"\\\n]* )* "
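
A quick check of the difference, again only a sketch with made-up names.  It relies on re.DOTALL not being passed, so that "." refuses to match a newline.

```python
import re

ONE_LINE = re.compile(r'''
    " [^"\\\n]* (?: \\. [^"\\\n]* )* "
''', re.VERBOSE)

# Ordinary escapes, including an escaped quote, are fine.
escaped = r'"tab:\t quote:\" done" rest'
# Backslash-newline is rejected, because "." does not match a newline here.
continued = '"first half \\\nsecond half"'

print(ONE_LINE.match(escaped).group())
print(ONE_LINE.match(continued) is None)
```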

Double-quoted, all the usual backslash escapes allowed, but nothing special
needed to span lines (these are the kind of double-quoted strings that
appear in CSV (comma-separated values) files produced by some MS software):

    " [^"\\]* (?: \\[\000-\377] [^"\\]* )* "
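
For illustration, a field containing both a bare newline and an escaped quote; MULTILINE and field are hypothetical names, and the sample data is invented.

```python
import re

MULTILINE = re.compile(r'''
    " [^"\\]* (?: \\[\000-\377] [^"\\]* )* "
''', re.VERBOSE)

# The bare newline is consumed by the "normal" class; the escaped quote
# is consumed by the "special" branch.
field = '"line one\nline two, with a \\" inside",next'

m = MULTILINE.match(field)
print(repr(m.group()))
```

The match ends at the closing quote, leaving the comma and the next field untouched.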

Any of the above, but the trailing double-quote may be missing:  replace the
trailing

    "

with

    "?

The trailing quote will be consumed if it's there.  If it's not there, the
remainder of the line will get sucked up for the kind of regexp that doesn't
allow line-spanning without a backslash escape, and the remainder of the
entire string will get sucked up for the kind of regexp that allows
line-spanning without an escape.
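
A sketch of both behaviours on an unterminated string.  LINE_BOUND and ANY_LINE are illustrative names for the second and third regexps above with the trailing quote made optional.

```python
import re

# Cannot span lines without a backslash escape; trailing quote optional.
LINE_BOUND = re.compile(r'''
    " [^"\\\n]* (?: \\. [^"\\\n]* )* "?
''', re.VERBOSE)

# May span lines freely; trailing quote optional.
ANY_LINE = re.compile(r'''
    " [^"\\]* (?: \\[\000-\377] [^"\\]* )* "?
''', re.VERBOSE)

unterminated = '"never closed\nnext line'
closed = '"closed" rest'

print(repr(LINE_BOUND.match(unterminated).group()))  # stops at end of line
print(repr(ANY_LINE.match(unterminated).group()))    # runs to end of input
print(repr(LINE_BOUND.match(closed).group()))        # quote is consumed
```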

Single-quoted:  as above, after s/"/'/g <wink>.
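
In Python the substitution can be done on the pattern text itself before compiling.  A minimal sketch, with DOUBLE, SINGLE and single_re as made-up names:

```python
import re

# The double-quoted, single-line pattern from above, kept as plain text.
DOUBLE = r''' " [^"\\\n]* (?: \\. [^"\\\n]* )* " '''

# The Python spelling of s/"/'/g: swap every double quote for a single one.
SINGLE = DOUBLE.replace('"', "'")
single_re = re.compile(SINGLE, re.VERBOSE)

m = single_re.match(r"'it\'s fine' rest")
print(m.group())
```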





