[Python-ideas] Re: Regex timeouts

Feb. 16, 2022

On Wed, 16 Feb 2022 at 10:15, Steven D'Aprano <steve@pearwood.info> wrote:
...
On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
...
scanf just isn't powerful enough.  For example, consider parsing user
input dates: scanf("%d/%d/%d", &year, &month, &day).  This is nice and
simple, but handling "2022-02-15" as well requires a bit of thinking
and several extra statements in C.  In Python, I guess it would
probably look something like
year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
    raise DateFormatUnacceptableError
# range checks for month and day go here
Assuming that scanf raises if there is no match, I would probably go
with:
Having scanf raise is one option; another option would be to have it
return a partial result, which would raise ValueError when unpacked in
this simple way. (Partial results are FAR easier to debug than a
simple "didn't match", plus they can be extremely useful in some
situations.)
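As a small illustration of the partial-result option: unpacking a tuple that is too short already raises ValueError in Python, so the simple one-liner would still fail loudly. The `partial` tuple here is a made-up stand-in for whatever a hypothetical scanf might return; no such function exists in the standard library.

```python
# Made-up stand-in for what a partial-result scanf might hand back
# after reading "2022-02" against the pattern "%d-%d-%d".
partial = (2022, 2)

try:
    year, month, day = partial
except ValueError as exc:
    # The partial tuple is still around, which is what makes debugging easier.
    error = f"{exc}; got {partial!r}"
```

The point is only that the caller sees how far the match got, rather than a bare "didn't match".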
...
try:
    # Who writes ISO-8601 dates using slashes?
    day, month, year = scanf("%d/%d/%d")
    if ALLOW_TWO_DIGIT_YEARS and len(year) == 2:
        year = "20" + year
except ScanError:
    year, month, day = scanf("%d-%d-%d")
It all depends on what your goal is. Do you want to support multiple
different formats (d/m/y, y-m-d, etc)? Do you want one format with
multiple options for delimiter? Is it okay if someone mismatches
delimiters?

Most likely, I'd not care if someone uses y/m-d, but I wouldn't allow
d/m/y or m/d/y, so I'd write it like this:

year, month, day = scanf("%d%*[-/]%d%*[-/]%d")
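Since Python has no built-in scanf, here is a rough sketch of the same "either delimiter, matched but not captured" idea using the re module. The name `parse_ymd` is hypothetical, and the regex is only an approximation of scanf semantics (for instance, `\d+` does not accept a sign or skip leading whitespace the way `%d` does).

```python
import re

# Approximate counterpart of "%d%*[-/]%d%*[-/]%d": three integers separated
# by '-' or '/', where the separators are matched but not captured.
DATE_RE = re.compile(r"(\d+)[-/](\d+)[-/](\d+)")

def parse_ymd(text):
    """Hypothetical helper: return (year, month, day) as ints, or raise ValueError."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a y-m-d date: {text!r}")
    year, month, day = map(int, m.groups())
    return year, month, day
```

Like the scanf pattern, this accepts mismatched delimiters such as "2022/02-15" and does no range checking on month or day.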

But realistically, if we're doing actual ISO 8601 date parsing, then
*not one of these is correct*, and we should be using an actual ISO
8601 library :) The simple cases like log file parsing are usually
consuming the output of exactly one program, so you can mandate the
delimiter completely. Here's something that can parse the output of
'git blame':

commit, name, y, mo, d, h, mi, s, tz, line, text = \
    scanf("%s (%s %d-%d-%d %d:%d:%d %d %d) %s")

(Of course, you should use --porcelain instead, but this is an example.)
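For comparison, a regex-based sketch of the same parse might look like the following. The sample line is invented and only assumed to resemble default 'git blame' output; real output can differ (for example, it may include a filename), so treat this as an illustration rather than a robust parser.

```python
import re

# Assumed line shape: "<commit> (<name> <date> <time> <tz> <lineno>) <text>"
BLAME_RE = re.compile(
    r"(?P<commit>\S+) "
    r"\((?P<name>.+?)\s+"             # author name, which may contain spaces
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<tz>[-+]\d{4})\s+"           # kept as text so the sign survives
    r"(?P<line>\d+)\) ?"
    r"(?P<text>.*)"
)

# Invented sample line, not taken from a real repository.
sample = "a1b2c3d4 (Chris Angelico 2022-02-16 10:15:00 +1100 42) print('hello')"

m = BLAME_RE.match(sample)
fields = m.groupdict() if m else None
```

Named groups sidestep the problem of running out of distinct short variable names, at the cost of a noticeably longer pattern than the scanf version.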

There's a spectrum of needs, and a spectrum of tools that can fulfil
them. At one extreme, simple method calls, the "in" operator, etc -
very limited, very fast, easy to read. At the other extreme, full-on
language parsers with detailed grammars. In between? Well, sscanf is a
bit simpler than regexp, REXX's parse is probably somewhere near
sscanf, SNOBOL is probably a bit to the right of regexp, etc, etc,
etc. We shouldn't have to stick to a single tool just because it's
capable of spanning a wide range.
...
I think that
year, sep1, month, sep2, day = re.match(r"(\d+)([-/])(\d+)([-/])(\d+)").groups()
might do it (until Tim or Chris tell me that actually is wrong).
Or use \2 as you suggest later on.
Yeah, \2 much more clearly expresses the intent of "take either of
these characters, and then match another of that character".
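A minimal sketch of the \2 version, with the input string passed explicitly. The name `split_date` is hypothetical; the backreference itself is standard re syntax.

```python
import re

# Group 2 captures the first separator; \2 then requires the second
# separator to be the same character.
DATE_RE = re.compile(r"(\d+)([-/])(\d+)\2(\d+)")

def split_date(text):
    """Hypothetical helper: return (year, month, day) or raise ValueError."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"unacceptable date format: {text!r}")
    year, _sep, month, day = m.groups()
    return int(year), int(month), int(day)
```

With this pattern "2022-02-15" and "2022/02/15" both parse, while a mixed "2022/02-15" is rejected without any separate comparison of sep1 and sep2.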

ChrisA

Chris Angelico