
On Wed, 16 Feb 2022 at 10:15, Steven D'Aprano <steve@pearwood.info> wrote:
On Tue, Feb 15, 2022 at 11:51:41PM +0900, Stephen J. Turnbull wrote:
scanf just isn't powerful enough. For example, consider parsing user input dates: scanf("%d/%d/%d", &year, &month, &day). This is nice and simple, but handling "2022-02-15" as well requires a bit of thinking and several extra statements in C. In Python, I guess it would probably look something like
    year, sep1, month, sep2, day = scanf("%d%c%d%c%d")
    if not ('/' == sep1 == sep2 or '-' == sep1 == sep2):
        raise DateFormatUnacceptableError
    # range checks for month and day go here
Assuming that scanf raises if there is no match, I would probably go with:
Having scanf raise is one option; another option would be to have it return a partial result, which would raise ValueError when unpacked in this simple way. (Partial results are FAR easier to debug than a simple "didn't match", plus they can be extremely useful in some situations.)
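
To make the "partial result" option concrete, here is a minimal sketch. The name scanf_partial and its behaviour are assumptions made for illustration only, not an existing or proposed API. It handles only %d and literal characters, and returns whatever it converted before the first mismatch.

```python
import re

_INT = re.compile(r"\s*([-+]?\d+)")

def scanf_partial(fmt, text):
    """Hypothetical helper: handle %d fields and literal characters only.

    Returns a tuple of the values converted before the first mismatch.
    """
    results = []
    pos = 0  # position in text
    i = 0    # position in fmt
    while i < len(fmt):
        if fmt.startswith("%d", i):
            m = _INT.match(text, pos)
            if m is None:
                break  # conversion failed: keep what we have so far
            results.append(int(m.group(1)))
            pos = m.end()
            i += 2
        else:
            if pos >= len(text) or text[pos] != fmt[i]:
                break  # literal mismatch: keep what we have so far
            pos += 1
            i += 1
    return tuple(results)
```

With this sketch, scanf_partial("%d/%d/%d", "2022-02-15") returns (2022,), so a three-way unpack raises ValueError, while the partial tuple stays available to anyone who wants to see how far the match got.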
    try:
        # Who writes ISO-8601 dates using slashes?
        day, month, year = scanf("%d/%d/%d")
        if ALLOW_TWO_DIGIT_YEARS and len(year) == 2:
            year = "20" + year
    except ScanError:
        year, month, day = scanf("%d-%d-%d")
It all depends on what your goal is. Do you want to support multiple different formats (d/m/y, y-m-d, etc)? Do you want one format with multiple options for delimiter? Is it okay if someone mismatches delimiters?

Most likely, I'd not care if someone uses y/m-d, but I wouldn't allow d/m/y or m/d/y, so I'd write it like this:

    year, month, day = scanf("%d%*[-/]%d%*[-/]%d")

But realistically, if we're doing actual ISO 8601 date parsing, then *not one of these is correct*, and we should be using an actual ISO 8601 library :)

The simple cases like log file parsing are usually consuming the output of exactly one program, so you can mandate the delimiter completely. Here's something that can parse the output of 'git blame':

    commit, name, y,mo,d, h,mi,s, tz, line, text = \
        scanf("%s (%s %d-%d-%d %d:%d:%d %d %d) %s")

(Of course, you should use --porcelain instead, but this is an example.)

There's a spectrum of needs, and a spectrum of tools that can fulfil them. At one extreme, simple method calls, the "in" operator, etc - very limited, very fast, easy to read. At the other extreme, full-on language parsers with detailed grammars. In between? Well, sscanf is a bit simpler than regexp, REXX's parse is probably somewhere near sscanf, SNOBOL is probably a bit to the right of regexp, etc, etc, etc. We shouldn't have to stick to a single tool just because it's capable of spanning a wide range.
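
For comparison, here is roughly what the blame-style format string looks like when written with the re module that exists today. This is a sketch only: parse_blame_line and the sample line are invented for illustration, the sample is a simplified line shaped like the format string rather than real 'git blame' output, and the final field captures the rest of the line (an assumption; C's %s would stop at the first whitespace).

```python
import re

# One group per conversion in "%s (%s %d-%d-%d %d:%d:%d %d %d) %s".
BLAME_RE = re.compile(
    r"(\S+) \((\S+) (\d+)-(\d+)-(\d+) (\d+):(\d+):(\d+) ([-+]?\d+) (\d+)\) (.*)"
)

def parse_blame_line(line):
    """Hypothetical helper: split one blame-shaped line into typed fields."""
    m = BLAME_RE.fullmatch(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    commit, name, *numbers, text = m.groups()
    # The eight middle groups are all integers, as %d would produce.
    y, mo, d, h, mi, s, tz, lineno = map(int, numbers)
    return commit, name, (y, mo, d), (h, mi, s), tz, lineno, text

# Made-up sample input, not real git output.
sample = "a1b2c3d4 (alice 2022-02-16 10:15:00 +1100 42) print('hi')"
fields = parse_blame_line(sample)
```

The regex does the job, but each field needs a capture group plus a separate int() conversion, which is the bookkeeping a scanf-style format string would fold into the pattern itself.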
I think that

    year, sep1, month, sep2, day = re.match(r"(\d+)([-/])(\d+)([-/])(\d+)").groups()

might do it (until Tim or Chris tell me that actually is wrong).
Or use \2 as you suggest later on.
Yeah, \2 much more clearly expresses the intent of "take either of these characters, and then match another of that character".

ChrisA
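
A small sketch of the \2 variant discussed above, using the standard re module. The function name parse_date and the choice of fullmatch are illustrative assumptions, and range checks for month and day are still left out.

```python
import re

# Group 2 captures the first separator; \2 requires the second to be identical.
DATE_RE = re.compile(r"(\d+)([-/])(\d+)\2(\d+)")

def parse_date(text):
    """Hypothetical helper: accept y-m-d or y/m/d, reject mixed separators."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"unacceptable date format: {text!r}")
    year, _sep, month, day = m.groups()
    return int(year), int(month), int(day)
```

Here parse_date("2022-02-15") and parse_date("2022/02/15") both succeed, while parse_date("2022/02-15") raises ValueError because the second separator does not repeat the first.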