[Python-ideas] Re: Regex timeouts

Feb. 15, 2022

Tim Peters wrote:
...
"""
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.”  Now they have two problems.
- Jamie Zawinski
"""
Maybe so, but I'm committed now :). I have dozens of regexes to parse specific log messages I'm interested in. I made a little DSL that uses regexes with capture groups: if the regex matches, it takes the resulting groupdict and optionally applies further transformations to the individual fields. This lets me specify very concisely what I want to extract before doing further analysis and aggregation on the resulting fields. For example:

flush_end = Rule(
    Capture(
        # Completed flushing /u01/data02/tb_tbi_project02_prd/data_launch_index-4a5f72725b7211eaab635720a1b8a299/aa-26507-bti-Data.db (46.528MiB) for commitlog position CommitLogPosition(segmentId=1615955816662, position=223538288)
        # Completed flushing /dse/data02/OpsCenter/rollup_state-7b621931ab7511e8b862810a639403e5/bb-21969-bti-Data.db (7.763MiB/2.197MiB on disk/1 files) for commitlog position CommitLogPosition(segmentId=1637403836277, position=9927158)
        r"Completed flushing (?P<sstable>[^ ]+) \((?P<bytes_flushed>[^)/]+)(/(?P<bytes_on_disk>[^ ]+) on disk/(?P<file_count>[^ ]+) files)?\) for commitlog position CommitLogPosition\(segmentId=(?P<commitlog_segment>[^,]+), position=(?P<commitlog_position>[^)]+)\)"
    ),
    Convert(
        normval,
        "bytes_flushed",
        "bytes_on_disk",
        "commitlog_segment",
        "commitlog_position",
    ),
    table_from_sstable,
)
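
As a rough illustration of what a rule like this boils down to, here is a plain `re` sketch that applies the same pattern to the two sample log lines from the comments and post-processes the groupdict. It is not the DSL itself: `extract` and `to_number` are hypothetical names, and `to_number` is only a stand-in for whatever `normval` really does.

```python
import re

# Pattern copied from the Capture(...) call above, split across lines for readability.
FLUSH_END = re.compile(
    r"Completed flushing (?P<sstable>[^ ]+) \((?P<bytes_flushed>[^)/]+)"
    r"(/(?P<bytes_on_disk>[^ ]+) on disk/(?P<file_count>[^ ]+) files)?\)"
    r" for commitlog position CommitLogPosition\("
    r"segmentId=(?P<commitlog_segment>[^,]+), position=(?P<commitlog_position>[^)]+)\)"
)


def to_number(value):
    """Hypothetical stand-in for normval: digit-only strings become ints."""
    if value is not None and value.isdigit():
        return int(value)
    return value


def extract(line, convert=("file_count", "commitlog_segment", "commitlog_position")):
    """Return the converted groupdict for a matching line, or None."""
    m = FLUSH_END.search(line)
    if m is None:
        return None
    fields = m.groupdict()  # optional groups that did not take part are None
    for name in convert:
        fields[name] = to_number(fields[name])
    return fields


line1 = (
    "Completed flushing /u01/data02/tb_tbi_project02_prd/"
    "data_launch_index-4a5f72725b7211eaab635720a1b8a299/aa-26507-bti-Data.db "
    "(46.528MiB) for commitlog position "
    "CommitLogPosition(segmentId=1615955816662, position=223538288)"
)
line2 = (
    "Completed flushing /dse/data02/OpsCenter/"
    "rollup_state-7b621931ab7511e8b862810a639403e5/bb-21969-bti-Data.db "
    "(7.763MiB/2.197MiB on disk/1 files) for commitlog position "
    "CommitLogPosition(segmentId=1637403836277, position=9927158)"
)

for line in (line1, line2):
    print(extract(line))
```

The first sample line has no "on disk" part, so `bytes_on_disk` and `file_count` come back as `None` there.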

I know there are specialized tools like Logstash, but it's nice to be able to specify the extraction and subsequent analysis together in Python.
...
reason to change that. Naive regexps are both clumsy and prone to bad
timing in many tasks that "should be" very easy to express. For
example, "now match up to the next occurrence of 'X'". In SNOBOL and
Icon, that's trivial. 75% of regexp users will write ".*X", with scant
understanding that it may match waaaay more than they intended.
Another 20% will write ".*?X", with scant understanding that it may
extend beyond _just_ "the next" X in some cases. That leaves the happy
5% who write "[^X]*X", which finally says what they intended from the
start.
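
The difference between the three spellings can be seen with the standard `re` module. The input string and the trailing `\d` are made up for illustration; they are only there to show the lazy form running past the first X once something has to match after it.

```python
import re

text = "aXbX1"

# With nothing after the X, only the greedy form overshoots.
greedy = re.match(r".*X", text).group()       # runs to the last X
lazy = re.match(r".*?X", text).group()        # stops at the first X here
negated = re.match(r"[^X]*X", text).group()   # can never cross an X

# Require a digit after the X: the lazy form now stretches past the
# first X to find one, while the negated class refuses to match at all.
lazy_ext = re.match(r".*?X\d", text)
negated_ext = re.match(r"[^X]*X\d", text)

print(greedy, lazy, negated)
print(lazy_ext.group() if lazy_ext else None, negated_ext)
```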
If you look at my regex in the example above, you will see that "[^X]*X" is exactly what I did. The pathological case arose from a simple typo: I had an extra + after a capture group that I failed to notice. It somehow worked correctly on the expected input but ran forever when the expected terminating character appeared more times than expected in the input string.
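
A minimal sketch of how that kind of typo can blow up, using made-up patterns rather than the actual regex from the bug. The only difference between `intended` and `typo` is a stray `+` after the first group. On well-formed input both match immediately. When the separator shows up one extra time, the match has to fail, and the nested quantifier lets the engine try every way of splitting the first field before giving up.

```python
import re
import time

# Hypothetical two-field pattern; the second has a stray "+" after the group.
intended = re.compile(r"(?P<key>[^,]+),(?P<value>[^,]+)$")
typo = re.compile(r"(?P<key>[^,]+)+,(?P<value>[^,]+)$")


def time_match(pattern, text):
    """Return (match object or None, elapsed seconds) for pattern.match(text)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# Expected input: one separator. Both patterns match right away.
good = "a" * 20 + ",v"
print(intended.match(good).groupdict() == typo.match(good).groupdict())

# Unexpected input: the separator appears once more than expected.
# Sizes are kept small on purpose so the script finishes quickly.
for n in (12, 16, 20):
    bad = "a" * n + ",v,extra"
    _, t_ok = time_match(intended, bad)
    _, t_typo = time_match(typo, bad)
    print(f"n={n:2d}  intended={t_ok:.6f}s  typo={t_typo:.6f}s")
```

With a backtracking engine like the one in `re`, the time for `typo` typically grows very steeply with the length of the first field, which is why a log line only a few dozen characters longer can appear to hang.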

J.B. Langston