grimace: a fluent regular expression generator in Python

Roy Smith roy at panix.com
Wed Jul 17 16:22:00 CEST 2013


In article <mailman.4772.1373978931.3114.python-list at python.org>,
 "Anders J. Munch" <2013 at jmunch.dk> wrote:

> The problem with Perl-style regexp notation isn't so much that it's terse -
> it's that the syntax is irregular (sic) and doesn't follow modern principles
> for lexical structure in computer languages.

There seem to be three basic ways to denote what's literal and what's 
not.

1) The Python (and C, Java, PHP, Fortran, etc) way, where all text is 
assumed to be evaluated as a language construct, unless explicitly 
quoted to make it a literal.

2) The shell way, where all text is assumed to be literal strings, 
unless explicitly marked with a $ (or other sigil) as a variable.

3) The regex way, where some characters are magic, but only sometimes 
(depending on context), and you just have to know which ones they are, 
and when, and can escape them to make them non-magic if you have to.
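
To make case 3 concrete, here is a small sketch using Python's re module. 
The sample strings are made up for illustration; the point is only that 
the same character is magic in one position and literal in another.

```python
import re

# '.' is a wildcard outside a character class, but literal inside one.
dot_wild = bool(re.fullmatch(r'a.c', 'abc'))
dot_in_class = bool(re.fullmatch(r'a[.]c', 'abc'))

# '^' anchors at the start of a pattern, but negates at the start of a class.
caret_anchor = re.findall(r'^a', 'aba')
caret_negate = re.findall(r'[^a]', 'aba')

# '-' is literal outside a class, and a range operator between class members.
dash_literal = re.findall(r'a-c', 'a-c b')
dash_range = re.findall(r'[a-c]', 'a-c b')

# re.escape() backslash-escapes the special characters, so the whole
# string is matched literally.
escaped = re.escape('1+1=2?')
escaped_ok = bool(re.fullmatch(escaped, '1+1=2?'))

print(dot_wild, dot_in_class)
print(caret_anchor, caret_negate)
print(dash_literal, dash_range)
print(escaped_ok)
```
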

Where things get really messy is when you try to embed one language into 
another, such as regexes in Python.  Perl (and awk, from which it 
evolved) solves the problem in its own way by making regexes a built-in 
part of the language syntax.  Python goes in the other direction, and 
says regexes are just strings that you pass around.
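
One consequence of "regexes are just strings" is that Python's own string 
escapes get applied before the regex engine sees anything.  A minimal 
sketch, with a made-up sample string, of why raw string literals are the 
usual habit:

```python
import re

# In an ordinary string literal, Python turns '\b' into a backspace
# character before the regex engine is involved at all.
plain = '\bcat\b'

# In a raw string literal the backslashes survive, so the regex engine
# receives \b and treats it as a word boundary.
raw = r'\bcat\b'

text = 'a cat sat'
plain_hit = re.search(plain, text)  # searches for backspace, 'cat', backspace
raw_hit = re.search(raw, text)      # searches for the whole word 'cat'

print(len(plain), len(raw))
print(plain_hit, raw_hit)
```

The two patterns look identical on the page apart from the r prefix, but 
they are different strings of different lengths, and only the raw one 
finds the word.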

> You can get a long way just by ignoring whitespace, putting literals 
> in quotes and allowing embedded comments.  Setting the re.VERBOSE 
> flag achieves two out of three [example elided].

Yup.  Here's a more complex example.  We use this to parse haproxy log 
files (it's probably going to be munged a bit as lines get refolded by 
news software).  That would be insane without verbose mode (some might 
argue it's insane now, but that's another thread).

import re

pattern = re.compile(r'haproxy\[(?P<pid>\d+)]: '
                     r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):'
                     r'(?P<client_port>\d{1,5}) '
                     r'\[(?P<accept_date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})] '
                     r'(?P<frontend_name>\S+) '
                     r'(?P<backend_name>\S+)/'
                     r'(?P<server_name>\S+) '
                     r'(?P<Tq>(-1|\d+))/'
                     r'(?P<Tw>(-1|\d+))/'
                     r'(?P<Tc>(-1|\d+))/'
                     r'(?P<Tr>(-1|\d+))/'
                     r'(?P<Tt>\+?\d+) '
                     r'(?P<status_code>\d{3}) '
                     r'(?P<bytes_read>\d+) '
                     r'(?P<captured_request_cookie>\S+) '
                     r'(?P<captured_response_cookie>\S+) '
                     r'(?P<termination_state>[\w-]{4}) '
                     r'(?P<actconn>\d+)/'
                     r'(?P<feconn>\d+)/'
                     r'(?P<beconn>\d+)/'
                     r'(?P<srv_conn>\d+)/'
                     r'(?P<retries>\d+) '
                     r'(?P<srv_queue>\d+)/'
                     r'(?P<backend_queue>\d+) '
                     r'(\{(?P<request_id>.*?)\} )?'   # Comment this out for a stock haproxy (see above)
                     r'(\{(?P<captured_request_headers>.*?)\} )?'
                     r'(\{(?P<captured_response_headers>.*?)\} )?'
                     r'"(?P<http_request>.+)"'
                     )
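
As posted, the pattern is assembled from adjacent raw string literals, 
which Python concatenates into one string, and no re.VERBOSE flag is 
passed to re.compile.  For comparison, here is a rough sketch of how the 
first three fields might look in actual verbose mode.  The name `prefix` 
and the sample text are made up for illustration; this is not a real 
haproxy log line, and the rest of the pattern is left out.

```python
import re

# Only the first three fields, rewritten for re.VERBOSE.  In verbose mode
# unescaped whitespace in the pattern is ignored, so each literal space
# has to be written explicitly, here as '[ ]'.
prefix = re.compile(r'''
    haproxy\[(?P<pid>\d+)\]:[ ]                # process id
    (?P<client_ip>(\d{1,3}\.){3}\d{1,3}):      # IPv4 client address
    (?P<client_port>\d{1,5})[ ]                # client port
''', re.VERBOSE)

# Hypothetical sample text, invented for this example.
sample = 'haproxy[1234]: 10.0.0.1:54321 [rest of line]'

m = prefix.search(sample)
fields = m.groupdict() if m else {}
print(fields)
```

The trade-off is the one described above: whitespace and comments come 
for free, but literal spaces now need marking, which is the missing 
"literals in quotes" third of the wish list.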


