[Tutor] Regular expression on python

Wed Apr 15 10:50:04 CEST 2015

--------------------------------------------
On Tue, 4/14/15, Peter Otten <__peter__ at web.de> wrote:

 Subject: Re: [Tutor] Regular expression on python
 To: tutor at python.org
 Date: Tuesday, April 14, 2015, 4:37 PM

 Steven D'Aprano wrote:

 > On Tue, Apr 14, 2015 at 10:00:47AM +0200, Peter Otten wrote:
 >> Steven D'Aprano wrote:
 > 
 >> > I swear that Perl has been a blight on an entire generation of
 >> > programmers. All they know is regular expressions, so they turn every
 >> > data processing problem into a regular expression. Or at least they
 >> > *try* to. As you have learned, regular expressions are hard to read,
 >> > hard to write, and hard to get correct.
 >> > 
 >> > Let's write some Python code instead.
 > [...]
 > 
 >> The tempter took possession of me and dictated:
 >> 
 >> >>> pprint.pprint(
 >> ... [(k, int(v)) for k, v in
 >> ... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
 >> [('Input Read Pairs', 2127436),
 >>  ('Both Surviving', 1795091),
 >>  ('Forward Only Surviving', 17315),
 >>  ('Reverse Only Surviving', 6413),
 >>  ('Dropped', 308617)]
 > 
 > Nicely done :-)
 > 

Yes, nice, but why do you use
re.compile(regex).findall(line)
and not
re.findall(regex, line)?

I know what re.compile is for. I often call it outside a loop and then use the compiled regex inside the loop; I just haven't seen it used the way you use it before.
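
To make the question concrete, here is a minimal sketch comparing the two spellings. The sample line is my own reconstruction from the output quoted above (the bracketed percentages are illustrative), and the names via_compile and via_module are made up for this example:

```python
import re

# Assumed sample input, reconstructed from the output quoted above;
# the bracketed percentages are illustrative.
line = ("Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) Dropped: 308617 (14.51%)")

regex = r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*"

# Spelling 1: compile explicitly, then call the method on the pattern object.
via_compile = re.compile(regex).findall(line)

# Spelling 2: module-level function; the re module compiles (and caches)
# the pattern behind the scenes.
via_module = re.findall(regex, line)

# Both spellings should yield the same list of (key, digits) tuples.
print(via_compile == via_module)

pairs = [(k, int(v)) for k, v in via_module]
print(pairs)
```

As far as I can tell the two calls are interchangeable for a one-off search like this one, so the choice looks like a matter of style.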

 > I didn't say that it *couldn't* be done with a regex. 

 I didn't claim that.

 > Only that it is
 > harder to read, write, etc. Regexes are good tools, but they aren't the
 > only tool and as a beginner, which would you rather debug? The extract()
 > function I wrote, or r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*" ?

 I know a rhetorical question when I see one ;)

 > Oh, and for the record, your solution is roughly 4-5 times faster than
 > the extract() function on my computer. 

 I wouldn't be bothered by that. See below if you are.

 > If I knew the requirements were
 > not likely to change (that is, the maintenance burden was likely to be
 > low), I'd be quite happy to use your regex solution in production code,
 > although I would probably want to write it out in verbose mode just in
 > case the requirements did change:
 > 
 > 
 > r"""(?x)    (?# verbose mode)

Personally, I prefer to be verbose about being verbose, i.e. to use the re.VERBOSE flag. But perhaps that's just a matter of taste. Are there any use cases where the inline (?iLmsux) flags are clearly a better choice than the equivalent flag argument? For me, the mental burden of a regex is big enough already without them.

 >     (.+?):  (?# capture one or more character, followed by a colon)
 >     \s+     (?# one or more whitespace)
 >     (\d+)   (?# capture one or more digits)
 >     (?:     (?# don't capture ... )
 >       \s+       (?# one or more whitespace)
 >       \(.*?\)   (?# anything inside round brackets)
 >       )?        (?# ... and optional)
 >     \s*     (?# ignore trailing spaces)
 >     """
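
For comparison, a minimal sketch of the flag-based spelling I have in mind: the same pattern, but with re.VERBOSE passed as an argument and ordinary "#" comments in place of the (?# ...) groups. The names VERBOSE_PATTERN, flag_regex, inline_regex and the sample string are mine, for illustration only:

```python
import re

# Same pattern as the one-liner quoted earlier, laid out for verbose mode.
VERBOSE_PATTERN = r"""
    (.+?):      # capture one or more characters, followed by a colon
    \s+         # one or more whitespace
    (\d+)       # capture one or more digits
    (?:         # don't capture ...
      \s+       #   one or more whitespace
      \(.*?\)   #   anything inside round brackets
    )?          # ... and optional
    \s*         # ignore trailing spaces
"""

# Verbose mode requested through the flags argument.
flag_regex = re.compile(VERBOSE_PATTERN, re.VERBOSE)

# Verbose mode requested inline; the (?x) must sit at the very start.
inline_regex = re.compile("(?x)" + VERBOSE_PATTERN)

# Hypothetical, shortened sample input.
sample = "Both Surviving: 1795091 (84.38%) Dropped: 308617 (14.51%)"

print(flag_regex.findall(sample))
print(flag_regex.findall(sample) == inline_regex.findall(sample))
```

Both patterns should behave identically here; the only difference is where the reader learns that verbose mode is on.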

<snip>