[Tutor] Regular expression on python
Peter Otten
__peter__ at web.de
Tue Apr 14 16:37:14 CEST 2015
Steven D'Aprano wrote:
> On Tue, Apr 14, 2015 at 10:00:47AM +0200, Peter Otten wrote:
>> Steven D'Aprano wrote:
>
>> > I swear that Perl has been a blight on an entire generation of
>> > programmers. All they know is regular expressions, so they turn every
>> > data processing problem into a regular expression. Or at least they
>> > *try* to. As you have learned, regular expressions are hard to read,
>> > hard to write, and hard to get correct.
>> >
>> > Let's write some Python code instead.
> [...]
>
>> The tempter took possession of me and dictated:
>>
>> >>> pprint.pprint(
>> ... [(k, int(v)) for k, v in
>> ... re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
>> [('Input Read Pairs', 2127436),
>> ('Both Surviving', 1795091),
>> ('Forward Only Surviving', 17315),
>> ('Reverse Only Surviving', 6413),
>> ('Dropped', 308617)]
>
> Nicely done :-)
>
> I didn't say that it *couldn't* be done with a regex.
I didn't claim that.
> Only that it is
> harder to read, write, etc. Regexes are good tools, but they aren't the
> only tool and as a beginner, which would you rather debug? The extract()
> function I wrote, or r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*" ?
I know a rhetorical question when I see one ;)
> Oh, and for the record, your solution is roughly 4-5 times faster than
> the extract() function on my computer.
I wouldn't be bothered by that. See below if you are.
> If I knew the requirements were
> not likely to change (that is, the maintenance burden was likely to be
> low), I'd be quite happy to use your regex solution in production code,
> although I would probably want to write it out in verbose mode just in
> case the requirements did change:
>
>
> r"""(?x)       (?# verbose mode)
>     (.+?):     (?# capture one or more characters, followed by a colon)
>     \s+        (?# one or more whitespace)
>     (\d+)      (?# capture one or more digits)
>     (?:        (?# don't capture ... )
>       \s+      (?# one or more whitespace)
>       \(.*?\)  (?# anything inside round brackets)
>       )?       (?# ... and optional)
>     \s*        (?# ignore trailing spaces)
>     """
>
>
> That's a hint to people learning regular expressions: start in verbose
> mode, then "de-verbose" it if you must.
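
For what it's worth, the re module also accepts plain "#" comments when a pattern is compiled with the re.VERBOSE flag, which some may find easier to read than (?#...) groups. A minimal sketch, assuming the sample line from parse_jarod.py; the names compact and verbose are made up for illustration:

```python
import re

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")

# The one-line pattern from the thread.
compact = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*")

# The same pattern in verbose mode: whitespace in the pattern is ignored
# and everything from "#" to the end of the line is a comment.
verbose = re.compile(r"""
    (.+?):        # capture one or more characters, followed by a colon
    \s+           # one or more whitespace
    (\d+)         # capture one or more digits
    (?:           # don't capture ...
        \s+       # one or more whitespace
        \(.*?\)   # anything inside round brackets
    )?            # ... and optional
    \s*           # ignore trailing spaces
    """, re.VERBOSE)

pairs_compact = compact.findall(line)
pairs_verbose = verbose.findall(line)
print(pairs_verbose)
```

Both patterns should produce the same list of (key, digits) tuples, so the verbose form can be swapped in without touching the calling code.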
Regarding the speed of the Python approach: you can easily improve that by
relatively minor modifications. The most important one is to avoid the
exception:
$ python parse_jarod.py
$ python3 parse_jarod.py
The regex for reference:
$ python3 -m timeit -s "from parse_jarod import extract_re as extract" "extract()"
100000 loops, best of 3: 18.6 usec per loop
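
If you'd rather measure from inside Python than from the shell, timeit.repeat() does roughly the same job as "python3 -m timeit". A rough sketch, assuming the regex version from parse_jarod.py is copied inline; the loop count and the usec_per_loop name are arbitrary choices for illustration:

```python
import re
import timeit

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")

_findall = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall


def extract_re(line=line):
    return {k: int(v) for k, v in _findall(line)}


# Run the function `number` times per trial, three trials, and keep the
# fastest trial, similar in spirit to the "best of 3" reported by the CLI.
number = 10000
best = min(timeit.repeat(extract_re, repeat=3, number=number))
usec_per_loop = best / number * 1e6
print("%d loops, best of 3: %.1f usec per loop" % (number, usec_per_loop))
```

The absolute numbers depend on the machine and the Python version, so only the ratios between the variants are worth comparing.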
Steven's original extract():
$ python3 -m timeit -s "from parse_jarod import extract_daprano as extract" "extract()"
10000 loops, best of 3: 92.6 usec per loop
Avoid raising ValueError (this won't work with negative numbers):
$ python3 -m timeit -s "from parse_jarod import extract_daprano2 as extract" "extract()"
10000 loops, best of 3: 44.3 usec per loop
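
The caveat about negative numbers is easy to see in isolation. A small sketch; the two helper names are hypothetical and only mirror the per-word conversion step of extract_daprano() and extract_daprano2():

```python
def convert_with_try(word):
    # Conversion as in extract_daprano(): let int() decide, catch the failure.
    try:
        return int(word)
    except ValueError:
        return word


def convert_with_isdigit(word):
    # Conversion as in extract_daprano2(): test first, no exception raised.
    # str.isdigit() is False for "-5" because "-" is not a digit.
    if word.isdigit():
        return int(word)
    return word


for word in ["2127436", "Pairs:", "-5"]:
    print(repr(word), convert_with_try(word), convert_with_isdigit(word))
```

For the sample data the two agree; they only diverge on input such as "-5", which the isdigit() variant leaves as a string.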
Collapse the two loops into one, thus avoiding the accumulator list and the
isinstance() checks:
$ python3 -m timeit -s "from parse_jarod import extract_daprano3 as extract" "extract()"
10000 loops, best of 3: 29.6 usec per loop
Ok, this is still slower than the regex, a result that I cannot accept.
Let's try again:
$ python3 -m timeit -s "from parse_jarod import extract_py as extract" "extract()"
100000 loops, best of 3: 15.1 usec per loop
Eureka? The "winning" code is brittle and probably as hard to understand as
the regex. You can judge for yourself if you're interested:
$ cat parse_jarod.py
import re

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")

_findall = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall

def extract_daprano(line=line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results

def extract_daprano2(line=line):
    words = line.split()
    accumulator = []
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            continue
        if word.isdigit():
            word = int(word)
        accumulator.append(word)
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results

def extract_daprano3(line=line):
    results = {}
    keyparts = []
    for word in line.split():
        if word.startswith("("):
            continue
        if word.isdigit():
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = int(word)
        else:
            keyparts.append(word)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results

def extract_re(line=line):
    return {k: int(v) for k, v in _findall(line)}

def extract_py(line=line):
    key = None
    result = {}
    for part in line.split(":"):
        if key is None:
            key = part
        else:
            value, new_key = part.split(None, 1)
            result[key] = int(value)
            key = new_key.rpartition(")")[-1].strip()
    return result

if __name__ == "__main__":
    assert (extract_daprano() == extract_re() == extract_daprano2()
            == extract_daprano3() == extract_py())
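
To make "brittle" a bit more concrete: extract_py() expects every number to be followed by some further text, either the next key or a bracketed percentage. A short sketch of one way it can fail, using a copy of the function without the default argument; the two shortened input lines are made up for illustration:

```python
def extract_py(line):
    # Same logic as extract_py() in parse_jarod.py.
    key = None
    result = {}
    for part in line.split(":"):
        if key is None:
            key = part
        else:
            # Needs at least two whitespace-separated fields in `part`.
            value, new_key = part.split(None, 1)
            result[key] = int(value)
            key = new_key.rpartition(")")[-1].strip()
    return result


# The last value is followed by a percentage: parsed as expected.
with_percent = extract_py("Input Read Pairs: 10 Dropped: 3 (30.00%)")
print(with_percent)

# The last value is the end of the line: the tuple unpacking fails.
try:
    extract_py("Input Read Pairs: 10 Dropped: 3")
except ValueError as exc:
    error = exc
    print("ValueError:", exc)
else:
    error = None
```

The regex and the word-by-word variants do not depend on that trailing text, which is part of the price paid for the extra speed.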