regexp: extracting multiple multiline groups
Alex Martelli
aleax at aleax.it
Fri Oct 4 04:12:31 EDT 2002
<posted & mailed>
Steven Bethard wrote:
> I have an input file that looks something like:
>
> --- 1
> A description that
> could be multiple lines
> --- 2
> Another description
> ...
>
> I'd like to extract both the number and the corresponding description for
> each entry. Right now, I do this by:
>
> docNumbersMatcher = re.compile(r"^--- (\d+)$", re.MULTILINE)
> docNumbers = docNumbersMatcher.findall(output)
>
> docBoundaryMatcher = re.compile("^--- \d+$", re.MULTILINE)
> docs = docBoundaryMatcher.split(output)
>
> However, it seems a waste to run through the same document twice with
> essentially the same expression. Is there a way to do this with a single
> pass? I've tried a few things, but they typically take too much or too
> little. For example:
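You need non-greedy matching and a lookahead assertion: the non-greedy
(.*?) takes as little text as possible, and the lookahead (?=...) stops it
just before the next '--- N' line without consuming it (\Z covers the last
entry, which has no following header):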
import re
m = re.compile(r'^--- (\d+)\n(.*?)((?=--- \d+\n)|\Z)',
               re.MULTILINE | re.DOTALL)
output = """\
--- 1
A description that
could be multiple lines
--- 2
Another description
also potentially multiline
--- 3
But descriptions can also be singleline
--- 4
although they
need not
be
"""
print(m.findall(output))
This prints:
[alex at lancelot ba]$ python mure.py
[('1', 'A description that\ncould be multiple lines\n', ''), ('2', 'Another description\nalso potentially multiline\n', ''), ('3', 'But descriptions can also be singleline\n', ''), ('4', 'although they\nneed not\nbe\n', '')]
If the empty group '' at the end of each tuple is a bother, you can
use non-capturing parentheses, (?:...), instead:
m = re.compile(r'^--- (\d+)\n(.*?)(?:(?=--- \d+\n)|\Z)',
               re.MULTILINE | re.DOTALL)
and now:
[alex at lancelot ba]$ python mure.py
[('1', 'A description that\ncould be multiple lines\n'), ('2', 'Another description\nalso potentially multiline\n'), ('3', 'But descriptions can also be singleline\n'), ('4', 'although they\nneed not\nbe\n')]
[alex at lancelot ba]$
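If you would rather walk the entries one at a time than get a list of
tuples, the same pattern also works with finditer. This is only a sketch,
written for Python 3; the group names "num" and "body" and the variable
names are my own illustrative choices, and the sample text is shortened:

```python
import re

# Same pattern as above, but with named groups. The names "num" and
# "body" are illustrative choices, not anything the re module requires.
entry = re.compile(r'^--- (?P<num>\d+)\n(?P<body>.*?)(?:(?=--- \d+\n)|\Z)',
                   re.MULTILINE | re.DOTALL)

sample = """\
--- 1
A description that
could be multiple lines
--- 2
Another description
"""

# One pass over the text: each match carries both the number and its text.
entries = {}
for match in entry.finditer(sample):
    entries[int(match.group('num'))] = match.group('body')

print(entries)
```

Keying a dict by the entry number is just one convenient shape for the
result; a list of (number, description) pairs would do equally well if the
numbers can repeat.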
One more detail -- get into the habit of using r'...' notation (raw
string literals) for the pattern strings of REs (your second re.compile
call above does not) -- or else one day or another some \ escape sequence
you didn't know or think about will produce surprising results...
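A small illustration of the kind of surprise I mean, as a sketch for
Python 3 (the sample text and variable names are made up for the example).
\d happens to survive in a plain string because Python does not treat it
as an escape, but \b does not:

```python
import re

text = "a cat sat"

# In a plain string literal, "\b" is the backspace character (0x08),
# so this pattern looks for backspace + "cat" + backspace.
plain = re.findall("\bcat\b", text)

# In a raw string literal the backslashes reach the re module intact,
# where \b means "word boundary".
raw = re.findall(r"\bcat\b", text)

print(plain)  # the text contains no backspace characters
print(raw)
```

The plain-string version finds nothing, while the raw-string version
finds the word, even though the two patterns look identical apart from
the r prefix.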
Alex