regexp: extracting multiple multiline groups

Alex Martelli aleax at aleax.it
Fri Oct 4 04:12:31 EDT 2002


<posted & mailed>

Steven Bethard wrote:
> I have an input file that looks something like:
> 
> --- 1
> A description that
> could be multiple lines
> --- 2
> Another description
> ...
> 
> I'd like to extract both the number and the corresponding description for
> each entry.  Right now, I do this by:
> 
>  docNumbersMatcher = re.compile(r"^--- (\d+)$", re.MULTILINE)
>  docNumbers = docNumbersMatcher.findall(output)
> 
>  docBoundaryMatcher = re.compile("^--- \d+$", re.MULTILINE)
>  docs = docBoundaryMatcher.split(output)
> 
> However, it seems a waste to run through the same document twice with
> essentially the same expression.  Is there a way to do this with a single
> pass?  I've tried a few things, but they typically take too much or too
> little.  For example:

You need non-greedy matching and a lookahead assertion:

import re

m = re.compile(r'^--- (\d+)\n(.*?)((?=--- \d+\n)|\Z)',
    re.MULTILINE | re.DOTALL)

output = """\
--- 1
A description that
could be multiple lines
--- 2
Another description
also potentially multiline
--- 3
But descriptions can also be singleline
--- 4
although they
need not
be
"""

print(m.findall(output))


This prints:

[alex@lancelot ba]$ python mure.py
[('1', 'A description that\ncould be multiple lines\n', ''), ('2', 'Another description\nalso potentially multiline\n', ''), ('3', 'But descriptions can also be singleline\n', ''), ('4', 'although they\nneed not\nbe\n', '')]
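
To see why both pieces matter, here is a small sketch comparing three variants of the pattern. The sample text and the variable names are made up for illustration; only the last pattern has the same shape as the one used in this message.

```python
import re

text = "--- 1\nfirst\nentry\n--- 2\nsecond\n"
flags = re.MULTILINE | re.DOTALL

# Greedy .* with DOTALL runs to the end of the input, so a single
# match swallows every later entry.
greedy = re.compile(r'^--- (\d+)\n(.*)', flags)

# Non-greedy .*? with nothing after it matches as little as possible,
# which is the empty string.
lazy_only = re.compile(r'^--- (\d+)\n(.*?)', flags)

# Non-greedy plus a lookahead (or end of input) stops just before the
# next header without consuming it.
lazy_lookahead = re.compile(r'^--- (\d+)\n(.*?)(?:(?=--- \d+\n)|\Z)', flags)

greedy_result = greedy.findall(text)
lazy_only_result = lazy_only.findall(text)
lazy_lookahead_result = lazy_lookahead.findall(text)
```

The greedy version takes too much and the bare non-greedy one takes too little; the lookahead gives the non-greedy group a place to stop.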

If the empty group '' at the end of each tuple is a bother, you can
use non-capturing parentheses, (?:...), instead:

m = re.compile(r'^--- (\d+)\n(.*?)(?:(?=--- \d+\n)|\Z)',
    re.MULTILINE | re.DOTALL)

and now:

[alex@lancelot ba]$ python mure.py
[('1', 'A description that\ncould be multiple lines\n'), ('2', 'Another description\nalso potentially multiline\n'), ('3', 'But descriptions can also be singleline\n'), ('4', 'although they\nneed not\nbe\n')]
[alex@lancelot ba]$
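
A different single-pass route, not the one shown above, may also be worth knowing: when the separator pattern contains a capturing group, split() returns the captured text in between the pieces. The following is only a sketch on a shortened sample; the names parts and entries are made up for illustration.

```python
import re

output = (
    "--- 1\n"
    "A description that\n"
    "could be multiple lines\n"
    "--- 2\n"
    "Another description\n"
)

# The group around \d+ makes split() keep each number in the result:
# ['', '1', <description 1>, '2', <description 2>]
header = re.compile(r'^--- (\d+)\n', re.MULTILINE)
parts = header.split(output)

# parts[0] is whatever precedes the first header (empty here); after
# that, numbers and descriptions alternate.
entries = list(zip(parts[1::2], parts[2::2]))
```

This assumes the input starts with a header line; any text before the first header ends up in parts[0] and is dropped by the slicing.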


One more detail: get into the habit of using r'...' notation (raw
string literals) for RE pattern strings. Your second pattern,
"^--- \d+$", works only because \d happens not to be a string escape
sequence; one day some \ escape you didn't know or think about will
produce surprising results.
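
A minimal sketch of the kind of surprise meant here, using \b, which is a backspace character in an ordinary string literal but a word boundary in a regular expression. The sample text and variable names are made up for illustration.

```python
import re

text = "a word here"

# Ordinary literal: \b becomes the backspace character (0x08) before
# the re module ever sees it, so this searches for backspaces around
# "word" and finds nothing.
plain = re.findall('\bword\b', text)

# Raw literal: the backslash reaches the regex engine intact, where
# \b means "word boundary".
raw = re.findall(r'\bword\b', text)
```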


Alex



