Regexp finditer() fails to match some non-overlapping matches?

Philip Jägenstedt philipj at telia.com
Sat May 3 14:36:07 EDT 2003


Hello.

I'll start out by giving you some ugly code to look at.

import re

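# Each alternative is a named group; re.VERBOSE lets the pattern span lines.
pattern = r"""
(?P<lf>\r\n?|\n) |
(?P<list>^[*#]*) |
(?P<bold>'{2}(?P<bold_text>.+?)'{2})
"""
rules = re.compile(pattern, re.MULTILINE | re.VERBOSE)

# Sample input with bold markup at the start of a line and some list lines.
text = """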
''bold text''

*Bullet list
*bullet ''no 2''
**Multiple levels
*#And mixing types too!
"""

This is the issue: I expect the regexp to match the <list> group at
the beginning of each line, and if there are no * or # characters
(which I use as markup for lists) there should be a zero-length match.
I've removed most other groups and use the <bold> group to illustrate
where the problem arises. If a line begins with bold text markup such
as "''bold text''", the markup is not matched once the <list> group
has produced its zero-length match at the start of that line. If,
however, I change the <list> group to "(?P<list>^[*#]+)" (+ instead
of *), the <bold> group does match, because there is no zero-length
match before it.

The same effect can be seen from this short script:

#!/usr/bin/python

import re

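text = "__Bullet lists__"

# Two alternatives: a zero-length match at the start, or __...__ markup.
pattern = r"^|__.+__"

rules = re.compile(pattern)

# Print the span and the text of every match that finditer() reports.
for m in rules.finditer(text):
    print m.start(), m.end(), m.group()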


In this case, the string "__Bullet lists__" will not be matched,
because there is the zero-length match before it.
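
For comparison, here is a minimal sketch that runs the * and + forms of the list alternative side by side on the same string. The names star, plus, star_spans and plus_spans are made up for this illustration. The only results I would rely on are that the + form reports the markup and that the * form reports the zero-length match first; what finditer() reports after that zero-length match may depend on the Python version.

```python
import re

text = "__bold__"

# Variant A: the list alternative may match the empty string.
star = re.compile(r"^[*#]*|__.+__")
# Variant B: the list alternative needs at least one marker character.
plus = re.compile(r"^[*#]+|__.+__")

# Collect (start, end, matched text) for every match finditer() reports.
star_spans = [(m.start(), m.end(), m.group()) for m in star.finditer(text)]
plus_spans = [(m.start(), m.end(), m.group()) for m in plus.finditer(text)]

# The first entry for the * variant is the zero-length match at 0-0.
print("star: %r" % (star_spans,))
# The + variant has no zero-length match, so the markup at 0-8 is found.
print("plus: %r" % (plus_spans,))
```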

So what it boils down to is: why doesn't finditer() match both the
beginning of the line, and some other thing that lives at the
beginning of the line?

For example:

|_|_|b|o|l|d|_|_|
0 1 2 3 4 5 6 7 8

I'd like to have a zero-length match (0-0), since there are no # or *
characters, and then the 0-8 match for __bold__. But I cannot see how
to get finditer() to do it.
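
To make the wanted result concrete, here is a rough sketch that produces it by side-stepping the problem rather than solving it. The list prefix is matched on its own with match(), and finditer() is used only for the inline markup that follows. LIST_RE, BOLD_RE and tokenize_line are hypothetical names invented for this illustration. The sketch handles one line at a time and uses a non-greedy __...__ pattern, so it is an assumption about what a workaround could look like, not a way of doing it with one combined pattern.

```python
import re

# Hypothetical helper patterns, not part of the combined pattern in this post.
LIST_RE = re.compile(r"[*#]*")      # list prefix; may match the empty string
BOLD_RE = re.compile(r"__(.+?)__")  # inline markup, non-greedy

def tokenize_line(line):
    """Return (kind, start, end, text) tuples for a single line."""
    tokens = []
    # match() anchors at position 0 and always succeeds here,
    # possibly with a zero-length result.
    prefix = LIST_RE.match(line)
    tokens.append(("list", prefix.start(), prefix.end(), prefix.group()))
    # Scan for inline markup starting right after the list prefix.
    for m in BOLD_RE.finditer(line, prefix.end()):
        tokens.append(("bold", m.start(), m.end(), m.group()))
    return tokens

print(tokenize_line("__bold__"))
# [('list', 0, 0, ''), ('bold', 0, 8, '__bold__')]
```

Because the zero-length prefix and the markup come from two separate calls, neither match can hide the other, at the cost of splitting the input into lines first.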

I have the problem using Debian GNU/Linux testing, with Python 2.2.1.
If any other information is needed, do ask!

Thanks for any help you can offer!



