Undocumented regex behaviour in re module

Dave Cole djc at object-craft.com.au
Tue Jul 4 23:18:30 EDT 2000


Consider the following program:
- - re_test.py - - - - - - - - - - - - - - - - - - - - - - - - - -
import re

text = '''<html>
  <head>
    <title>Browse <?table?></title>
  </head>
  <body>
    <h1>Browse <?table?></h1>

      <?browse start=0 num=25?>

  </body>
</html>
'''

tags = re.compile(r'<\?(\w+)(\s+\w+=\w+)*\?>', re.I | re.M)
match = tags.search(text)
while match:
    print(match.group(0), '=>', match.groups())
    match = tags.search(text, match.end())
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

The idea here is that I want to be able to extract the special tag
names and their attributes.  Everything works fine, except for the tag:

        <?browse start=0 num=25?>

The match object only saves the last attr=value pair matched by the
repeated group.  The only way that I can think of to get all of the
attr=value pairs returned is to change the regex to:

        <\?(\w+)((?:\s+\w+=\w+)*)\?>

Unfortunately, that is not as useful as I would like, since the second
group is then returned as a single string which needs further
processing: ' start=0 num=25'
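
For what it is worth, the further processing could look roughly like
the sketch below.  It keeps the modified regex from above and runs a
second pattern over the captured attribute string.  The names attr,
parse_tags and sample are made up for illustration, and it assumes an
re module that provides finditer(); it is one possible approach, not
the only one.

```python
import re

# Outer pattern from the post: tag name, then the whole attribute run
# captured as a single string by wrapping a non-capturing repeat.
tags = re.compile(r'<\?(\w+)((?:\s+\w+=\w+)*)\?>')

# Hypothetical second pattern (not in the original post): one attr=value pair.
attr = re.compile(r'(\w+)=(\w+)')


def parse_tags(text):
    """Return a list of (tag_name, [(attr, value), ...]) tuples."""
    result = []
    for match in tags.finditer(text):
        name, attr_text = match.groups()
        # findall() with two groups yields a list of (attr, value) tuples;
        # an empty attribute string simply yields an empty list.
        result.append((name, attr.findall(attr_text)))
    return result


sample = '<h1>Browse <?table?></h1> <?browse start=0 num=25?>'
parsed = parse_tags(sample)
```

Here parsed holds one entry per special tag, with the attributes
already split into pairs, so a repeated capturing group is not needed.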

Any hint / explanation at this point would be gratefully accepted.

- Dave


