Undocumented regex behaviour in re module
Dave Cole
djc at object-craft.com.au
Tue Jul 4 23:18:30 EDT 2000
Consider the following program:
- - re_test.py - - - - - - - - - - - - - - - - - - - - - - - - - -
import re
text = '''<html>
<head>
<title>Browse <?table?></title>
</head>
<body>
<h1>Browse <?table?></h1>
<?browse start=0 num=25?>
</body>
</html>
'''
tags = re.compile(r'<\?(\w+)(\s+\w+=\w+)*\?>', re.I | re.M)
match = tags.search(text)
while match:
    print(text[match.start():match.end()], '=>', match.groups())
    match = tags.search(text, match.end())
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The idea here is that I want to be able to extract the special tag
names and their attributes. Everything works fine, except for the tag:
<?browse start=0 num=25?>
The match object only saves the last attr=value matched. The only way
that I can think of to get all of the attr=value pairs returned is to
change the regex to:
<\?(\w+)((?:\s+\w+=\w+)*)\?>
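To make the difference concrete, here is a minimal sketch (written for Python 3) that runs both patterns against the problem tag. The names `tag`, `repeated` and `wrapped` are chosen here purely for illustration.

```python
import re

tag = '<?browse start=0 num=25?>'

# Original pattern: the capturing group is repeated by '*', and each
# repetition overwrites the previous capture, so only the last
# attr=value survives in the match object.
repeated = re.compile(r'<\?(\w+)(\s+\w+=\w+)*\?>')

# Revised pattern: the repetition happens inside a non-capturing group
# (?:...), and one outer capturing group wraps the whole run.
wrapped = re.compile(r'<\?(\w+)((?:\s+\w+=\w+)*)\?>')

print(repeated.search(tag).groups())  # ('browse', ' num=25')
print(wrapped.search(tag).groups())   # ('browse', ' start=0 num=25')
```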
Unfortunately, that regex is not as useful as I would like, since the
second group comes back as a single string which needs further
processing: ' start=0 num=25'
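The further processing mentioned above could be a second regex pass over the captured attribute string. The following is only a sketch under that assumption, written for Python 3. The helper `parse_tag` and the `attr` pattern are hypothetical names introduced here, not part of the re module.

```python
import re

# Outer pattern: group 1 is the tag name, group 2 the whole attribute run.
tags = re.compile(r'<\?(\w+)((?:\s+\w+=\w+)*)\?>', re.I)
# Inner pattern (an assumption for this sketch): one attr=value pair.
attr = re.compile(r'(\w+)=(\w+)')

def parse_tag(match):
    """Return (tag name, list of (attr, value) pairs) for one match of `tags`."""
    name, attr_text = match.groups()
    # findall returns one (attr, value) tuple per pair; an empty
    # attribute run simply yields an empty list.
    return name, attr.findall(attr_text)

text = '<h1>Browse <?table?></h1>\n<?browse start=0 num=25?>\n'
results = [parse_tag(m) for m in tags.finditer(text)]
for name, attrs in results:
    print(name, '=>', attrs)
```

This costs a second scan of each tag's attribute text, but it keeps the outer pattern unchanged and returns every pair rather than only the last one.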
Any hint / explanation at this point would be gratefully accepted.
- Dave