regular expressions: grabbing variables from multiple matches

Alex Martelli aleaxit at yahoo.com
Thu Jan 4 06:28:32 EST 2001


"Heather Lynn White" <hwhite at chiliad.com> wrote in message
news:mailman.978567423.22249.python-list at python.org...
>
> Suppose I have a regular expression to grab all variations on a meta tag,
> and I will want to extract from any matches the name and content values
> for this tag.
>
> I use the following re
>
> MetaTag=re.compile(
> r'''<\s*?(meta|META)\s*?=\s*?"(?P<name>.*?)"\s*?(content|CONTENT)\s*?=\s*?"(?P<content>.*?)"\s*?>'''
> )

Key issue here: there are some groups that are not of interest (the first
and third ones, unnamed) and some that are (second and fourth, also named).


> now suppose I have an html document and I want to iterate through all the
> meta tags in that document. If I only catch one, I would say
>
> matches=MetaTag.match(body)

You need to use *search*, NOT *match*, unless you want your
search anchored at the start (i.e., body starts right with
the leading '<' of the metatag you're looking for).
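To make the difference concrete, here is a minimal sketch, assuming Python 3 (so print is a function) and a deliberately simplified, hypothetical pattern standing in for the full meta-tag regex. The names `pattern`, `anchored` and `found` are illustrative only:

```python
import re

# Simplified, hypothetical pattern standing in for the meta-tag regex.
pattern = re.compile(r'<meta="(?P<name>.*?)"')

# The tag is NOT at the very start of the string.
body = 'leading text\n<meta="pippo" content="pluto">'

# match() only tries to match at the start of the string.
anchored = pattern.match(body)

# search() scans forward and returns the first match found anywhere.
found = pattern.search(body)

print(anchored)             # None
print(found.group("name"))  # pippo
```
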

> if matches:
>     flds=matches.groupdict()
>     name=flds["name"]
>     content=flds["content"]
>     print name, content
>
> but this does not work if I use instead findall, to get multiple matches,
> because findall returns a list of matches rather than a list of match
> objects, unlike all the other functions.  Is there a way to extract these
> variables in the way I have done above, but with many matches?

_Almost_, but you lose the named-fields convenience.  If you run
MetaTag.findall(body) for some 'body' such as, say:

body = '''
<meta="pippo" content="pluto">
irrelevant stuff
<meta="foo" content="bar">
'''

you'll see findall returns:

[('meta', 'pippo', 'content', 'pluto'), ('meta', 'foo', 'content', 'bar')]

i.e., a list of tuples, each tuple corresponding to a match; items
of the tuple correspond to the groups you have defined in your re
(by using parentheses), whether you named them or not -- the naming
itself, however, is not preserved.
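The names are not carried into the tuples, but in current Python the compiled pattern still records which group number each name refers to, via its groupindex attribute. A minimal sketch, assuming Python 3; the helper names `rows`, `name_at`, `content_at` and `pairs` are mine, not part of the re module:

```python
import re

MetaTag = re.compile(
    r'''<\s*?(meta|META)\s*?=\s*?"(?P<name>.*?)"\s*?(content|CONTENT)\s*?=\s*?"(?P<content>.*?)"\s*?>'''
)

body = '''
<meta="pippo" content="pluto">
irrelevant stuff
<meta="foo" content="bar">
'''

# One tuple per match, one item per group, named or not.
rows = MetaTag.findall(body)

# groupindex maps each group name to its 1-based group number,
# so subtract 1 to index into the 0-based tuples from findall.
name_at = MetaTag.groupindex["name"] - 1
content_at = MetaTag.groupindex["content"] - 1

pairs = [(row[name_at], row[content_at]) for row in rows]
print(rows)
print(pairs)
```

This keeps findall's one-call convenience while avoiding hard-coded positions, at the cost of one extra lookup per name.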

Note that you could restrict the items in the tuples to those you
care about by NOT using capturing parentheses (which define groups)
for the rest; use (?:...) non-capturing parentheses if you choose
this route.  I.e., change your metatag re to:

MetaTag1=re.compile(
r'''<\s*?(?:meta|META)\s*?=\s*?"(?P<name>.*?)"\s*?(?:content|CONTENT)\s*?=\s*?"(?P<content>.*?)"\s*?>'''
)

and findall will return just [('pippo', 'pluto'), ('foo', 'bar')].

Then, you could elegantly loop over the name/content pairs found:

for name, content in MetaTag1.findall(body):
    print name, content

If you need to keep grouping-parentheses, you'll just have to
'swallow' the resulting not-very-interesting parts of each
match, e.g. using your original definition of MetaTag:

for junk1, name, junk2, content in MetaTag.findall(body):
    print name, content


Unfortunately, I know of no way to preserve the named nature
of your groups, while 'mass-searching' the string for all
non-overlapping matches; if you DO absolutely need the
groups to be accessed by-name rather than by-order in the
tuples findall returns, you'll have to loop differently,
restarting from the end of each match, which is, alas,
a bit more complicated:

pos = 0
while 1:
    # search, not match: the next tag rarely starts exactly at pos
    match = MetaTag.search(body, pos)
    if match is None: break
    pos = match.end()
    flds = match.groupdict()
    name = flds["name"]
    content = flds["content"]
    print name, content
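Python releases from 2.2 onward, i.e. later than the one this post was written against, provide finditer on compiled patterns, which yields one match object per non-overlapping match and so keeps the named groups without manual position bookkeeping. A minimal sketch, assuming Python 3; the list `found` is only there to collect results for inspection:

```python
import re

MetaTag = re.compile(
    r'''<\s*?(meta|META)\s*?=\s*?"(?P<name>.*?)"\s*?(content|CONTENT)\s*?=\s*?"(?P<content>.*?)"\s*?>'''
)

body = '''
<meta="pippo" content="pluto">
irrelevant stuff
<meta="foo" content="bar">
'''

found = []
# finditer scans the whole string and yields match objects lazily.
for match in MetaTag.finditer(body):
    flds = match.groupdict()  # only the named groups: name, content
    found.append((flds["name"], flds["content"]))
    print(flds["name"], flds["content"])
```
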

I think that, in general, the by-position access technique
will prove preferable.  Still, each of these approaches
may come in useful at times.


Alex





