re.findall() is skipping matching characters

Fredrik Lundh fredrik at
Mon Oct 15 23:21:21 CEST 2001

Gustaf Liljegren wrote:
> But look what happens when I use the findall() function:
> >>> re.findall(r'<(a)', '<a href="page.html">')
> ['a']
> Why does findall() skip the '<'? I want to sort out full strings like '<a
> href="page.html">' or '<area ... href="page.html">' and put them in a list.
> I imagine the full regex should look something like this according to
> today's standards:
> re_link = re.compile(r'<(a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
> Where's the problem?

from the re.findall documentation: "If one or more groups are
present in the pattern, return a list of groups; this will be a list
of tuples if the pattern has more than one group"

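your pattern contains a group, (a), so findall returns only what that
group matched ('a') rather than the whole match ('<a').

try using a non-capturing group instead: (?:x) instead of (x)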
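
a minimal sketch of the difference, using the strings from the question.
the variable names and the second test string (with the <area> tag) are
made up for illustration; the pattern is the one from the question with
its group made non-capturing:

```python
import re

tag = '<a href="page.html">'

# capturing group: findall returns only the text matched by the group
with_group = re.findall(r'<(a)', tag)
print(with_group)       # ['a']

# non-capturing group: findall returns the whole match
without_group = re.findall(r'<(?:a)', tag)
print(without_group)    # ['<a']

# the full pattern from the question, with (a|area) made non-capturing
re_link = re.compile(r'<(?:a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
html = '<p><a href="page.html">x</a> <area shape="rect" href="b.html">'
links = re_link.findall(html)
print(links)  # ['<a href="page.html">', '<area shape="rect" href="b.html">']
```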

or better, use the right tool for the task: sgmllib

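import sgmllib

class Parser(sgmllib.SGMLParser):
    def __init__(self):
        # initialise the base parser before adding our own state
        sgmllib.SGMLParser.__init__(self)
        self.hrefs = []
    def href(self, attrib):
        # attrib is a list of (name, value) pairs for the current tag
        for k, v in attrib:
            if k == "href":
                self.hrefs.append(v)
    # use the same handler for <a> and <area> tags
    do_a = do_area = href

p = Parser()

p.feed("some html text")
p.close()

print p.hrefs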
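
sgmllib only exists in Python 2; it was removed in Python 3. on a newer
interpreter, roughly the same idea could be sketched with html.parser
from the standard library. this is a swapped-in module, not the sgmllib
approach above, and the class name and sample html are made up for
illustration:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag in ("a", "area"):
            for k, v in attrs:
                if k == "href":
                    self.hrefs.append(v)

p = LinkParser()
p.feed('<a href="page.html">x</a><area shape="rect" href="map.html">')
p.close()

print(p.hrefs)  # ['page.html', 'map.html']
```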


<!-- (the eff-bot guide to) the python standard library:
