re.findall() is skipping matching characters
ignacio at openservices.net
Mon Oct 15 23:51:22 CEST 2001
On Mon, 15 Oct 2001, Gustaf Liljegren wrote:
> Thanks for helping me out with matching/searching before. Unfortunately,
> the example I gave was a little too basic, so I need some more help.
> >>> re.search(r'<(a)', '<a href="page.html">').group()
> The search() function matches the full expression: both the '<' and the
> '(a)', which is short for an alternation between more HTML elements. The
> match() function behaves like this too:
> >>> re.match(r'<(a)', '<a href="page.html">').group()
> But look what happens when I use the findall() function:
> >>> re.findall(r'<(a)', '<a href="page.html">')
> Why does findall() skip the '<'? I want to sort out full strings like '<a
> href="page.html">' or '<area ... href="page.html">' and put them in a list.
> I imagine the full regex should look something like this according to
> today's standards:
> re_link = re.compile(r'<(a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
> Where's the problem?
It's because <match>.group() takes an optional parameter specifying which
subgroup to return, defaulting to 0, which specifies the entire match. Pass a
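
To illustrate the difference, here is a minimal sketch. It assumes the
standard re module; the sample strings and the variable names (tag, whole,
first, captured, links) are made up for the example. The non-capturing group
(?:...) in the last pattern is one possible way to get whole tags back from
findall(), not necessarily the fix the reply above was heading toward.

```python
import re

tag = '<a href="page.html">'

# group() with no argument is group(0): the entire match.
m = re.search(r'<(a)', tag)
whole = m.group()    # '<a'
first = m.group(1)   # 'a', the first parenthesized subgroup only

# When the pattern contains exactly one group, findall() returns a list
# of that group's contents rather than the entire match.
captured = re.findall(r'<(a)', tag)   # ['a']

# Assumption: making the group non-capturing with (?:...) leaves the
# pattern without groups, so findall() returns the full matched strings.
re_link = re.compile(r'<(?:a|area)\s[^>]*href[^>]*/?>', re.I | re.M)
sample = '<a href="page.html"> text <area shape="rect" href="x.html">'
links = re_link.findall(sample)
```

With the sample string above, links should hold both complete tags, while
captured holds only the text inside the parentheses.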
Ignacio Vazquez-Abrams <ignacio at openservices.net>
"As far as I can tell / It doesn't matter who you are /
If you can believe there's something worth fighting for."
- "Parade", Garbage