How do I get to *all* of the groups of an re search?
claird at lairds.com
Fri Jan 10 16:34:30 CET 2003
In article <oe23f-ta3.ln1 at news.lairds.org>,
Kyler Laird <Kyler at news.Lairds.org> wrote:
>Sure, I can do that but then I have to parse the second group
>again. In this case it's fairly trivial, but in my application
>there is a lot of junk in between each of the groups.
>The text I'm matching is more like this.
> <a href="foo.html">
> blah blah blah
> <img src="fooabc.jpg">
> blah blah
> <img src="foocde.jpg">
> more stuff
>I want [('foo', ['fooabc', 'foocde'])]. I have no problem with
>getting the RE to match everything. It's just getting to all of
>the matched groups that's stopping me.
>If I use the RE you gave, I'll end up with something like this.
> [('foo', ' blah blah blah <img src="fooabc.jpg"> blah blah <img
>That's going to require me to reprocess the second element. It's
>inefficient and ugly. Worse, it's not what I expected from the
>description in the documentation.
1. Harvey Thomas, in a nearby follow-up (how're
gateway propagation delays today?), has summarized
the main point far more aptly than anything I
wrote: "You can't return a variable number of
groups from a regex." (But can Perl people? They
apparently tried to cram cement mixers, kitchen
toasters, and turbojets inside their REs, so who
knows?)
2. Oh, what you *really* want is HTML parsing.
There are serious limits to RE's applicability
in that role, as the columnists of <URL: http://
assert. Get an HTML parser--then be ready to
tweak it to accept all the junk that roams
around in the wild.
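Both points can be sketched in a few lines. This is a rough
illustration, not a finished tool. It assumes a current Python 3
with the standard-library html.parser module; the names SAMPLE,
strip_ext, LinkImageCollector, and collect are made up for the
example. It also assumes each <img> belongs to the most recent
<a>, since the sample text never closes its anchor.

```python
import re
from html.parser import HTMLParser

# Sample text modelled on the quoted message (an assumption, not real data).
SAMPLE = '''<a href="foo.html">
blah blah blah
<img src="fooabc.jpg">
blah blah
<img src="foocde.jpg">
more stuff
'''

# Point 1: a group inside a repeated construct keeps only its LAST match,
# so one search cannot hand back every image name.
pattern = r'<a href="(\w+)\.html">(?:.*?<img src="(\w+)\.jpg">)+'
repeated = re.search(pattern, SAMPLE, re.S).groups()


def strip_ext(name):
    """Drop the final extension: 'foo.html' -> 'foo'."""
    return name.rsplit(".", 1)[0]


# Point 2: let an HTML parser find the tags, and do the grouping yourself.
class LinkImageCollector(HTMLParser):
    """Hypothetical helper: attach each <img> to the most recent <a>."""

    def __init__(self):
        super().__init__()
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Start a new (link, images) entry.
            self.results.append((strip_ext(attrs["href"]), []))
        elif tag == "img" and attrs.get("src") and self.results:
            # Add the image to the entry for the latest link seen.
            self.results[-1][1].append(strip_ext(attrs["src"]))


def collect(text):
    parser = LinkImageCollector()
    parser.feed(text)
    parser.close()
    return parser.results


print(repeated)         # ('foo', 'foocde')
print(collect(SAMPLE))  # [('foo', ['fooabc', 'foocde'])]
```

The regex half only shows why the second group comes back with a
single value; the parser half is where the list of images actually
gets built. Expect to adjust the grouping rule for real pages.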
Cameron Laird <Cameron at Lairds.com>