How do I get to *all* of the groups of an re search?

Andrew Dalke adalke at mindspring.com
Sat Jan 11 00:21:14 EST 2003


Kyler Laird wrote:
> Thank you.  Any idea how I should have known to find that from
> here?
> 	http://www.python.org/doc/current/lib/re-syntax.html

Well, the regexp syntax is separate from how a regexp engine
might use that syntax.  E.g., Martel uses the same syntax (or
nearabouts) but a totally different API for working with it.

Therefore, the syntax and the how-to-access-the-data shouldn't
go together.

> Yes, and that surprises me.  It seems so obvious that it
> should return all matched pieces and so arbitrary that it
> only returns the last one.

Consider what that API might be.  You want a list of
all matches for a group.  That wouldn't help me since I
want the tree structure.  Others (which is very nearly
everyone!) just want one match, so having to do group(1)[0]
all the time would get annoying.
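
To make the behaviour under discussion concrete, here is a minimal
sketch using only the standard re module.  The pattern and the input
string are made up for illustration; they are not from the original
question.

```python
import re

# The capturing group is matched three times, once per number,
# but the match object keeps only the final capture.
m = re.match(r"(?:(\d+),?)+", "1,2,45")

whole = m.group(0)   # everything the pattern consumed
last = m.group(1)    # only the last repetition of the group

print(whole)
print(last)
```

The whole match covers all three numbers, while group(1) holds just the
last one.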

(There could be another API to access multimatch, but then
consider the data overhead for



> It's not documented where a user is likely to look for the
> RE syntax.  The RE syntax page gives what appears to be a
> very straightforward explanation of groups.  I don't know
> why a beginning Python user would think to look elsewhere
> for some strange behavior.

And there's nothing there about how to use the compile function
nor the 'groups()' method nor ... It's syntax only.

It could add a comment that a backreference such as "\2" refers to
the most recently matched text of that group.  But I don't think
it's needed.

BTW, what do you think

   (([a-z]+) (\2[0-9]+)\s*)+

against

   spam spam5 eggs eggs45 45

should do?  If you want group 2 to return all matches then
does that mean \2 should refer to any of the previous matches
of the [a-z]+ term?  There is actually some coherency between
the current behaviour of .group(2) and \2.
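
For reference, a quick sketch of what the current re module does with
that pattern and string.  Nothing here is a proposal for a new API; it
only shows that \2 and .group(2) both track the most recent match of
the [a-z]+ term.

```python
import re

pattern = re.compile(r"(([a-z]+) (\2[0-9]+)\s*)+")
m = pattern.match("spam spam5 eggs eggs45 45")

# On each repetition \2 must equal the word just captured by group 2,
# so "spam5" pairs with "spam" and "eggs45" pairs with "eggs".
matched = m.group(0)
word = m.group(2)      # last capture of the [a-z]+ term
var = m.group(3)       # last capture of the \2[0-9]+ term

print(repr(matched))
print(word, var)
```

The trailing "45" is left unmatched because no word precedes it, and
the groups report the second (last) repetition.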



> Regardless, do you find it useful?  Can you think of any time
> when you want to match a bunch of things and just end up with
> the last one?

Hmm.  Okay, you wanted a way for findall to return your searches.
I find that inelegant because I would rather express my query
in a single regexp.  Semantically, re.compile(pattern).findall()
matches the same string as

    re.compile("(?:(?:%s).*)*" % pattern)

That's why my pattern in Martel was written as

    pattern = Martel.Re(r"(?P<word>[a-z]+)( (?P<var>(?P=word)[0-9]+))+")
    full_pattern = pattern + Martel.Rep(Martel.Str(" ") + pattern)

With Martel this is possible because it passes back the full
tree, and preserves the order between different matches.  For
you, if you did that, you would have had ... I don't know what
you would have had.

In your language, I don't find your solution useful.

The most common case is to get a single word, not a list of
matches.  Most cases which get a list of items can use the findall
method, as in

 >>> import re
 >>> s = "1,2,45,73,4,345"
 >>> re.findall(r"\d+", s)
['1', '2', '45', '73', '4', '345']
 >>>

This does match a bunch of things to get a list of items.
Syntactically it's very weak, but it captures most people's
needs.  What you want is pretty unusual.
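
If every repetition of a group is wanted with plain re, one workaround
is to apply the inner pattern repeatedly with finditer instead of
wrapping it in an outer (...)+.  This is only a sketch under that
assumption, reusing the spam/eggs example from earlier; it gives a flat
list of pairs, not the tree that Martel returns.

```python
import re

# The inner part of the repeated pattern, with its own backreference.
item = re.compile(r"([a-z]+) (\1[0-9]+)\s*")

text = "spam spam5 eggs eggs45 45"

# One tuple of groups per occurrence, in the order they appear.
pairs = [m.groups() for m in item.finditer(text)]

print(pairs)
```

Unlike the single anchored pattern, finditer will skip over text that
does not match, so it does not verify that the items are contiguous.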

Don't get me wrong, I was really annoyed that I needed to
write Martel to solve my problem.  I wanted the code to
already exist.  But what you want still wouldn't have
pleased me.


> I can solve it lots of ways.  I went for what I thought was
> going to be an elegant solution.  I'd like to have a tool
> that works the way I expected the re module to work.

Martel?  :)

>>Show me a module besides Martel which lets you get access to
>>the parse tree.  I looked at about a dozen packages, read
>>through Friedl's 1st edition book, and posted to various newsgroups
>>looking for one.
> 
> 
> I'm not at all interested in how popular the solution is.

The thing is, few people need what you want.  It isn't popular,
which makes it less likely to exist.

So re doesn't do what you want and is unlikely to add
support for it any time soon.

					Andrew
					dalke at dalkescientific.com




