hierarchical named groups extension to the re library

I've written an extension to the re library, to provide a more complete matching of hierarchical named groups in regular expressions. I've set up a sourceforge project for it: http://pyre2.sourceforge.net/

re2 extracts a hierarchy of named-group matches from a string, rather than the flat, incomplete dictionary that the standard re module returns. (i.e. the re library only returns the ~last~ match for named groups, not a list of ~all~ the matches for the named groups. And the hierarchy of those named groups is non-existent in the flat dictionary of matches that results.)

e.g.
    import re
    buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
    regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
    pat1=re.compile(regex)
    m=pat1.match(buf)
    m.groupdict()
    {'verse': '10 lords a-leaping', 'number': '10', 'activity': 'lords a-leaping'}

    import re2
    buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
    regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
    pat2=re2.compile(regex)
    x=pat2.extract(buf)
    x
    {'verse': [{'number': '12', 'activity': 'drummers drumming'}, {'number': '11', 'activity': 'pipers piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
(See http://pyre2.sourceforge.net/ for more details.)

I am wondering what would be the best direction to take this project in.

Firstly, is it (or can it be made) useful enough to be included in the python stdlib? (i.e. should I bother writing a PEP for it?)

And if so, would it be best to merge its functionality in with the re library, or to leave it as a separate module?

And also, are there any suggestions/criticisms on the library itself?
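For comparison, here is a minimal sketch of the same contrast using only the standard re module. It first shows that groupdict() keeps just the last repetition, then rebuilds the list of verses by scanning with a separate per-verse pattern. The helper name extract_verses and the per-verse pattern are assumptions made for illustration; they are not part of re or re2, and this is not how re2 itself works.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# The full pattern from the message above: the repeated group matches
# three times, but groupdict() only reports the last repetition.
full = re.compile(r'^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$')
last_only = full.match(buf).groupdict()

# Hypothetical helper (not part of re or re2): scan with a per-verse
# pattern and collect one dictionary per repetition.
per_verse = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')

def extract_verses(text):
    return {'verse': [m.groupdict() for m in per_verse.finditer(text)]}

all_verses = extract_verses(buf)
print(last_only)
print(all_verses)
```

This workaround only covers one level of nesting and needs a second, hand-written pattern, which is the gap the extension is meant to close.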

ottrey@py.redsoft.be wrote:
> import re2
> buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> pat2=re2.compile(regex)
> x=pat2.extract(buf)
> x
> {'verse': [{'number': '12', 'activity': 'drummers drumming'}, {'number': '11', 'activity': 'pipers piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}

Is a dictionary the right container, or should another class be used? Because in the example the content of the "verse" group itself is lost; only its sub-groups are kept. Something like a hierarchical MatchObject could provide access to both pieces of information, the sub-groups and the group itself.

Also, should it be limited to named groups?

> I am wondering what would be the best direction to take this project in.
> Firstly, is it (or can it be made) useful enough to be included in the python stdlib? (i.e. should I bother writing a PEP for it?)
> And if so, would it be best to merge its functionality in with the re library, or to leave it as a separate module?
> And also, are there any suggestions/criticisms on the library itself?

I find the feature very interesting, but being used to living without it, I have difficulty evaluating its usefulness. However, it reminds me how strange I found it at first that only the last match was kept, so I think, FWIW, that from a purist point of view the functionality would make sense in the stdlib in some way or another.

Regards,
Nicolas

Nicolas Fleury <nidoizo@yahoo.com> wrote:
> ottrey@py.redsoft.be wrote:
> > import re2
> > buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> > regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> > pat2=re2.compile(regex)
> > x=pat2.extract(buf)

If one wanted to match the API of the re module, one should use pat2.findall(buf), which would return a list of 'hierarchical match objects', though with the above, one should really return a list of 'verse' items (the way the regular expression is written).

> > x
> > {'verse': [{'number': '12', 'activity': 'drummers drumming'}, {'number': '11', 'activity': 'pipers piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
>
> Is a dictionary the right container, or should another class be used? Because in the example the content of the "verse" group itself is lost; only its sub-groups are kept. Something like a hierarchical MatchObject could provide access to both pieces of information, the sub-groups and the group itself.

Its contents are not lost, look at the overall dictionary... In any case, I think one can do better than a dictionary.

    x=pat2.match(buf) #or x=pat2.findall(buf)[0]
    x
    '12 drummers drumming,'
    dir(x)
    ['verse']
    x.verse
    '12 drummers drumming,'
    dir(x.verse)
    ['number', 'activity']
    x.verse.number
    '12'
    x.verse.activity
    'drummers drumming'

...would get my vote (or using obj.group(i) semantics I discuss below). I notice that this is basically what the re2 module already does (having read the web page), though rather than...

    pat2.extract(buf).verse[1].activity
    'pipers piping'

I would prefer...

    pat2.findall(buf)[1].verse.activity
    'pipers piping'

For .verse[1] or .verse[2] to make sense, it implies that the pattern is something like...

    ((?P<verse>... )(?P<verse>...))

...which it isn't. I understand that the decision was probably made to make it similar to the case of...

    ((?P<foo>... (?P<goo>...)+))

...where multiple matches for goo would require x.foo.goo[i].

> Also, should it be limited to named groups?

Probably not. I would suggest using matchobj.group(i) semantics to match the standard re module semantics, though only allow returning items in the current level of the hierarchy. That is, one could use x.verse.group(1) and get back '12', but x.group(1) would return '12 pipers piping'

> > I am wondering what would be the best direction to take this project in.
> > Firstly, is it (or can it be made) useful enough to be included in the python stdlib? (i.e. should I bother writing a PEP for it?)
> > And if so, would it be best to merge its functionality in with the re library, or to leave it as a separate module?
> > And also, are there any suggestions/criticisms on the library itself?
>
> I find the feature very interesting, but being used to living without it, I have difficulty evaluating its usefulness. However, it reminds me how strange I found it at first that only the last match was kept, so I think, FWIW, that from a purist point of view the functionality would make sense in the stdlib in some way or another.

re2 can be used as a limited structural parser. This makes the re module useful for more things than it is currently. The question of it being in the standard library, however, I think should be decided based on the criteria used previously (whatever they were).

 - Josiah
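
The attribute-style access suggested in the message above (x.verse.number, and findall returning one object per verse) could be sketched roughly as follows. Everything here is hypothetical: the Node class, the find_verses helper and the per-verse pattern are illustrative names built on the standard re module, not the re2 API and not a proposed implementation.

```python
import re


class Node(str):
    """Hypothetical hierarchical match node (not the re2 API).

    Behaves as the matched text; sub-groups are reachable as attributes.
    """

    def __new__(cls, text, **children):
        node = super().__new__(cls, text)
        node._children = children
        return node

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        children = self.__dict__.get('_children', {})
        if name in children:
            return children[name]
        raise AttributeError(name)


# Illustrative per-verse pattern, adapted from the thread's example.
VERSE = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')


def find_verses(text):
    """Return one Node per verse, each exposing a .verse sub-node."""
    results = []
    for m in VERSE.finditer(text):
        subgroups = {name: Node(value) for name, value in m.groupdict().items()}
        results.append(Node(m.group(0), verse=Node(m.group(0), **subgroups)))
    return results


buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
matches = find_verses(buf)
print(matches[1].verse.activity)
print(matches[0].verse.number)
```

One consequence of this design is that a group named after an existing str method (for example "count") would be shadowed by that method, because __getattr__ only runs when the normal lookup fails.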

Josiah Carlson wrote:
> Nicolas Fleury <nidoizo@yahoo.com> wrote:
> > ottrey@py.redsoft.be wrote:
> > > import re2
> > > buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> > > regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> > > pat2=re2.compile(regex)
> > > x=pat2.extract(buf)
>
> If one wanted to match the API of the re module, one should use pat2.findall(buf), which would return a list of 'hierarchical match objects', though with the above, one should really return a list of 'verse' items (the way the regular expression is written).

As far as I can understand, the two are orthogonal. findall is used to match the regular expression multiple times; in that case the regular expression is still matched only once.
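
The distinction drawn here can be checked with the standard re module alone. In this small sketch the anchored pattern from the thread yields a single match no matter how often its inner group repeats, while an unanchored per-verse pattern (an illustrative variant, not taken from the thread) is matched three separate times by findall.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

anchored = re.compile(r'^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$')
per_verse = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')

# The anchored pattern consumes the whole string in one match, so there
# is only one match object, holding the last repetition of each group.
whole = list(anchored.finditer(buf))

# The per-verse pattern is matched once per verse; with two groups,
# findall returns one (number, activity) tuple per match.
pieces = per_verse.findall(buf)

print(len(whole), whole[0].group('number'))
print(pieces)
```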

> > > {'verse': [{'number': '12', 'activity': 'drummers drumming'}, {'number': '11', 'activity': 'pipers piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
> >
> > Is a dictionary the right container, or should another class be used? Because in the example the content of the "verse" group itself is lost; only its sub-groups are kept. Something like a hierarchical MatchObject could provide access to both pieces of information, the sub-groups and the group itself.
>
> Its contents are not lost, look at the overall dictionary... In any case, I think one can do better than a dictionary.

In that specific example, I meant that the space between "10" and "lords a-leaping" was not stored in the dictionary, unless you are talking about the dictionary from re instead of re2. Your proposal fixes that, by making the entire content of the parent group (verse) accessible.

> x=pat2.match(buf) #or x=pat2.findall(buf)[0]
> x
> '12 drummers drumming,'
> dir(x)
> ['verse']
> x.verse
> '12 drummers drumming,'

It is very easy to use, but I doubt it is a good idea as a return value for match (maybe a match object could have a function to return this easy-to-use object). It would mean that the names of the groups are limited by the interface of the match object returned (what would happen if a group is named "start", "end" or simply "group"?). Another solution is to use x["verse"] instead (or continue to use a "group" method).

> > Also, should it be limited to named groups?
>
> Probably not. I would suggest using matchobj.group(i) semantics to match the standard re module semantics, though only allow returning items in the current level of the hierarchy. That is, one could use x.verse.group(1) and get back '12', but x.group(1) would return '12 pipers piping'

Totally agree that matchobj.group interface should be matched. Should group return another match object? Or maybe another function to get match objects of groups? Something like:

    x.groupobj("verse").group("number")

or

    str(x["verse"]["number"])

Regards,
Nicolas
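
The naming concern raised in this message can be illustrated with the standard re module as it exists today: groups may legally be called "start" or "end", and attribute access then reaches the match object's methods rather than the groups. The pattern and input below are made up for the illustration. Subscript access on match objects (m['end']) is a standard feature from Python 3.6 onward, which is later than this thread.

```python
import re

# Group names chosen on purpose to collide with match-object methods.
m = re.match(r'(?P<start>\d+)-(?P<end>\d+)', '3-7')

# Attribute access finds the method, not the group.
start_is_method = callable(m.start)

# Name-based lookups stay unambiguous.
by_method = m.group('start')       # text of the group named "start"
by_subscript = m['end']            # needs Python 3.6+
offset_of_start = m.start('start')  # offset where that group begins

print(start_is_method, by_method, by_subscript, offset_of_start)
```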

Josiah Carlson wrote:
> re2 can be used as a limited structural parser. This makes the re module useful for more things than it is currently. The question of it being in the standard library, however, I think should be decided based on the criteria used previously (whatever they were).

In general, if developers can readily agree that a functionality should be added (i.e. it is "obvious" for some reason), it is added right away. Otherwise, a PEP should be written, and reviewed by the community.

In the specific case, Chris Ottrey submitted a link to his project to the SF patches tracker, asking for inclusion. I felt that there is likely no immediate agreement, and suggested that he ask on python-dev and write a PEP.

If this kind of functionality would fall on immediate rejection for some reason, even writing the PEP might be pointless. If the functionality is generally considered useful, a PEP can be written, and then implemented according to the PEP procedures (i.e. collect feedback, discuss alternatives, ask for BDFL pronouncement).

I personally think that the proposed functionality should *not* live in a separate module, but somehow be integrated into SRE. Whether or not the proposed functionality is useful in the first place, I don't know. I never have nested named groups in my regular expressions.

Regards,
Martin

At 08:48 AM 4/3/05 +0200, Martin v. Löwis wrote:
> I personally think that the proposed functionality should *not* live in a separate module, but somehow be integrated into SRE.

+1.

> Whether or not the proposed functionality is useful in the first place, I don't know. I never have nested named groups in my regular expressions.

Neither have I, but only because it doesn't do what re2 does. :) I'd like to suggest that the addition also allow you to match a group by a named reference, thus allowing a complete grammar to be formed. Of course, I don't know if the underlying regular expression engine could actually do that, but it would be nice if it could, since it would allow simple grammars to be more easily parsed without recourse to a more complex parsing module.
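
The standard re module has no way to refer to another group's sub-pattern by name, so the idea above cannot be shown directly with it. As a rough approximation only, named rules can be composed by textual substitution before compiling. The rule table, the "{name}" reference syntax and the expand helper are all hypothetical, and this sketch does not provide the engine-level support the message asks for.

```python
import re

# Hypothetical rule table: "{name}" refers to another rule by name.
RULES = {
    'number': r'\d+',
    'activity': r'[^,]+',
    'verse': r'{number} {activity}',
    'song': r'{verse}(?:, {verse})*',
}


def expand(name, rules=RULES):
    """Expand "{name}" references by textual substitution.

    Limitations of this sketch: self-referential rules would recurse
    forever, and counted quantifiers such as {3} would be mistaken for
    references.
    """
    def substitute(ref):
        return '(?:' + expand(ref.group(1), rules) + ')'
    return re.sub(r'\{(\w+)\}', substitute, rules[name])


song = re.compile(expand('song'))
buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
print(expand('verse'))
print(bool(song.fullmatch(buf)))
```

Because the references are expanded into one flat pattern, recursive grammars remain out of reach with this approach.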

Greetings,

> If this kind of functionality would fall on immediate rejection for some reason, even writing the PEP might be pointless. If the [...]

In my opinion the functionality is useful.

> I personally think that the proposed functionality should *not* live in a separate module, but somehow be integrated into SRE. Whether or [...]

Agreed. I propose to integrate this functionality into the SRE syntax, so that this special kind of group may be used when explicitly wanted. This would avoid backward compatibility problems, would give each regular expression a single meaning, and would allow interleaving hierarchical/non-hierarchical groups.

I volunteer to integrate the change once we decide on the right way to implement it, and achieve consensus on its adoption.

Best regards,
--
Gustavo Niemeyer
http://niemeyer.net

On Sunday 03 April 2005 16:48, Martin v. Löwis wrote:
> If this kind of functionality would fall on immediate rejection for some reason, even writing the PEP might be pointless.

Note that even if something is rejected, the PEP itself is useful - it collects knowledge in a format that's far more accessible than searching the mailing list archives.

(note that I'm not talking about this particular case, but about PEPs in general - I have no opinion on the current proposal, because I'm not a heavy user of REs)

--
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.

<ottrey@py.redsoft.be> wrote:
> (i.e. the re library only returns the ~last~ match for named groups, not a list of ~all~ the matches for the named groups. And the hierarchy of those named groups is non-existent in the flat dictionary of matches that results.)

are you 100% sure that this can be implemented on top of other RE engines? (CPython isn't the only Python implementation out there.)

(generally speaking, trying to turn an RE engine into a parser is a lousy idea. the library would benefit more from a simple parser toolkit than it benefits from more non-standard and highly specialized RE hacks...)

</F>

participants (8)
- "Martin v. Löwis"
- Anthony Baxter
- Fredrik Lundh
- Gustavo Niemeyer
- Josiah Carlson
- Nicolas Fleury
- ottrey@py.redsoft.be
- Phillip J. Eby