[Python-Dev] Re: hierarchicial named groups extension to the re library

Sun Apr 3 01:01:57 CEST 2005

Nicolas Fleury <nidoizo at yahoo.com> wrote:
> 
> ottrey at py.redsoft.be wrote:
> >>>>import re2
> >>>>buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> >>>>regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> >>>>pat2=re2.compile(regex)
> >>>>x=pat2.extract(buf)

If one wanted to match the API of the re module, one should use
pat2.findall(buf), which would return a list of 'hierarchical match
objects', though with the above, one should really return a list of
'verse' items (the way the regular expression is written).

> >>>>x
> > 
> > {'verse': [{'number': '12', 'activity': 'drummers
> > drumming'}, {'number': '11', 'activity': 'pipers
> > piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
> 
> Is a dictionary the good container or should another class be used? 
> Because in the example the content of the "verse" group is lost, 
> excluding its sub-groups.  Something like a hierarchic MatchObject could 
> provide access to both information, the sub-groups and the group itself. 

Its contents are not lost, look at the overall dictionary...  In any
case, I think one can do better than a dictionary.

>>> x=pat2.match(buf) #or x=pat2.findall(buf)[0]
>>> x
'12 drummers drumming,'
>>> dir(x)
['verse']
>>> x.verse
'12 drummers drumming,'
>>> dir(x.verse)
['number', 'activity']
>>> x.verse.number
'12'
>>> x.verse.activity
'drummers drumming'

...would get my vote (or using obj.group(i) semantics I discuss below).
I notice that this is basically what the re2 module already does (having
read the web page), though rather than...
>>> pat2.extract(buf).verse[1].activity
'pipers piping'

I would prefer...

>>> pat2.findall(buf)[1].verse.activity
'pipers piping'

For .verse[1] or .verse[2] to make sense, it implies that the pattern is
something like...
((?P<verse>... )(?P<verse>...))
... which it isn't.

I understand that the decision was probably made to make it similar to
the case of...
((?P<foo>... (?p<goo>...)+))

... where multiple matches for goo would require x.foo.goo[i].

>   Also, should it be limited to named groups?

Probably not.  I would suggest using matchobj.group(i) semantics to
match the standard re module semantics, though only allow returning
items in the current level of the hierarchy.  That is, one could use
x.verse.group(1) and get back '12', but x.group(1) would return '12
pipers piping'

> > I am wondering what would be the best direction to take this project in.
> > 
> > Firstly is it, (or can it be made) useful enough to be included in the
> > python stdlib?  (ie. Should I bother writing a PEP for it.)
> > 
> > And if so, would it be best to merge its functionality in with the re
> > library, or to leave it as a separate module?
> > 
> > And, also are there any suggestions/criticisms on the library itself?
> 
> I find the feature very interesting, but being used to live without it, 
> I have difficulty evaluating its usefulness.  However, it reminds me how 
> much at first I found strange that only the last match was kept, so I 
> think, FWIW, that on a purist point of vue the functionality would make 
> sense in the stdlib in some way or another.

re2 can be used as a limited structural parser.  This makes the re
module useful for more things than it is currently. The question of it
being in the standard library, however, I think should be made based on
the criteria used previously (whatever they were).

 - Josiah