[Python-Dev] hierarchical named groups extension to the re library

Sun Apr 3 09:24:49 CEST 2005

Nicolas Fleury <nidoizo at yahoo.com> wrote:
>
> ottrey at py.redsoft.be wrote:
> > >>> import re2
> > >>> buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> > >>> regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> > >>> pat2=re2.compile(regex)
> > >>> x=pat2.extract(buf)
> > >>> x
> >
> > {'verse': [{'number': '12', 'activity': 'drummers drumming'},
> >            {'number': '11', 'activity': 'pipers piping'},
> >            {'number': '10', 'activity': 'lords a-leaping'}]}
>
> Is a dictionary the right container, or should another class be used?
> In the example the content of the "verse" group itself is lost; only
> its sub-groups survive.  Something like a hierarchical MatchObject could
> provide access to both: the sub-groups and the group itself.

Yes, very good point.
Actually it ~is~ a container (that uses dict as its base class).
(I probably should add the following lines to the example.)

>>> type(x)
<class 're2._Match'>
>>> x._value
'12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>> x.verse[0]._value
'12 drummers drumming'
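
To make the "dict with extras" idea concrete, here is a minimal sketch of
such a container.  The class name Match and its constructor are assumptions
made for illustration, not the real re2._Match; only the dict base class,
the attribute-style access and the _value attribute come from the session
above.

```python
class Match(dict):
    """Hypothetical stand-in for re2._Match: a dict of sub-groups that
    also remembers the text the group itself matched."""

    def __init__(self, value, groups=()):
        super().__init__(groups)
        self._value = value  # text matched by this group as a whole

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so _value and
        # the dict methods are unaffected; sub-groups become attributes.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
verse0 = Match('12 drummers drumming',
               {'number': '12', 'activity': 'drummers drumming'})
x = Match(buf, {'verse': [verse0]})

print(x.verse[0]._value)   # the group's own text
print(x.verse[0].number)   # a sub-group, via attribute access
```

Because the object still is a dict, it prints and compares like one, while
the text of the group itself stays reachable through _value.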

Josiah Carlson <jcarlson at uci.edu> wrote:
> If one wanted to match the API of the re module, one should use
> pat2.findall(buf), which would return a list of 'hierarchical match
> objects'

Well, that is something I'd want to discuss here, as I'm not sure
I actually ~want~ to match the API of the re module.

> Also, should it be limited to named groups?

I have given that some thought as well.
Internally, un-named groups are recursively given the names _group0,
_group1, etc. as they are found, and then those groups are recursively
matched.  In the final step the resulting _Match object is compressed
and those un-named groups are discarded.

IMO, if you don't bother to name a group then you probably aren't going
to be interested in it anyway, so why keep a reference to it?

eg.
If you only wanted to extract the numbers from those verses...

>>> regex='^(((?P<number>\d+) ([^,]+))(, )?)*$'
>>> pat2=re2.compile(regex)
>>> x=pat2.extract(buf)
>>> x
{'number': ['12', '11', '10']}

Before the compression stage the _Match object actually looked like this:

{'_group0':
  {'_value': '12 drummers drumming, 11 pipers piping, 10 lords a-leaping',
   '_group0': [
     {'_value': '12 drummers drumming, ',
      '_group1': ', ',
      '_group0': {'_value': '12 drummers drumming',
                  '_group1': 'drummers drumming',
                  'number': '12'}},
     {'_value': '11 pipers piping, ',
      '_group1': ', ',
      '_group0': {'_value': '11 pipers piping',
                  '_group1': 'pipers piping',
                  'number': '11'}},
     {'_value': '10 lords a-leaping',
      '_group0': {'_value': '10 lords a-leaping',
                  '_group1': 'lords a-leaping',
                  'number': '10'}}]}}

But the compression algorithm collected the named groups and brought
them to the surface, to return the much nicer looking:

{'number': ['12', '11', '10']}
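
The compression step could look roughly like the sketch below.  This is
only a guess at its shape, working on plain dicts: the function names
compress and _merge are hypothetical, and the rule that a repeated name
turns into a list is inferred from the outputs in this thread, not taken
from the re2 source.

```python
def _merge(target, name, value):
    """Store value under name; repeated names are collected into one list."""
    values = value if isinstance(value, list) else [value]
    if name in target:
        existing = target[name]
        if not isinstance(existing, list):
            existing = [existing]
        target[name] = existing + values
    else:
        target[name] = value


def compress(node):
    """Hoist named groups out of '_groupN' wrappers and drop the wrappers."""
    surfaced = {}
    for key, child in node.items():
        if key == '_value':
            continue  # matched text of this node, not a sub-group
        items = child if isinstance(child, list) else [child]
        for item in items:
            if key.startswith('_group'):
                # Un-named group: keep only the named groups inside it.
                if isinstance(item, dict):
                    for name, value in compress(item).items():
                        _merge(surfaced, name, value)
            elif isinstance(item, dict):
                _merge(surfaced, key, compress(item))
            else:
                _merge(surfaced, key, item)
    return surfaced


# Abbreviated version (two verses) of the uncompressed structure above.
raw = {'_group0': {
    '_value': '12 drummers drumming, 11 pipers piping',
    '_group0': [
        {'_value': '12 drummers drumming, ', '_group1': ', ',
         '_group0': {'_value': '12 drummers drumming',
                     '_group1': 'drummers drumming', 'number': '12'}},
        {'_value': '11 pipers piping',
         '_group0': {'_value': '11 pipers piping',
                     '_group1': 'pipers piping', 'number': '11'}},
    ]}}

print(compress(raw))   # {'number': ['12', '11']}
```

One consequence of this particular sketch is that a name seen only once
stays a plain value rather than a one-element list.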

NB. There are also a few other tricks up the sleeve of re2.

eg.
It allows for named groups to be repeated in different branches of a
named group hierarchy, without the name redefinition error that the re
library will complain about.

eg.
>>> pat1=re2.compile(
...   '(?P<parents>(?P<mother>(?P<name>[\w ]+)),(?P<father>(?P<name>[\w ]+)))'
... )
>>> pat1.extract('Mum,Dad')
{'parents': {'father': {'name': 'Dad'}, 'mother': {'name': 'Mum'}}}
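
For comparison, the sketch below feeds the same pattern to the standard re
module, which rejects the second (?P<name>...) at compile time.  The
workaround that follows, with one flat group per parent and the nesting
rebuilt by hand, is just one possible approach and is not something re2
does.

```python
import re

pattern = (r'(?P<parents>(?P<mother>(?P<name>[\w ]+)),'
           r'(?P<father>(?P<name>[\w ]+)))')

error = None
try:
    re.compile(pattern)
except re.error as exc:
    # The stdlib refuses to define the group name 'name' twice.
    error = str(exc)
print(error)

# Workaround with plain re: unique flat names, nesting rebuilt manually.
flat = re.compile(r'(?P<mother>[\w ]+),(?P<father>[\w ]+)')
m = flat.match('Mum,Dad')
parents = {'parents': {role: {'name': text}
                       for role, text in m.groupdict().items()}}
print(parents)
```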

> I find the feature very interesting, but being used to living without
> it, I have difficulty evaluating its usefulness.

Yes - this is a good point too, because it ~is~ different from the re
library.  re2 aims to do all that searching, grouping, iterating and
collecting and constructing work for you.
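
As a point of reference, here is roughly what that manual work looks like
with the standard re module alone.  It is a sketch under the assumption
that two patterns are acceptable: one to validate the whole line and one
to iterate over the verses.  The names whole and verse are illustrative.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# Step 1: validate the whole line.
whole = re.compile(r'^(\d+ [^,]+(, )?)*$')
# Step 2: a second pattern to pull out each verse individually.
verse = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')

result = None
if whole.match(buf):
    # Iterate, collect and construct the nested result by hand.
    result = {'verse': [m.groupdict() for m in verse.finditer(buf)]}
print(result)
```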

> However, it reminds me how strange I found it at first that only the
> last match was kept, so I think, FWIW, that from a purist point of view
> the functionality would make sense in the stdlib in some way or another.

Actually that "last match only" confusion was part of the motivation for
writing it in the first place.
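
The behaviour in question is easy to reproduce with the standard re
module: when a group sits inside a repetition, the match object only
reports the text from the final iteration.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
regex = r'^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'

m = re.match(regex, buf)
# Three verses were matched, but each group holds only the last one.
print(m.group('verse'))      # 10 lords a-leaping
print(m.group('number'))     # 10
print(m.group('activity'))   # lords a-leaping
```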

> For .verse[1] or .verse[2] to make sense, it implies that the pattern is
> something like...
> ((?P<verse>... )(?P<verse>...))
> ... which it isn't.

Good pickup!
You've seen through my smoke and mirrors.  ;-)
That list of verses was actually created in the compression stage.
(The stage that I failed to mention in my first post.)

ie. The regex was:

((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*

Which returns an un-named list of verse groups.

Something like:

{'_group0': [
  {'verse': {'number': '12', 'activity': 'drummers drumming'}},
  {'verse': {'number': '11', 'activity': 'pipers piping'}},
  {'verse': {'number': '10', 'activity': 'lords a-leaping'}}]}

But the compression algorithm discarded that '_group0' key and brought
the 'verse' groups to the surface, then grouped them together in one
'verse' list.

ie. to make:

{'verse': [{'number': '12', 'activity': 'drummers drumming'},
           {'number': '11', 'activity': 'pipers piping'},
           {'number': '10', 'activity': 'lords a-leaping'}]}

> > Also, should it be limited to named groups?
>
> Probably not.  I would suggest using matchobj.group(i) semantics to
> match the standard re module semantics, though only allow returning
> items in the current level of the hierarchy.  That is, one could use
> x.verse.group(1) and get back '12', but x.group(1) would return '12
> pipers piping'

Actually, I ~would~ like to limit it to just named groups.
I reckon that if you're not going to bother naming a group, then why
would you have any interest in it?
I guess it's up for discussion how confusing this "new" way of thinking
could be and what drawbacks it might have.

Regards.

Chris.