[Python-Dev] hierarchical named groups extension to the re library
ottrey at py.redsoft.be
Sun Apr 3 09:24:49 CEST 2005
Nicolas Fleury <nidoizo at yahoo.com> wrote:
>
> ottrey at py.redsoft.be wrote:
> >>>>import re2
> >>>>buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
> >>>>regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
> >>>>pat2=re2.compile(regex)
> >>>>x=pat2.extract(buf)
> >>>>x
> >
> > {'verse': [{'number': '12', 'activity': 'drummers
> > drumming'}, {'number': '11', 'activity': 'pipers
> > piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
>
> Is a dictionary the good container or should another class be used?
> Because in the example the content of the "verse" group is lost,
> excluding its sub-groups. Something like a hierarchic MatchObject could
> provide access to both information, the sub-groups and the group itself.
Yes, very good point.
Actually it ~is~ a container (one that uses dict as its base class).
(I probably should add the following lines to the example.)
>>> type(x)
<class 're2._Match'>
>>> x._value
'12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>> x.verse[0]._value
'12 drummers drumming'
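As an illustration only, here is a minimal Python 3 sketch of a dict-based container that also keeps the matched text of the group itself. The class name Match and its constructor are hypothetical; this is not re2's actual _Match code, just one way such an object could behave.

```python
class Match(dict):
    """Hypothetical stand-in for re2._Match: a dict of sub-groups that
    also remembers the text matched by the group itself."""

    def __init__(self, value, groups=None):
        super().__init__(groups or {})
        self._value = value  # full text matched by this group

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so _value and
        # the ordinary dict methods are unaffected.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


verses = [
    Match('12 drummers drumming',
          {'number': '12', 'activity': 'drummers drumming'}),
    Match('11 pipers piping',
          {'number': '11', 'activity': 'pipers piping'}),
]
x = Match('12 drummers drumming, 11 pipers piping', {'verse': verses})

print(x.verse[0]._value)   # text of the first verse group
print(x.verse[1].number)   # a sub-group of the second verse
```

Keeping _value as an attribute rather than a key means the printed dict shows only the sub-groups, while the group's own text stays reachable.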
Josiah Carlson <jcarlson at uci.edu> wrote:
> If one wanted to match the API of the re module, one should use
> pat2.findall(buf), which would return a list of 'hierarchical match
> objects'
Well, that would be something I'd want to discuss here, as I'm not sure
I actually ~want~ to match the API of the re module.
> Also, should it be limited to named groups?
I have given that some thought as well.
Internally, un-named groups are given the names _group0, _group1, etc.
as they are found, and those groups are then matched recursively. In the
final step the resulting _Match object is compressed and the un-named
groups are discarded.
IMO if you don't bother to name a group then you probably aren't going
to be interested in it anyway, so why keep a reference to it?
eg.
If you only wanted to extract the numbers from those verses...
>>> regex='^(((?P<number>\d+) ([^,]+))(, )?)*$'
>>> pat2=re2.compile(regex)
>>> x=pat2.extract(buf)
>>> x
{'number': ['12', '11', '10']}
Before the compression stage the _Match object actually looked like this:
{'_group0': {'_value': '12 drummers drumming, 11 pipers piping, 10 lords a-leaping',
             '_group0': [{'_value': '12 drummers drumming, ',
                          '_group1': ', ',
                          '_group0': {'_value': '12 drummers drumming',
                                      '_group1': 'drummers drumming',
                                      'number': '12'}},
                         {'_value': '11 pipers piping, ',
                          '_group1': ', ',
                          '_group0': {'_value': '11 pipers piping',
                                      '_group1': 'pipers piping',
                                      'number': '11'}},
                         {'_value': '10 lords a-leaping',
                          '_group0': {'_value': '10 lords a-leaping',
                                      '_group1': 'lords a-leaping',
                                      'number': '10'}}]}}
But the compression algorithm collected the named groups and brought
them to the surface, to return the much nicer looking:
{'number': ['12', '11', '10']}
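The compression step can be sketched roughly as follows (Python 3). This is a guess at the idea, not re2's actual algorithm: the function names compress, _collect, _named and _add are hypothetical, it works on plain dicts, and it simply drops every '_value' entry. It also assumes that a name seen more than once should become a list.

```python
def compress(node):
    """Collect the named groups found anywhere below node into one dict."""
    result = {}
    _collect(node, result)
    return result


def _collect(node, result):
    if isinstance(node, list):            # repeated group: visit each repetition
        for item in node:
            _collect(item, result)
    elif isinstance(node, dict):
        for key, value in node.items():
            if key == '_value':           # matched text of the wrapper itself
                continue
            if key.startswith('_group'):  # un-named: keep contents, drop the key
                _collect(value, result)
            else:                         # named: keep it, compressed in turn
                _add(result, key, _named(value))
    # A plain string reaching this point belonged to an un-named group
    # and is discarded.


def _named(value):
    if isinstance(value, dict):
        return compress(value)
    if isinstance(value, list):
        return [_named(item) for item in value]
    return value


def _add(result, key, value):
    """Store value under key; a name seen more than once becomes a list."""
    if key not in result:
        result[key] = value
    elif isinstance(result[key], list):
        result[key].append(value)
    else:
        result[key] = [result[key], value]


before = {'_group0': {
    '_value': '12 drummers drumming, 11 pipers piping, 10 lords a-leaping',
    '_group0': [
        {'_value': '12 drummers drumming, ', '_group1': ', ',
         '_group0': {'_value': '12 drummers drumming',
                     '_group1': 'drummers drumming', 'number': '12'}},
        {'_value': '11 pipers piping, ', '_group1': ', ',
         '_group0': {'_value': '11 pipers piping',
                     '_group1': 'pipers piping', 'number': '11'}},
        {'_value': '10 lords a-leaping',
         '_group0': {'_value': '10 lords a-leaping',
                     '_group1': 'lords a-leaping', 'number': '10'}},
    ]}}

print(compress(before))   # {'number': ['12', '11', '10']}
```

One limitation of this sketch: a name that occurs only once stays a bare value, so the caller cannot tell a single repetition from a non-repeated group.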
NB. There are also a few other tricks up the sleeve of re2.
eg.
It allows for named groups to be repeated in different branches of a
named group hierarchy, without the name redefinition error that the re
library will complain about.
eg.
>>> pat1=re2.compile(
... '(?P<parents>(?P<mother>(?P<name>[\w ]+)),(?P<father>(?P<name>[\w ]+)))'
... )
>>> pat1.extract('Mum,Dad')
{'parents': {'father': {'name': 'Dad'}, 'mother': {'name': 'Mum'}}}
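For contrast, this is how the standard re module treats the same pattern (Python 3). The renamed groups mother_name and father_name in the workaround are made up for this illustration, and the nested dict is assembled by hand.

```python
import re

pattern = (r'(?P<parents>(?P<mother>(?P<name>[\w ]+)),'
           r'(?P<father>(?P<name>[\w ]+)))')

# The stdlib refuses a group name that is defined twice.
try:
    re.compile(pattern)
    rejected = False
except re.error as exc:
    rejected = True
    print('re rejects the pattern:', exc)

# Usual workaround: give each group a unique name and rebuild the
# hierarchy by hand afterwards.
unique = re.compile(r'(?P<parents>(?P<mother>(?P<mother_name>[\w ]+)),'
                    r'(?P<father>(?P<father_name>[\w ]+)))')
m = unique.match('Mum,Dad')
nested = {'parents': {'mother': {'name': m.group('mother_name')},
                      'father': {'name': m.group('father_name')}}}
print(nested)
```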
> I find the feature very interesting, but being used to live without it,
> I have difficulty evaluating its usefulness.
Yes - this is a good point too, because it ~is~ different from the re
library. re2 aims to do all that searching, grouping, iterating and
collecting and constructing work for you.
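To show the kind of work being saved, here is roughly what the searching, iterating and collecting looks like when done by hand with the standard re module (Python 3). The second pattern, verse_pat, is introduced only for this sketch; it is not part of re2.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# Step 1: check that the whole line has the expected shape.
line_pat = re.compile(r'^((\d+) ([^,]+)(, )?)*$')
if line_pat.match(buf) is None:
    raise ValueError('not a list of verses')

# Step 2: a second pattern and an explicit loop to pull out every verse,
# because the first match object only remembers the last repetition.
verse_pat = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')
result = {'verse': [m.groupdict() for m in verse_pat.finditer(buf)]}

print(result)
```

Two patterns and a loop are needed to build what the single hierarchical pattern describes on its own.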
> However, it reminds me how much at first I found strange that only the
> last match was kept, so I think, FWIW, that on a purist point of vue the
> functionality would make sense in the stdlib in some way or another.
Actually that "last match only" confusion was part of the motivation for
writing it in the first place.
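A short demonstration of that behaviour with the standard re module (Python 3), using the same buffer and regex as the original example:

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
regex = r'^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'

m = re.match(regex, buf)

# The outer group repeats three times, but each named group only
# reports what it captured on the final repetition.
print(m.group('number'))    # '10'
print(m.group('verse'))     # '10 lords a-leaping'
print(m.groupdict())
```

The 12 and 11 verses took part in the match but are no longer reachable from the match object.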
> For .verse[1] or .verse[2] to make sense, it implies that the pattern is
> something like...
> ((?P<verse>... )(?P<verse>...))
> ... which it isn't.
Good pickup!
You've seen through my smoke and mirrors. ;-)
That list of verses was actually created in the compression stage.
(The stage that I failed to mention in my first post.)
ie. The regex was:
((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*
Which returns an un-named list of verse groups.
Something like:
{'_group0': [{'verse': {'number': '12', 'activity': 'drummers drumming'}},
             {'verse': {'number': '11', 'activity': 'pipers piping'}},
             {'verse': {'number': '10', 'activity': 'lords a-leaping'}}]}
But the compression algorithm discarded that '_group0' key and brought
the 'verse' groups to the surface, then grouped them together in one
'verse' list.
ie. to make:
{'verse': [{'number': '12', 'activity': 'drummers drumming'},
           {'number': '11', 'activity': 'pipers piping'},
           {'number': '10', 'activity': 'lords a-leaping'}]}
> > Also, should it be limited to named groups?
>
> Probably not. I would suggest using matchobj.group(i) semantics to
> match the standard re module semantics, though only allow returning
> items in the current level of the hierarchy. That is, one could use
> x.verse.group(1) and get back '12', but x.group(1) would return '12
> pipers piping'
Actually, I ~would~ like to limit it to just named groups.
I reckon that if you're not going to bother naming a group, then why
would you have any interest in it?
I guess it's up for discussion how confusing this "new" way of thinking
could be and what drawbacks it might have.
Regards.
Chris.