[Python-Dev] Re: hierarchicial named groups extension to the re
library
Nicolas Fleury
nidoizo at yahoo.com
Sun Apr 3 02:16:44 CEST 2005
Josiah Carlson wrote:
> Nicolas Fleury <nidoizo at yahoo.com> wrote:
>>ottrey at py.redsoft.be wrote:
>>
>>>>>>import re2
>>>>>>buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>>>>>regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
>>>>>>pat2=re2.compile(regex)
>>>>>>x=pat2.extract(buf)
>
> If one wanted to match the API of the re module, one should use
> pat2.findall(buf), which would return a list of 'hierarchical match
> objects', though with the above, one should really return a list of
> 'verse' items (the way the regular expression is written).
As far as I can understand, the two are orthogonal. findall is used to
match the regular expression multiple times; in this example the regular
expression is still matched only once.
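To illustrate the distinction with the standard re module only (re2 and
its extract method are the proposal under discussion and are not used
here), a rough sketch: a single match of the whole-line pattern keeps
only the last capture of each repeated group, while findall applies a
smaller pattern repeatedly. The single-verse pattern below is my own
assumption, not part of the proposal.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# One match of the whole-line pattern: the groups inside the repetition
# are overwritten on each iteration, so only the last capture survives.
whole = re.compile(
    r'^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$')
last_only = whole.match(buf).groupdict()

# Many matches of a single-verse pattern (assumed for illustration):
# findall returns one tuple of groups per match.
single = re.compile(r'(?P<number>\d+) (?P<activity>[^,]+)')
verses = single.findall(buf)
```

This is why a hierarchical result has to come from the single match
itself and cannot be obtained by calling findall with the same pattern.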
>>>{'verse': [{'number': '12', 'activity': 'drummers
>>>drumming'}, {'number': '11', 'activity': 'pipers
>>>piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
>>
>>Is a dictionary the good container or should another class be used?
>>Because in the example the content of the "verse" group is lost,
>>excluding its sub-groups. Something like a hierarchic MatchObject could
>>provide access to both information, the sub-groups and the group itself.
>
> Its contents are not lost, look at the overall dictionary... In any
> case, I think one can do better than a dictionary.
In that specific example, I meant that the space between "10" and "lords
a-leaping" was not stored in the dictionary, unless you mean the
dictionary from re instead of re2. Your proposal fixes that by making
the entire content of the parent group (verse) accessible.
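As a small sketch of what "both pieces of information" means, here is
the same data collected with the standard re module and finditer. The
single-verse pattern is an assumption made for the example; it is not
how re2 works internally.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# Assumed single-verse pattern: the outer group keeps the full text,
# the inner groups keep the parts.
verse_re = re.compile(r'(?P<verse>(?P<number>\d+) (?P<activity>[^,]+))')

# One dictionary per verse, holding the parent group and its sub-groups.
nested = [m.groupdict() for m in verse_re.finditer(buf)]
```

Each entry holds the full verse text, separator space included, next to
its number and activity.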
>>>>x=pat2.match(buf) #or x=pat2.findall(buf)[0]
>>>>x
>
> '12 drummers drumming,'
>
>>>>dir(x)
>
> ['verse']
>
>>>>x.verse
>
> '12 drummers drumming,'
>
It is very easy to use, but I doubt it is a good idea as a return value
for match (maybe a match object could have a function to return this
easy-to-use object). It would mean that the names of the groups are
limited by the interface of the match object returned (what would happen
if a group is named "start", "end" or simply "group"?).
Another solution is to use x["verse"] instead (or continue to use a
"group" method).
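The name clash can be seen with the standard re module already, where
"start" and "end" are methods of the match object. A minimal sketch,
with group names chosen on purpose to collide:

```python
import re

# Group names picked to clash with match-object methods.
m = re.match(r'(?P<start>\d+)-(?P<end>\d+)', '3-7')

# Attribute access reaches the methods, not the groups.
attr_is_method = callable(m.start) and callable(m.end)

# Lookup by name through group() is unambiguous.
by_name = (m.group('start'), m.group('end'))
```

If groups were exposed as attributes, either the groups or the methods
would have to lose; lookup by name avoids the question.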
>> Also, should it be limited to named groups?
>
> Probably not. I would suggest using matchobj.group(i) semantics to
> match the standard re module semantics, though only allow returning
> items in the current level of the hierarchy. That is, one could use
> x.verse.group(1) and get back '12', but x.group(1) would return '12
> pipers piping'
>
I totally agree that the matchobj.group interface should be matched.
Should group return another match object? Or should there be another
function to get the match objects of groups? Something like:
x.groupobj("verse").group("number")
or
str(x["verse"]["number"])
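To make the two spellings concrete, here is a rough sketch of such an
object. The class name HMatch and its constructor are hypothetical, and
the tree is built by hand; a real implementation would build it while
matching.

```python
class HMatch:
    """Hypothetical hierarchical match object (illustration only)."""

    def __init__(self, text, children=None):
        self.text = text                # text matched by this group
        self.children = children or {}  # sub-groups at this level only

    def group(self, name=None):
        # Like re's group(): return a string, not an object.
        if name is None:
            return self.text
        return self.children[name].text

    def groupobj(self, name):
        # Return the match object of a sub-group.
        return self.children[name]

    def __getitem__(self, name):
        # x["verse"] as an alternative spelling of groupobj.
        return self.children[name]

    def __str__(self):
        return self.text


# Hand-built tree for the first verse of the example.
verse = HMatch('12 drummers drumming',
               {'number': HMatch('12'),
                'activity': HMatch('drummers drumming')})
x = HMatch('12 drummers drumming, ', {'verse': verse})

via_groupobj = x.groupobj("verse").group("number")
via_getitem = str(x["verse"]["number"])
```

Both spellings give the same string; the difference is only whether
group itself or a separate accessor hands back the nested object.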
Regards,
Nicolas