[Python-Dev] Re: hierarchical named groups extension to the re library

Nicolas Fleury nidoizo at yahoo.com
Sun Apr 3 02:16:44 CEST 2005


Josiah Carlson wrote:
> Nicolas Fleury <nidoizo at yahoo.com> wrote:
>>ottrey at py.redsoft.be wrote:
>>
>>>>>>import re2
>>>>>>buf='12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
>>>>>>regex='^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'
>>>>>>pat2=re2.compile(regex)
>>>>>>x=pat2.extract(buf)
> 
> If one wanted to match the API of the re module, one should use
> pat2.findall(buf), which would return a list of 'hierarchical match
> objects', though with the above, one should really return a list of
> 'verse' items (the way the regular expression is written).

As far as I understand, the two are orthogonal.  findall is used to 
match the regular expression multiple times; in the example above, the 
regular expression is still matched only once.
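
A minimal sketch of that distinction, using only the standard re module
(re2 is the proposed extension and is not assumed to be available here).
The per-verse pattern passed to findall is a simplification introduced
for illustration and is not part of the proposal.

```python
import re

buf = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'
regex = r'^((?P<verse>(?P<number>\d+) (?P<activity>[^,]+))(, )?)*$'

# One match of the whole pattern: the outer group repeats three times,
# but each named group only keeps the text from its last repetition.
single = re.match(regex, buf)
last_capture = single.groupdict()

# findall applies a pattern repeatedly over the buffer instead.
# verse_pattern is a hypothetical per-verse pattern, not from the thread.
verse_pattern = r'(\d+) ([^,]+)'
all_verses = re.findall(verse_pattern, buf)

print(last_capture)
print(all_verses)
```

With plain re, the single match therefore loses the first two verses,
which is the gap the hierarchical extension is meant to fill.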

>>>{'verse': [{'number': '12', 'activity': 'drummers
>>>drumming'}, {'number': '11', 'activity': 'pipers
>>>piping'}, {'number': '10', 'activity': 'lords a-leaping'}]}
>>
>>Is a dictionary the good container or should another class be used? 
>>Because in the example the content of the "verse" group is lost, 
>>excluding its sub-groups.  Something like a hierarchic MatchObject could 
>>provide access to both information, the sub-groups and the group itself. 
> 
> Its contents are not lost, look at the overall dictionary...  In any
> case, I think one can do better than a dictionary.

In that specific example, I meant that the space between "10" and "lords 
a-leaping" was not stored in the dictionary, unless you are talking about 
the dictionary from re instead of re2.  Your proposal fixes that, by 
making the entire content of the parent group (verse) accessible.

>>>>x=pat2.match(buf) #or x=pat2.findall(buf)[0]
>>>>x
> 
> '12 drummers drumming,'
> 
>>>>dir(x)
> 
> ['verse']
> 
>>>>x.verse
> 
> '12 drummers drumming,'
> 

It is very easy to use, but I doubt it is a good idea as a return value 
for match (maybe a match object could have a function to return this 
easy-to-use object).  It would mean that the names of the groups are 
limited by the interface of the match object returned (what would happen 
if a group is named "start", "end" or simply "group"?).

Another solution is to use x["verse"] instead (or continue to use a 
"group" method).
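
A small sketch of the name clash, under the assumption that group names
are exposed as attributes next to the usual match methods.  AttrMatch is
a hypothetical class written for this illustration; it is not the re2
API.

```python
class AttrMatch:
    """Hypothetical match object exposing named groups as attributes."""

    def __init__(self, groups):
        self._groups = dict(groups)

    def start(self):
        # Stand-in for the regular match-object method of the same name.
        return 0

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so a group
        # named "start" can never be reached this way.
        try:
            return self._groups[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, name):
        # Item access has its own namespace, so no clash is possible.
        return self._groups[name]


m = AttrMatch({'verse': '12 drummers drumming', 'start': '12'})

verse_by_attr = m.verse           # the group, found via __getattr__
start_is_method = callable(m.start)  # the method shadows the group
start_by_key = m['start']         # the group, reachable by key
```

Here m.start is still the method, while m['start'] returns the group
named "start", which is why item access avoids the restriction.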

>>  Also, should it be limited to named groups?
> 
> Probably not.  I would suggest using matchobj.group(i) semantics to
> match the standard re module semantics, though only allow returning
> items in the current level of the hierarchy.  That is, one could use
> x.verse.group(1) and get back '12', but x.group(1) would return '12
> pipers piping'
> 

I totally agree that the matchobj.group interface should be matched.  
Should group return another match object?  Or should there be another 
function to get the match objects of groups?  Something like:
x.groupobj("verse").group("number")
or
str(x["verse"]["number"])
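
A rough sketch of how both spellings could coexist.  GroupNode and its
methods are hypothetical names made up for this illustration, and the
hierarchy is built by hand from a standard re match of a single verse
rather than by any real re2 implementation.

```python
import re


class GroupNode:
    """Hypothetical hierarchical match object: own text plus sub-groups."""

    def __init__(self, text, children=None):
        self.text = text
        self.children = children or {}

    def group(self, name):
        # Mirrors matchobj.group(name): returns the matched string.
        return self.children[name].text

    def groupobj(self, name):
        # Returns the nested match object instead of a string.
        return self.children[name]

    def __getitem__(self, name):
        # Same as groupobj, spelled as item access.
        return self.children[name]

    def __str__(self):
        return self.text


m = re.match(r'(?P<verse>(?P<number>\d+) (?P<activity>[^,]+))',
             '12 drummers drumming, 11 pipers piping')
verse = GroupNode(m.group('verse'), {
    'number': GroupNode(m.group('number')),
    'activity': GroupNode(m.group('activity')),
})
x = GroupNode(m.group(0), {'verse': verse})

number_via_groupobj = x.groupobj("verse").group("number")
number_via_items = str(x["verse"]["number"])
verse_text = str(x["verse"])  # the parent group's full text is kept
```

Both access styles return '12' for the number, and the text of the
parent group, including the space between number and activity, stays
available through str().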

Regards,
Nicolas


