parenthesis

Bengt Richter bokr at oz.net
Mon Nov 4 21:25:57 EST 2002


On 4 Nov 2002 22:05:11 GMT, bokr at oz.net (Bengt Richter) wrote:

>On 4 Nov 2002 12:24:31 -0800, mis6 at pitt.edu (Michele Simionato) wrote:
>
>>Suppose I want to parse the following expression:
>>
>>>>> exp='(a*(b+c*(2-x))+d)+f(s1)'
>>
>>I want to extract the first part, i.e. '(a*(b+c*(2-x))+d)'.
>>
[... previous version ...]

Wondering why I didn't just write:

 >>> import re
 >>> rx = re.compile(r'([()]|[^()]+)')
 >>> class Addelim:
 ...     # Callable for rx.sub(): appends delim after each top-level ')'.
 ...     def __init__(self, delim):
 ...         self.parens=0; self.delim=delim
 ...     def __call__(self, m):
 ...         s = m.group(1)
 ...         if s=='(': self.parens+=1
 ...         if self.parens==1 and s==')':
 ...             # this ')' closes a top-level group
 ...             self.parens=0
 ...             return s+self.delim
 ...         if s==')': self.parens -=1
 ...         return s
 ...
 >>> exp =  '(a*(b+c*(2-x))+d)+f(s1)'

It seemed natural to be able to specify the delimiter. And the + is probably
better than the * on the non-paren "[^()]+" part of the pattern, because *
can also match the empty string, which only adds useless callback calls.
Then, using \n as the delimiter to break into lines, one can just print it.

 >>> print(rx.sub(Addelim('\n'),exp))
 (a*(b+c*(2-x))+d)
 +f(s1)

Which you could also use like:

 >>> print(rx.sub(Addelim('\n'),exp).splitlines())
 ['(a*(b+c*(2-x))+d)', '+f(s1)']

Or to get back to your original requirement,

 >>> print(rx.sub(Addelim('\n'),exp).splitlines()[0])
 (a*(b+c*(2-x))+d)
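
For anyone who wants this as a script rather than an interactive session, here
is a minimal sketch of the same callback idea in Python 3 syntax. The names
TOKEN_RX, AddDelim and top_level_pieces are made up for the illustration, and it
assumes the expression itself contains no newlines, since '\n' is used as the
marker.

```python
import re

# Same tokenizer as above: a single paren, or a run of non-paren characters.
TOKEN_RX = re.compile(r'([()]|[^()]+)')


class AddDelim:
    """Callable for re.sub(): appends delim after each top-level ')'."""

    def __init__(self, delim):
        self.depth = 0
        self.delim = delim

    def __call__(self, m):
        s = m.group(1)
        if s == '(':
            self.depth += 1
        elif s == ')':
            self.depth -= 1
            if self.depth == 0:
                # This ')' closes a top-level group, so mark the spot.
                return s + self.delim
        return s


def top_level_pieces(exp):
    """Split exp after every top-level closing paren.

    Assumption: exp contains no newline characters of its own.
    """
    return TOKEN_RX.sub(AddDelim('\n'), exp).splitlines()


if __name__ == '__main__':
    print(top_level_pieces('(a*(b+c*(2-x))+d)+f(s1)'))
```

The depth counter is decremented before the test here, which is only a
rearrangement of the checks in the class above; the marked positions are the
same.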

But I suspect it would run faster to let a regex split the string and then use
a loop like yours on the pieces, which would be '(' or ')' or some other string
that you don't need to look at character by character. E.g.,

 >>> rx = re.compile(r'([()])')
 >>> ss = rx.split(exp)
 >>> ss
 ['', '(', 'a*', '(', 'b+c*', '(', '2-x', ')', '', ')', '+d', ')', '+f', '(', 's1', ')', '']

Notice that the splitter matches wind up at the odd indices. That holds whenever the splitting
expression is a single parenthesized group, because the matches are then returned as part of the
list, alternating with the text between them. You could make use of that, something like:

 >>>
 >>> parens = 0
 >>> endix = []
 >>> for i in range(1,len(ss),2):
 ...     if parens==1 and ss[i]==')':
 ...         parens=0; endix.append(i+1)
 ...     elif ss[i]=='(': parens += 1
 ...     else:            parens -= 1
 ...
 >>> endix
 [12, 16]

You could break the loop like you did if you just want the first expression,
or you could grab it by

 >>> print(''.join(ss[:endix[0]]))
 (a*(b+c*(2-x))+d)

or list the bunch,

 >>> lo=0
 >>> for hi in endix:
 ...     print(''.join(ss[lo:hi]))
 ...     lo = hi
 ...
 (a*(b+c*(2-x))+d)
 +f(s1)

or whatever. That's not as slick, but it's probably faster if you have to do a big bunch of them.
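
The split-and-count steps above can be folded into one generator. This is only
a sketch in Python 3 syntax; PAREN_RX, iter_top_level and first_group are
hypothetical names, and any text left over after the last top-level ')' is
dropped, just as in the endix loop.

```python
import re

# One capturing group, so re.split() keeps the parens in the result list.
PAREN_RX = re.compile(r'([()])')


def iter_top_level(exp):
    """Yield each piece of exp that ends at a top-level ')'."""
    parts = PAREN_RX.split(exp)
    # Sanity check of the odd-index observation: every odd slot is a paren.
    assert all(p in ('(', ')') for p in parts[1::2])
    depth = 0
    lo = 0
    for i in range(1, len(parts), 2):
        if parts[i] == '(':
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                # parts[i] closes a top-level group; emit everything up to it.
                yield ''.join(parts[lo:i + 1])
                lo = i + 1


def first_group(exp):
    """Return the piece ending at the first top-level ')', or None."""
    return next(iter_top_level(exp), None)


if __name__ == '__main__':
    exp = '(a*(b+c*(2-x))+d)+f(s1)'
    print(first_group(exp))
    print(list(iter_top_level(exp)))
```

Because it is a generator, asking only for the first group stops the loop
early, which is the same effect as breaking out of the loop by hand.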

I think when the fenceposts are simple and you are mainly interested in the data between them,
splitting on a fencepost regex and processing the resulting list can be simpler and faster than
trying to do it all with a complex regex.
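
The "faster" part is a guess, so it is worth measuring. Here is a rough sketch,
in Python 3 syntax, of how the two approaches could be timed with the standard
timeit module. The function names are made up, the sub version uses a closure
in place of the Addelim class, and no timings are quoted because they depend on
the machine, the Python version and the input.

```python
import re
import timeit

EXP = '(a*(b+c*(2-x))+d)+f(s1)'
TOKEN_RX = re.compile(r'([()]|[^()]+)')
PAREN_RX = re.compile(r'([()])')


def first_by_sub(exp):
    """First top-level group via a re.sub() callback (closure variant)."""
    depth = 0

    def mark(m):
        nonlocal depth
        s = m.group(1)
        if s == '(':
            depth += 1
        elif s == ')':
            depth -= 1
            if depth == 0:
                return s + '\n'
        return s

    return TOKEN_RX.sub(mark, exp).splitlines()[0]


def first_by_split(exp):
    """First top-level group via re.split() and a loop over the parens."""
    parts = PAREN_RX.split(exp)
    depth = 0
    for i in range(1, len(parts), 2):
        if parts[i] == '(':
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                return ''.join(parts[:i + 1])
    return None


def compare(number=10000):
    """Print rough timings for both functions on EXP."""
    for fn in (first_by_sub, first_by_split):
        t = timeit.timeit(lambda: fn(EXP), number=number)
        print('%-15s %.4f s for %d calls' % (fn.__name__, t, number))


if __name__ == '__main__':
    compare()
```

Note that the comparison is not quite even: first_by_split returns as soon as
the first group closes, while first_by_sub always walks the whole string.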

Regards,
Bengt Richter


