parenthesis

Tue Nov 5 19:59:56 EST 2002

On 5 Nov 2002 09:56:54 -0800, mis6 at pitt.edu (Michele Simionato) wrote:

>bokr at oz.net (Bengt Richter) wrote in message news:<aq6qun$ib4$0 at 216.39.172.122>...
>> Well, they don't count, so if you want to count you have to throw in
>> something extra. E.g., you could do this, to insert a delimiter after
>> a closing right paren, and then split on the delimiter. Probably not
                                                           ^^^^^^^^^^^^
>> wonderfully efficient, and I am just duplicating what you did, except
   ^^^^^^^^^^^^^^^^^^^^^-[1]

>> the regex separates the chunks for me.
>> 
>>  >>> import re
>>  >>> rx = re.compile(r'([()]|[^()]*)')
>>  >>> class Addelim:
>>  ...     def __init__(self): self.parens=0
>>  ...     def __call__(self, m):
>>  ...         s = m.group(1)
>>  ...         if s=='(': self.parens+=1
>>  ...         if self.parens==1 and s==')':
>>  ...             self.parens=0
>>  ...             return s+'\x00'
>>  ...         if s==')': self.parens -=1
>>  ...         return s
>>  ...
>>  >>> for e in rx.sub(Addelim(),exp).split('\x00'): print e
>>  ...
>>  (a*(b+c*(2-x))+d)
>>  +f(s1)
>> 
>> Where exp was
>>  >>> exp
>>  '(a*(b+c*(2-x))+d)+f(s1)'
>> 
>> Regards,
>> Bengt Richter
>
>Very interesting approach. But it is even more interesting to compare
Yes, I had just learned from a post of Alex M that sub can take a function,
so I thought it would be interesting to use it ;-)

>its
>speed with the simple minded approach. I thought your algorithm was
>going to
>be the fastest, since you do not split the initial string chars by
I didn't think so [1] ;-)
>chars in Python, but let the regular expression do the job.
Well, doing a job that doesn't really need to be done never increases speed ;-)

I.e., for your original problem, there is no need to create an alternative string
with delimiters after expressions, never mind  to split that string again to get
at just the first item. All that's needed is to look at the input string and count
to find out where the first expression ends.

The question then becomes, what is the fastest way to look for parentheses and make
sure you've counted some before detecting balanced count.

If parentheses were separated by thousands of characters, I would expect that a regex search
or finditer for [()] would go faster than a simple s[i] indexed scan, but for short expressions
such as the example, there's probably a crossover point. Certainly '(x)' can be scanned brute force
faster than you can get results from a regex object method.

For speed, I think it might be faster to count position but not use it to get the
characters, e.g.,

 >>> def get_1st(exp):
 ...     parens=i=0
 ...     for c in exp:
 ...         i += 1
 ...         if c=='(': parens+=1
 ...         elif c==')':
 ...             parens -=1
 ...             if not parens: return exp[:i]
 ...
 >>> exp = '(a*(b+c*(2-x))+d)+f(s1)'
 >>> get_1st(exp)
 '(a*(b+c*(2-x))+d)'

This is a little different from Lee Harr's post (which BTW looks to me
like it depends on the first char being '('). I'd expect the above to run a
little faster.

>However a simple benchmark (not subtracting the overhead times) gives:
>
>parenthesized_group:  130-140 microseconds
>Addelim:           620-640 microseconds
Yes, Addelim has calling overhead for one, and is creating a new string by substitution
for another, so it's doing a lot of un-needed work.
>
>The simple minded approach is more than four-five times faster!
>
>I think this is rather instructive. Addelim is particular inefficient
>for
>long expressions, since it analizes the full expression whereas
>parenthesized_group stops at the end of the first parenthesized group.
Sure. The cost of generalizing requirements instead of sticking to specs ;-)

>For fun I run Addelim on exp*100 (i.e. the 100 times the original
>string): it takes more than 50000 microseconds whereas
>parenthesized_group
>is still on 140 microseconds.
Not exactly unanticipated, I'm sure ;-)
>
>It is good to have yet another proof of the dangers involved with
>regular
>expressions !

There are dangers in drawing too general conclusions from particular
examples too ;-)

Regards,
Bengt Richter