Extract sentences in nested parentheses using Python

Thu Dec 5 10:31:54 EST 2019

On Tuesday, 3 December 2019 23:48:21 UTC+8, Peter Otten wrote:
> A S wrote:
> 
> > On Tuesday, 3 December 2019 01:01:25 UTC+8, Peter Otten wrote:
> >> A S wrote:
> >> 
> >> I think I've seen this question before ;)
> >> 
> >> > I am trying to extract all strings in nested parentheses (along with
> >> > the parentheses themselves) in my .txt file. Please see the sample .txt
> >> > file that I have used in this example here:
> >> > (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> >> > 
> >> > I have written three different attempts, but none of them seems able
> >> > to extract all the nested parentheses. Each extracts only a portion
> >> > of them. Any advice on what I've done wrong would really help!
> >> > 
> >> > Here are the three attempts I have written so far:
> >> > 
> >> > 1st attempt:
> >> > 
> >> > import re
> >> > from os.path import join
> >> > 
> >> > def balanced_braces(args):
> >> >     parts = []
> >> >     for arg in args:
> >> >         if '(' not in arg:
> >> >             continue
> >> 
> >> There could still be a ")" that you miss
> >> 
> >> >         chars = []
> >> >         n = 0
> >> >         for c in arg:
> >> >             if c == '(':
> >> >                 if n > 0:
> >> >                     chars.append(c)
> >> >                 n += 1
> >> >             elif c == ')':
> >> >                 n -= 1
> >> >                 if n > 0:
> >> >                     chars.append(c)
> >> >                 elif n == 0:
> >> >                     parts.append(''.join(chars).strip())
> >> >                     chars = []
> >> >             elif n > 0:
> >> >                 chars.append(c)
> >> >     return parts
> >> 
> >> It's probably easier to understand and implement when you process the
> >> complete text at once. Then arbitrary splits don't get in the way of your
> >> quest for ( and ). You just have to remember the position of the first
> >> opening ( and number of opening parens that have to be closed before you
> >> take the complete expression:
> >> 
> >> level:  00011112222100
> >> text:   abc(def(gh))ij
> >>    when we are here^
> >>     we need^
> >> 
> >> A tentative implementation:
> >> 
> >> $ cat parse.py
> >> import re
> >> 
> >> NOT_SET = object()
> >> 
> >> def scan(text):
> >>     level = 0
> >>     start = NOT_SET
> >>     for m in re.compile("[()]").finditer(text):
> >>         if m.group() == ")":
> >>             level -= 1
> >>             if level < 0:
> >>                 raise ValueError(
> >>                     "underflow: more closing than opening parens")
> >>             if level == 0:
> >>                 # outermost closing parenthesis:
> >>                 # deliver enclosed string including parens.
> >>                 yield text[start:m.end()]
> >>                 start = NOT_SET
> >>         elif m.group() == "(":
> >>             if level == 0:
> >>                 # outermost opening parenthesis: remember position.
> >>                 assert start is NOT_SET
> >>                 start = m.start()
> >>             level += 1
> >>         else:
> >>             assert False
> >>     if level > 0:
> >>         raise ValueError("unclosed parens remain")
> >> 
> >> 
> >> if __name__ == "__main__":
> >>     with open("lan sample text file.txt") as instream:
> >>         text = instream.read()
> >>     for chunk in scan(text):
> >>         print(chunk)
> >> $ python3 parse.py
> >> ("xE'", PUT(xx.xxxx.),"'")
> >> ("TRUuuuth")
> > 
> > Hello Peter! I tried this on my actual working files and it returned this
> > error: "unclosed parens remain". In this case, how can I continue to parse
> > through my text files by only extracting those with balanced parentheses
> > and ignore those that are incomplete?
> 
> import sys
> 
> filenames = ...
> for filename in filenames:
>     with open(filename) as instream:
>         text = instream.read()
>         try:
>             chunks = list(scan(text))
>         except ValueError as err:
>             print(f"{err} in file {filename!r}", file=sys.stderr)
>         else:
>             for chunk in chunks:
>                 print(chunk)
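
Note that the loop above discards every chunk from a file as soon as
scan() raises. If the goal is to keep the balanced groups in such a file
and drop only the incomplete ones, a stack of opening positions is one
option. The following is a minimal sketch, not part of Peter's code: the
name scan_lenient is hypothetical, and it assumes that stray closing
parens and unclosed opening parens should simply be ignored.

```python
import re


def scan_lenient(text):
    """Return the outermost balanced (...) groups, skipping unbalanced parens.

    Hypothetical helper: a stray ")" is ignored, and an unclosed "(" is
    dropped, but balanced groups that follow or sit inside it are kept.
    """
    stack = []  # positions of "(" still waiting for their ")"
    spans = []  # (start, end) of balanced groups not enclosed by another
    for m in re.finditer(r"[()]", text):
        if m.group() == "(":
            stack.append(m.start())
        elif stack:
            start = stack.pop()
            # Groups that start after `start` lie inside the group that
            # was just completed, so they are no longer outermost.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, m.end()))
        # else: ")" with nothing open, ignore it
    return [text[start:end] for start, end in spans]


if __name__ == "__main__":
    print(scan_lenient("abc(def(gh))ij"))  # ['(def(gh))']
    print(scan_lenient("(a (b) c"))        # ['(b)']
    print(scan_lenient("x) (y)"))          # ['(y)']
```

Unlike scan(), this never raises, so it cannot report which files contain
unbalanced parens. Whether that trade-off is acceptable depends on the data.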

Hey Peter, it works! Thank you :)
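
For the original goal of getting the inner groups as well, and not only
the outermost ones, a stack of opening positions can report every balanced
pair. This is only a sketch under assumptions: scan_all_levels is a
hypothetical name that does not appear in the thread, and unbalanced
parens are silently skipped rather than reported.

```python
import re


def scan_all_levels(text):
    """Yield every balanced (...) group, inner groups before outer ones."""
    stack = []  # positions of "(" that have not been closed yet
    for m in re.finditer(r"[()]", text):
        if m.group() == "(":
            stack.append(m.start())
        elif stack:
            # This ")" closes the most recently opened "(".
            yield text[stack.pop():m.end()]
        # else: ")" with nothing open, skip it


if __name__ == "__main__":
    for chunk in scan_all_levels("abc(def(gh))ij"):
        print(chunk)
    # prints "(gh)" and then "(def(gh))"
```

Groups come out in the order in which they close, so a nested group is
always delivered before the group that contains it.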