Extract sentences in nested parentheses using Python
A S
aishan0403 at gmail.com
Tue Dec 3 08:41:18 EST 2019
On Tuesday, 3 December 2019 01:01:25 UTC+8, Peter Otten wrote:
> A S wrote:
>
> I think I've seen this question before ;)
>
> > I am trying to extract all strings in nested parentheses (along with the
> > parentheses themselves) from my .txt file. Please see the sample .txt file
> > that I have used in this example here:
> > (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> >
> > I have written three different versions of the code, but none of them
> > seems to extract all the nested parentheses; each extracts only a portion
> > of them. Any advice on what I've done wrong would really help!
> >
> > Here are the three versions I have written so far:
> >
> > 1st attempt:
> >
> > import re
> > from os.path import join
> >
> > def balanced_braces(args):
> >     parts = []
> >     for arg in args:
> >         if '(' not in arg:
> >             continue
>
> There could still be a ")" that you miss
>
> >         chars = []
> >         n = 0
> >         for c in arg:
> >             if c == '(':
> >                 if n > 0:
> >                     chars.append(c)
> >                 n += 1
> >             elif c == ')':
> >                 n -= 1
> >                 if n > 0:
> >                     chars.append(c)
> >                 elif n == 0:
> >                     parts.append(''.join(chars).lstrip().rstrip())
> >                     chars = []
> >             elif n > 0:
> >                 chars.append(c)
> >     return parts
>
> It's probably easier to understand and implement when you process the
> complete text at once. Then arbitrary splits don't get in the way of your
> quest for ( and ). You just have to remember the position of the first
> opening ( and the number of opening parens that have to be closed before you
> take the complete expression:
>
> level: 00011112222100
> text:  abc(def(gh))ij
>   when we are here^
>    we need^
>
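The level line in the diagram can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `levels` is made up here, each parenthesis is counted at its own depth (which is what the diagram appears to do), and single digits only work for depths below 10.

```python
def levels(text):
    """Return one depth digit per character of text (hypothetical helper)."""
    digits = []
    level = 0
    for char in text:
        if char == "(":
            # An opening paren belongs to the level it opens.
            level += 1
            digits.append(level)
        elif char == ")":
            # A closing paren belongs to the level it closes.
            digits.append(level)
            level -= 1
        else:
            digits.append(level)
    return "".join(str(d) for d in digits)


print(levels("abc(def(gh))ij"))  # prints 00011112222100
```

The outermost expression ends where the level drops back to 0, and it starts at the first position where the level became 1.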
> A tentative implementation:
>
> $ cat parse.py
> import re
>
> NOT_SET = object()
>
> def scan(text):
>     level = 0
>     start = NOT_SET
>     for m in re.compile("[()]").finditer(text):
>         if m.group() == ")":
>             level -= 1
>             if level < 0:
>                 raise ValueError("underflow: more closing than opening parens")
>             if level == 0:
>                 # outermost closing parenthesis:
>                 # deliver enclosed string including parens.
>                 yield text[start:m.end()]
>                 start = NOT_SET
>         elif m.group() == "(":
>             if level == 0:
>                 # outermost opening parenthesis: remember position.
>                 assert start is NOT_SET
>                 start = m.start()
>             level += 1
>         else:
>             assert False
>     if level > 0:
>         raise ValueError("unclosed parens remain")
>
>
> if __name__ == "__main__":
>     with open("lan sample text file.txt") as instream:
>         text = instream.read()
>     for chunk in scan(text):
>         print(chunk)
> $ python3 parse.py
> ("xE'", PUT(xx.xxxx.),"'")
> ("TRUuuuth")
Hello Peter! I tried this on my actual working files and it raised this error: "unclosed parens remain". In this case, how can I keep parsing my text files, extracting only the chunks with balanced parentheses and ignoring those that are incomplete?
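
One possible direction, as an untested-on-real-data sketch rather than a definitive answer: instead of raising an error, keep a stack of the positions of open parentheses, record every pair that does get closed, and at the end report only the outermost recorded pairs. The function name `scan_balanced` is made up for this example. It assumes that a ")" always closes the nearest still-open "(", that stray ")" and never-closed "(" can simply be dropped, and (like the code above) that parentheses inside quoted strings are not special.

```python
def scan_balanced(text):
    """Yield outermost balanced (...) chunks, skipping unmatched parens.

    Hypothetical helper: assumes each ")" closes the nearest open "(".
    """
    open_positions = []  # positions of "(" that are not closed yet
    spans = []           # (start, end) slice bounds of every matched pair
    for pos, char in enumerate(text):
        if char == "(":
            open_positions.append(pos)
        elif char == ")":
            if open_positions:
                spans.append((open_positions.pop(), pos + 1))
            # else: stray ")" without an opener, ignore it
    # Anything left in open_positions is an unclosed "(" and is ignored.
    spans.sort()
    last_end = 0
    for start, end in spans:
        # Matched pairs are either nested or disjoint, so a pair that
        # starts before last_end lies inside a chunk already yielded.
        if start >= last_end:
            yield text[start:end]
            last_end = end


if __name__ == "__main__":
    sample = "a (b (c) d) e (f"
    print(list(scan_balanced(sample)))  # prints ['(b (c) d)']
```

With this approach a balanced group nested inside an unclosed one is still returned, for example "(x (y) z" gives "(y)". Whether that is the wanted behaviour depends on the data; dropping everything after the first unclosed "(" would be a different, equally defensible choice.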