Extract sentences in nested parentheses using Python
A S
aishan0403 at gmail.com
Thu Dec 5 10:31:54 EST 2019
On Tuesday, 3 December 2019 23:48:21 UTC+8, Peter Otten wrote:
> A S wrote:
>
> > On Tuesday, 3 December 2019 01:01:25 UTC+8, Peter Otten wrote:
> >> A S wrote:
> >>
> >> I think I've seen this question before ;)
> >>
> >> > I am trying to extract all strings in nested parentheses (along with
> >> > the parentheses themselves) from my .txt file. Please see the sample
> >> > .txt file that I have used in this example here:
> >> > (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
> >> >
> >> > I have written three different attempts, but none of them seems
> >> > to be able to extract all the nested parentheses. They can only extract
> >> > a portion of the nested parentheses. Any advice on what I've done wrong
> >> > would really help!
> >> >
> >> > Here are the three attempts I have made so far:
> >> >
> >> > 1st attempt:
> >> >
> >> > import re
> >> > from os.path import join
> >> >
> >> > def balanced_braces(args):
> >> >     parts = []
> >> >     for arg in args:
> >> >         if '(' not in arg:
> >> >             continue
> >>
> >> There could still be a ")" that you miss
> >>
> >> >         chars = []
> >> >         n = 0
> >> >         for c in arg:
> >> >             if c == '(':
> >> >                 if n > 0:
> >> >                     chars.append(c)
> >> >                 n += 1
> >> >             elif c == ')':
> >> >                 n -= 1
> >> >                 if n > 0:
> >> >                     chars.append(c)
> >> >                 elif n == 0:
> >> >                     parts.append(''.join(chars).lstrip().rstrip())
> >> >                     chars = []
> >> >             elif n > 0:
> >> >                 chars.append(c)
> >> >     return parts
> >>
> >> It's probably easier to understand and implement when you process the
> >> complete text at once. Then arbitrary splits don't get in the way of your
> >> quest for ( and ). You just have to remember the position of the first
> >> opening ( and the number of opening parens that have to be closed before
> >> you take the complete expression:
> >>
> >> level: 00011112222100
> >> text:  abc(def(gh))ij
> >>   when we are here^
> >>    we need^
> >>
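The level row in the diagram above can be reproduced with a few lines of Python. This is only an illustrative sketch: `levels` is a hypothetical helper, not part of parse.py, and the two-line input is made up to show how splitting the text line by line separates an opening paren from its closing one.

```python
def levels(text):
    """Return the nesting level of every character as a digit string.

    A parenthesis is counted at the level it opens or closes, so "(" and
    its matching ")" carry the same digit. Assumes fewer than ten levels.
    """
    out = []
    level = 0
    for ch in text:
        if ch == "(":
            level += 1
            out.append(level)
        elif ch == ")":
            out.append(level)
            level -= 1
        else:
            out.append(level)
    return "".join(str(n) for n in out)


print(levels("abc(def(gh))ij"))  # 00011112222100

# Hypothetical input: the expression opens on one line and closes on the next.
text = "(a,\n b)"
print([levels(line) for line in text.splitlines()])  # ['111', '000']
```

Taken line by line, the second line never sees an opening paren, so a per-line check such as `'(' not in arg` skips it even though it holds the closing one.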
> >> A tentative implementation:
> >>
> >> $ cat parse.py
> >> import re
> >>
> >> NOT_SET = object()
> >>
> >> def scan(text):
> >>     level = 0
> >>     start = NOT_SET
> >>     for m in re.compile("[()]").finditer(text):
> >>         if m.group() == ")":
> >>             level -= 1
> >>             if level < 0:
> >>                 raise ValueError(
> >>                     "underflow: more closing than opening parens")
> >>             if level == 0:
> >>                 # outermost closing parenthesis:
> >>                 # deliver enclosed string including parens.
> >>                 yield text[start:m.end()]
> >>                 start = NOT_SET
> >>         elif m.group() == "(":
> >>             if level == 0:
> >>                 # outermost opening parenthesis: remember position.
> >>                 assert start is NOT_SET
> >>                 start = m.start()
> >>             level += 1
> >>         else:
> >>             assert False
> >>     if level > 0:
> >>         raise ValueError("unclosed parens remain")
> >>
> >>
> >> if __name__ == "__main__":
> >>     with open("lan sample text file.txt") as instream:
> >>         text = instream.read()
> >>     for chunk in scan(text):
> >>         print(chunk)
> >> $ python3 parse.py
> >> ("xE'", PUT(xx.xxxx.),"'")
> >> ("TRUuuuth")
> >
> > Hello Peter! I tried this on my actual working files and it returned this
> > error: "unclosed parens remain". In this case, how can I keep parsing my
> > text files, extracting only the expressions with balanced parentheses
> > and ignoring those that are incomplete?
>
> import sys
>
> filenames = ...
> for filename in filenames:
>     with open(filename) as instream:
>         text = instream.read()
>     try:
>         chunks = list(scan(text))
>     except ValueError as err:
>         print(f"{err} in file {filename!r}", file=sys.stderr)
>     else:
>         for chunk in chunks:
>             print(chunk)
hey Peter, it works! Thank you :)
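
One thing to keep in mind: the try/except loop above drops a whole file as soon as a single parenthesis in it is unbalanced. If the balanced groups from such a file should be kept as well, tracking the open positions on a stack is one possible approach. The following is only a sketch: `scan_balanced` is a hypothetical name, and silently ignoring unmatched parens is an assumed policy that may not suit every file.

```python
def scan_balanced(text):
    """Return the outermost balanced "(...)" groups found in text.

    Unmatched "(" and ")" are ignored instead of raising an error
    (an assumed policy, chosen for illustration).
    """
    stack = []  # positions of "(" still waiting for a match
    pairs = []  # (start, end) slices of every matched pair
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")" and stack:
            pairs.append((stack.pop(), pos + 1))

    # Matched pairs never cross, so after sorting by start position a pair
    # that begins before the previous kept pair ends is nested inside it.
    chunks = []
    last_end = 0
    for start, end in sorted(pairs):
        if start >= last_end:
            chunks.append(text[start:end])
            last_end = end
    return chunks


print(scan_balanced("abc(def(gh))ij"))  # ['(def(gh))']
print(scan_balanced("a) (b (c)) (d"))   # ['(b (c))']
```

With this policy an unclosed outer paren does not hide a balanced inner one: `scan_balanced("(a (b) c")` gives `['(b)']`.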