Extract sentences in nested parentheses using Python
Peter Otten
__peter__ at web.de
Mon Dec 2 12:00:50 EST 2019
A S wrote:
I think I've seen this question before ;)
> I am trying to extract all strings in nested parentheses (along with the
> parentheses itself) in my .txt file. Please see the sample .txt file that
> I have used in this example here:
> (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
>
> I have tried and done up three different codes but none of them seems to
> be able to extract all the nested parentheses. They can only extract a
> portion of the nested parentheses. Any advice on what I've done wrong
> could really help!
>
> Here are the three codes I have done so far:
>
> 1st attempt:
>
> import re
> from os.path import join
>
> def balanced_braces(args):
>     parts = []
>     for arg in args:
>         if '(' not in arg:
>             continue

There could still be a ")" that you miss

>         chars = []
>         n = 0
>         for c in arg:
>             if c == '(':
>                 if n > 0:
>                     chars.append(c)
>                 n += 1
>             elif c == ')':
>                 n -= 1
>                 if n > 0:
>                     chars.append(c)
>                 elif n == 0:
>                     parts.append(''.join(chars).lstrip().rstrip())
>                     chars = []
>             elif n > 0:
>                 chars.append(c)
>     return parts
It's probably easier to understand and implement when you process the
complete text at once. Then arbitrary splits don't get in the way of your
quest for ( and ). You just have to remember the position of the first
opening ( and the number of opening parens that still have to be closed
before you take the complete expression:
level: 00011112222100
text:  abc(def(gh))ij
  when we are here^
   we need^
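
To make the "level" line above less magic, here is a minimal sketch that
recomputes it for the same sample string. The helper name nesting_levels is
made up for this illustration and is not part of parse.py; it assumes that a
parenthesis counts as belonging to the level it opens or closes.

```python
def nesting_levels(text):
    """Return the nesting level of every character in text.

    A "(" is reported at the level it opens, a ")" at the level it closes,
    so both parens of a pair show the same number.
    """
    levels = []
    level = 0
    for c in text:
        if c == "(":
            level += 1
            levels.append(level)
        elif c == ")":
            levels.append(level)
            level -= 1
        else:
            levels.append(level)
    return levels


text = "abc(def(gh))ij"
print("level:", "".join(str(n) for n in nesting_levels(text)))
print("text: ", text)
```

The outermost expression ends at the ")" whose level is 1, and it starts at
the first "(" that raised the level from 0 to 1.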
A tentative implementation:
$ cat parse.py
import re

NOT_SET = object()

def scan(text):
    level = 0
    start = NOT_SET
    for m in re.compile("[()]").finditer(text):
        if m.group() == ")":
            level -= 1
            if level < 0:
                raise ValueError("underflow: more closing than opening parens")
            if level == 0:
                # outermost closing parenthesis:
                # deliver enclosed string including parens.
                yield text[start:m.end()]
                start = NOT_SET
        elif m.group() == "(":
            if level == 0:
                # outermost opening parenthesis: remember position.
                assert start is NOT_SET
                start = m.start()
            level += 1
        else:
            assert False
    if level > 0:
        raise ValueError("unclosed parens remain")

if __name__ == "__main__":
    with open("lan sample text file.txt") as instream:
        text = instream.read()
    for chunk in scan(text):
        print(chunk)
$ python3 parse.py
("xE'", PUT(xx.xxxx.),"'")
("TRUuuuth")
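
If you want to convince yourself that splitting into lines is what hurts,
here is a rough sketch of the same idea without the regex. The function
outer_groups and the sample strings are invented for this illustration; it is
a simplified variant, not the parse.py shown above.

```python
def outer_groups(text):
    """Return every outermost parenthesised chunk of text, parens included."""
    groups = []
    level = 0
    start = None
    for pos, c in enumerate(text):
        if c == "(":
            if level == 0:
                # outermost opening parenthesis: remember position.
                start = pos
            level += 1
        elif c == ")":
            if level == 0:
                raise ValueError("underflow: more closing than opening parens")
            level -= 1
            if level == 0:
                # outermost closing parenthesis: take the whole chunk.
                groups.append(text[start:pos + 1])
    if level > 0:
        raise ValueError("unclosed parens remain")
    return groups


# The sample string from the diagram.
print(outer_groups("abc(def(gh))ij"))

# An expression that spans two lines is found when the text is processed
# as a whole...
multi = "x = (a,\n  (b))\ny = (c)"
print(outer_groups(multi))

# ...but the same text fed in line by line looks unbalanced.
for line in multi.splitlines():
    try:
        print(repr(line), "->", outer_groups(line))
    except ValueError as err:
        print(repr(line), "-> error:", err)
```

Only the last line survives the per-line treatment; the first two lines are
each reported as unbalanced.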