Extract sentences in nested parentheses using Python
Peter Otten
__peter__ at web.de
Tue Dec 3 10:47:49 EST 2019
A S wrote:
> On Tuesday, 3 December 2019 01:01:25 UTC+8, Peter Otten wrote:
>> A S wrote:
>>
>> I think I've seen this question before ;)
>>
>> > I am trying to extract all strings in nested parentheses (along with
>> > the parentheses themselves) in my .txt file. Please see the sample .txt
>> > file that I have used in this example here:
>> > (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
>> >
>> > I have tried three different approaches, but none of them seems to be
>> > able to extract all the nested parentheses. They can only extract a
>> > portion of the nested parentheses. Any advice on what I've done wrong
>> > would really help!
>> >
>> > Here are my three attempts so far:
>> >
>> > 1st attempt:
>> >
>> > import re
>> > from os.path import join
>> >
>> > def balanced_braces(args):
>> >     parts = []
>> >     for arg in args:
>> >         if '(' not in arg:
>> >             continue
>>
>> There could still be a ")" that you miss
>>
>> >         chars = []
>> >         n = 0
>> >         for c in arg:
>> >             if c == '(':
>> >                 if n > 0:
>> >                     chars.append(c)
>> >                 n += 1
>> >             elif c == ')':
>> >                 n -= 1
>> >                 if n > 0:
>> >                     chars.append(c)
>> >                 elif n == 0:
>> >                     parts.append(''.join(chars).lstrip().rstrip())
>> >                     chars = []
>> >             elif n > 0:
>> >                 chars.append(c)
>> >     return parts
>>
>> It's probably easier to understand and implement when you process the
>> complete text at once. Then arbitrary splits don't get in the way of your
>> quest for ( and ). You just have to remember the position of the first
>> opening ( and the number of opening parens that still have to be closed
>> before you take the complete expression:
>>
>> level: 00011112222100
>> text:  abc(def(gh))ij
>>   when we are here^
>>    we need^
>>
>> A tentative implementation:
>>
>> $ cat parse.py
>> import re
>>
>> NOT_SET = object()
>>
>> def scan(text):
>>     level = 0
>>     start = NOT_SET
>>     for m in re.compile("[()]").finditer(text):
>>         if m.group() == ")":
>>             level -= 1
>>             if level < 0:
>>                 raise ValueError(
>>                     "underflow: more closing than opening parens"
>>                 )
>>             if level == 0:
>>                 # outermost closing parenthesis:
>>                 # deliver enclosed string including parens.
>>                 yield text[start:m.end()]
>>                 start = NOT_SET
>>         elif m.group() == "(":
>>             if level == 0:
>>                 # outermost opening parenthesis: remember position.
>>                 assert start is NOT_SET
>>                 start = m.start()
>>             level += 1
>>         else:
>>             assert False
>>     if level > 0:
>>         raise ValueError("unclosed parens remain")
>>
>>
>> if __name__ == "__main__":
>>     with open("lan sample text file.txt") as instream:
>>         text = instream.read()
>>     for chunk in scan(text):
>>         print(chunk)
>> $ python3 parse.py
>> ("xE'", PUT(xx.xxxx.),"'")
>> ("TRUuuuth")
>
> Hello Peter! I tried this on my actual working files and it raised this
> error: "unclosed parens remain". In this case, how can I continue to parse
> through my text files, extracting only those with balanced parentheses
> and ignoring those that are incomplete?
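import sys

filenames = ...
for filename in filenames:
    with open(filename) as instream:
        text = instream.read()
    try:
        # list() runs the generator to the end, so a ValueError raised
        # anywhere in the file surfaces before anything is printed.
        chunks = list(scan(text))
    except ValueError as err:
        print(f"{err} in file {filename!r}", file=sys.stderr)
    else:
        for chunk in chunks:
            print(chunk)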
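
Note that the loop above drops a whole file as soon as scan() raises. If you would rather keep the balanced chunks from such a file and only skip the broken parts, something along these lines might work. This is only a rough sketch: scan_lenient() is a hypothetical helper, not part of parse.py, and it assumes that "ignore" means silently skipping a stray ")" or an unclosed "(" while still returning every outermost group that does balance.

```python
import re

PAREN = re.compile(r"[()]")


def scan_lenient(text):
    """Return the outermost balanced "(...)" chunks of text, in order.

    Hypothetical helper (an assumption, not part of parse.py): a stray ")"
    and an unclosed "(" are skipped instead of raising ValueError.
    """
    open_positions = []  # start offsets of "(" still waiting for a ")"
    spans = []           # (start, end) of the balanced groups kept so far
    for m in PAREN.finditer(text):
        if m.group() == "(":
            open_positions.append(m.start())
        elif open_positions:
            start = open_positions.pop()
            # Groups that began after `start` are nested inside this one,
            # so only the enclosing group is kept.
            while spans and spans[-1][0] > start:
                spans.pop()
            spans.append((start, m.end()))
        # else: a ")" without a matching "(" is ignored.
    return [text[start:end] for start, end in spans]


if __name__ == "__main__":
    sample = "a (b (c) d) e ) f ( g (h)"
    for chunk in scan_lenient(sample):
        print(chunk)
```

Unlike scan(), this returns a list rather than yielding as it goes, because an inner group can only be reported once it is clear that no enclosing group closes later. For the sample string it keeps "(b (c) d)" and "(h)", and skips both the stray ")" and the unclosed "(" before "g".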