Extract sentences in nested parentheses using Python
DL Neil
PythonList at DancesWithMice.info
Mon Dec 2 17:22:19 EST 2019
On 3/12/19 6:00 AM, Peter Otten wrote:
> A S wrote:
> I think I've seen this question before ;)
In addition to the 'other reasons' behind @Peter's comment, this is a
common ComSc worked-problem or assignment (in which case, we'd
appreciate being told that you/the OP is asking for help with
"homework").
>> I am trying to extract all strings in nested parentheses (along with the
>> parentheses itself) in my .txt file. Please see the sample .txt file that
>> I have used in this example here:
>> (https://drive.google.com/open?id=1UKc0ZgY9Fsz5O1rSeBCLqt5dwZkMaQgr).
>>
>> I have tried and done up three different codes but none of them seems to
>> be able to extract all the nested parentheses. They can only extract a
>> portion of the nested parentheses. Any advice on what I've done wrong
>> could really help!
One approach is to research whether there are existing tools or
techniques which may help, and save you from 'reinventing the wheel' -
which, when you think about it, is a re-statement of open-source
objectives.
How does the Python interpreter break-down Python (text) code into its
constituent parts ("tokens") *including* parentheses? Such are made
available in (perhaps) a lesser-known corner of the PSL (Python Standard
Library). Might you be able to use one such tool?
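For example, the tokenize module (one corner of those Language
Services) will happily report every '(' and ')' it meets. A rough
sketch only - and note the caveat that it only copes with text which
happens to tokenise as Python source; arbitrary prose may make it
choke:

import io
import tokenize

sample = "abc(def(gh))ij\n"   # trailing newline keeps the tokeniser happy
for tok in tokenize.generate_tokens(io.StringIO(sample).readline):
    if tok.type == tokenize.OP and tok.string in ("(", ")"):
        print(tok.string, "at column", tok.start[1])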
The ComSc technique which sprang to mind involves "stacks" (a LIFO data
structure) and "RPN" (Reverse Polish Notation). Whereas we like people
to take their turn when it comes to limited resources, eg to form a
"queue" to purchase/pay for goods at the local store, which is "FIFO"
(first-in, first-out); a "stack"/LIFO (last-in, first-out) can be
problematic to put into practical application. There are plenty of
Python implementations or you can 'roll your own' with a list. Again,
I'd likely employ a "deque" from the PSL's collections module (although
as a "stack" rather than as a "double-ended queue"), because the
optimisation comes "free" - free to my laziness, that is, but only after
some kind soul sweated bullets to make it fast (in both senses) for 'the
rest of us'!
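By way of illustration (nothing more), a deque pressed into service as
a stack only ever needs two of its methods:

from collections import deque

stack = deque()
stack.append("abc")      # "push"
stack.append("(def")
print(stack.pop())       # "pop" -> '(def' (last-in, first-out)
print(len(stack))        # 1 - and len() doubles as the nesting depth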
> It's probably easier to understand and implement when you process the
> complete text at once. Then arbitrary splits don't get in the way of your
> quest for ( and ). You just have to remember the position of the first
> opening ( and number of opening parens that have to be closed before you
> take the complete expression:
+1
but as a 'silver surfer', I don't like to be asked to "remember" anything!
> level: 00011112222100
> we need^
Consider:
- original_text (the contents of the .txt file - add buffering if
  volumes are huge)
- current_text (the characters we have processed/"recognised" so far)
- stack (what an original name for such a data-structure! This will
  contain each of the partial parenthetical expressions found, but yet
  to be proven/made complete)
set current_text to NULSTRING
for each current_character in original_text:
    if current_character is LEFT_PARENTHESIS:
        push current_text to stack
        set current_text to NULSTRING
    concatenate current_character to current_text
    if current_character is RIGHT_PARENTHESIS:
        # current_text is now a complete parenthetical expression
        # do with it what you will
        pop the stack
        set current_text to the popped string concatenated with
            current_text (the just-completed p-expn)
Once working: cover 'special cases' (after the above loop), eg an
original_text which doesn't begin and/or end with parentheses; and
error cases, eg pop-ping an empty stack, or thinking things are
finished when the stack is not yet empty - likely events from
unbalanced parentheses!
original_text = "abc(def(gh))ij"
event 1: in-turn, concatenate characters "abc" as current_text
event 2: locate (first) left-parenthesis, push current_text ("abc") to
the stack
event 3: concatenate "(def"
event 4: push, likewise
event 5: concatenate "(gh"
event 6: locate (first) right-parenthesis, concatenate it (it matches
the left-parenthesis beginning the current_text!)
result?: print current_text? - it is now "(gh)"
event 7: pop stack and redefine current_text as "(def(gh)"
event 8: repeat, per event 6
event 9: current_text will now become "(def(gh))"
event 10: (special case - imagine the final right-parenthesis were
missing) run out of input with current_text left as "(def(gh)ij"
event 11: (special case) pop the (not-yet-empty!) stack and report the
unbalanced remainder, "abc(def(gh)ij"
NB not sure of need for a "level" number; but if required, you can infer
that at any time, from the depth/len() of the stack!
NBB being a simple boy, my preference is for a 'single layer' of code,
cf @Peter's generator. Regardless, the processes are "tokenisation" and
"recognition".
At the back of my mind was the notion that you may (eventually) be
required to work with more than parentheses, eg pair-wise
square-brackets and/or quotation-marks. In which case, you will need to
'push' the token itself as well, and check that the tokens pair-up when
'pop-ping' - and (perhaps) recognise a whole list of tokens instead of
the two parenthesis characters alone. If so, I'd take a serious look at
the Python Language Services rather than 'rolling your own'!
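To make the idea concrete (purely illustrative - the PAIRS table and
the function name are mine), here the opening token itself goes onto
the stack and must match its partner when popped; only the pairing
check is shown, not the extraction:

PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())

def check_balanced(text):
    stack = []                            # holds the opening tokens seen
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False              # mismatched or unexpected closer
    return not stack                      # leftovers mean unclosed openers

print(check_balanced("abc(def[gh])ij"))   # True
print(check_balanced("abc(def[gh)]ij"))   # False - the pairs cross over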
Conversely, if this spec is 'it', then you might consider optimising the
searches which 'trigger' the two stack operations, by re-working the
for-loop to use the str.find() method - jumping to whichever parenthesis
is left-most/comes first - assuming LtR text. (apologies if you have
already tried this in one of your previous approaches)
Unfortunately, such an approach likely results in 'layers' of code, and
a generator might well become the tool-of-choice (I say this before
@Peter comes back and (quite deservedly) flays me alive!).
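Something along these lines (a sketch, with a made-up name) shows the
find()-driven scan written as a generator; the stack handling from
earlier would hang off each yielded parenthesis:

def next_paren(text, start=0):
    """Yield (index, character) for each parenthesis, left-most first."""
    while True:
        left = text.find("(", start)
        right = text.find(")", start)
        if left == right == -1:
            return                        # no more parentheses
        if right == -1 or (left != -1 and left < right):
            index = left
        else:
            index = right
        yield index, text[index]
        start = index + 1

for position, paren in next_paren("abc(def(gh))ij"):
    print(position, paren)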
WebRefs:
Python Language Services: https://docs.python.org/3/library/language.html
collections — Container datatypes:
https://docs.python.org/3/library/collections.html
See also your ComSc text/reference materials.
--
Regards =dn