[Tutor] extracting phrases and their memberships from syntax

Fri Feb 13 16:20:28 CET 2009

Pyparsing has a built-in helper called nestedExpr that fits this data
neatly.  Here is the whole script (st_data is the string holding the
bracketed tree from your message):

from pyparsing import nestedExpr

syntax_tree = nestedExpr()
results = syntax_tree.parseString(st_data)

from pprint import pprint
pprint(results.asList())

Prints:

[[['S',
   ['NP-SBJ-1',
    ['NP', ['NNP', 'Rudolph'], ['NNP', 'Agnew']],
    [',', ','],
    ['UCP',
     ['ADJP', ['NP', ['CD', '55'], ['NNS', 'years']], ['JJ', 'old']],
     ['CC', 'and'],
     ['NP',
      ['NP', ['JJ', 'former'], ['NN', 'chairman']],
      ['PP',
       ['IN', 'of'],
       ['NP',
        ['NNP', 'Consolidated'],
        ['NNP', 'Gold'],
        ['NNP', 'Fields'],
        ['NNP', 'PLC']]]]],
    [',', ',']],
   ['VP',
    ['VBD', 'was'],
    ['VP',
     ['VBN', 'named'],
     ['S',
      ['NP-SBJ', ['-NONE-', '*-1']],
      ['NP-PRD',
       ['NP', ['DT', 'a'], ['JJ', 'nonexecutive'], ['NN', 'director']],
       ['PP',
        ['IN', 'of'],
        ['NP',
         ['DT', 'this'],
         ['JJ', 'British'],
         ['JJ', 'industrial'],
         ['NN', 'conglomerate']]]]]]],
   ['.', '.']]]]
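
Once you have the nested list, pulling out each phrase and the words that
belong to it is a short recursive walk.  The following is only a sketch,
assuming that is the goal: the helper names leaves and phrases are my own
(hypothetical, not part of pyparsing), and the sample is the ADJP sub-tree
copied from the output above.  For the full result you would pass
results.asList()[0][0], since the S node sits inside two extra levels of
list.  Note that punctuation and trace nodes such as ['-NONE-', '*-1'] are
not filtered out here.

```python
def leaves(node):
    """Return the words under a node, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]        # a tagged word such as ['CD', '55']
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def phrases(node):
    """Yield (label, words) for every node that contains sub-trees."""
    label, *children = node
    if children and isinstance(children[0], list):
        yield label, leaves(node)
        for child in children:
            yield from phrases(child)


# Sub-tree copied from the printed output above.
subtree = ['ADJP', ['NP', ['CD', '55'], ['NNS', 'years']], ['JJ', 'old']]

found = list(phrases(subtree))
for label, words in found:
    print(label, "->", " ".join(words))
```

Each phrase is reported together with every word beneath it, so a word
shows up once for each phrase it is a member of.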

If you want to delve deeper into this, you can, since the content of the
() groups is so regular.  In essence you reconstruct nestedExpr in your own
code, but you gain more control over the parsed content and more visibility
into it.

Since this is a recursive syntax, you will need to use pyparsing's mechanism
for recursion, which is the Forward class.  Forward is sort of an "I can't
define the whole thing yet, just create a placeholder" placeholder.

syntax_element = Forward()
LPAR,RPAR = map(Suppress,"()")
syntax_tree = LPAR + syntax_element + RPAR

Now in your example, a syntax_element can be one of 4 things:
- a punctuation mark, twice
- a syntax marker followed by one or more syntax_trees
- a syntax marker followed by a word
- a syntax tree

Here is how I define those:

marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
                "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
                "IN NP-SBJ S")
punc = oneOf(", . ! ?")

wordchars = printables.replace("(","").replace(")","")

syntax_element << (
    punc + punc | 
    marker + OneOrMore(Group(syntax_tree)) | 
    marker + Word(wordchars) |
    syntax_tree )

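Note that we use the '<<' operator to "inject" the definition of a
syntax_element - we can't use '=' or we would get a different expression
than the one we used to define syntax_tree.  The parentheses around the
alternatives are needed too, because '<<' binds more tightly than '|'.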
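
To see the placeholder idea in isolation, here is a much smaller grammar
built the same way.  It is only an illustration: the names item and group
and the sample string are made up for this sketch, and it assumes pyparsing
is installed.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphas

LPAR, RPAR = map(Suppress, "()")

# Placeholder: nothing is defined yet, but other expressions can refer to it.
item = Forward()

# group refers to the placeholder object itself.
group = Group(LPAR + ZeroOrMore(item) + RPAR)

# Fill in the placeholder in place.  Using '=' here would rebind the name
# item to a new expression and leave group pointing at the empty Forward.
item << (Word(alphas) | group)

parsed = group.parseString("(a (b c) d)").asList()
print(parsed)
```

Because group and item refer to each other through the same Forward
object, the nesting can go as deep as the input requires.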

Now parse the string, and voila!  Same as before.

Here is the entire script:

from pyparsing import (Suppress, oneOf, Forward, OneOrMore, Word,
                       printables, Group)

syntax_element = Forward()
LPAR,RPAR = map(Suppress,"()")
syntax_tree = LPAR + syntax_element + RPAR

marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
                "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
                "IN NP-SBJ S")
punc = oneOf(", . ! ?")

wordchars = printables.replace("(","").replace(")","")

syntax_element << (
    punc + punc | 
    marker + OneOrMore(Group(syntax_tree)) | 
    marker + Word(wordchars) |
    syntax_tree )

results = syntax_tree.parseString(st_data)
from pprint import pprint
pprint(results.asList())

-- Paul