[Tutor] extracting phrases and their memberships from syntax

Emad Nawfal (عماد نوفل) emadnawfal at gmail.com
Sat Feb 14 15:59:08 CET 2009


On Fri, Feb 13, 2009 at 10:20 AM, Paul McGuire <ptmcg at austin.rr.com> wrote:

> Pyparsing has a built-in helper called nestedExpr that fits neatly in with
> this data.  Here is the whole script:
>
> from pyparsing import nestedExpr
>
> # st_data holds the bracketed syntax-tree string to parse (defined elsewhere)
> syntax_tree = nestedExpr()
> results = syntax_tree.parseString(st_data)
>
> from pprint import pprint
> pprint(results.asList())
>
>
> Prints:
>
> [[['S',
>   ['NP-SBJ-1',
>    ['NP', ['NNP', 'Rudolph'], ['NNP', 'Agnew']],
>    [',', ','],
>    ['UCP',
>     ['ADJP', ['NP', ['CD', '55'], ['NNS', 'years']], ['JJ', 'old']],
>     ['CC', 'and'],
>     ['NP',
>      ['NP', ['JJ', 'former'], ['NN', 'chairman']],
>      ['PP',
>       ['IN', 'of'],
>       ['NP',
>        ['NNP', 'Consolidated'],
>        ['NNP', 'Gold'],
>        ['NNP', 'Fields'],
>        ['NNP', 'PLC']]]]],
>    [',', ',']],
>   ['VP',
>    ['VBD', 'was'],
>    ['VP',
>     ['VBN', 'named'],
>     ['S',
>      ['NP-SBJ', ['-NONE-', '*-1']],
>      ['NP-PRD',
>       ['NP', ['DT', 'a'], ['JJ', 'nonexecutive'], ['NN', 'director']],
>       ['PP',
>        ['IN', 'of'],
>        ['NP',
>         ['DT', 'this'],
>         ['JJ', 'British'],
>         ['JJ', 'industrial'],
>         ['NN', 'conglomerate']]]]]]],
>   ['.', '.']]]]
>
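
For anyone who wants to try the nestedExpr approach without the original
data, here is a minimal sketch. The sample string is a made-up stand-in for
st_data (which is not shown in this message), and it assumes pyparsing is
installed.

```python
from pprint import pprint

from pyparsing import nestedExpr

# Hypothetical sample in the same bracketed style as the tree printed above;
# it is not the st_data from the thread.
sample = "( (S (NP (DT the) (NN cat)) (VP (VBD sat))) )"

# nestedExpr() defaults to "(" and ")" as the opening and closing delimiters.
tree = nestedExpr().parseString(sample).asList()
pprint(tree)
```

Each pair of parentheses becomes one nested list, so the outer wrapper
parentheses show up as extra list levels around the 'S' node.
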
> If you want to delve deeper into this, you could, since the content of the
> () groups is so regular.  In essence you reconstruct nestedExpr in your own
> code, but you gain more control over, and visibility into, the parsed
> content.
>
> Since this is a recursive syntax, you will need to use pyparsing's mechanism
> for recursion, which is the Forward class.  Forward is sort of an "I can't
> define the whole thing yet, just create a placeholder" placeholder.
>
> syntax_element = Forward()
> LPAR,RPAR = map(Suppress,"()")
> syntax_tree = LPAR + syntax_element + RPAR
>
> Now in your example, a syntax_element can be one of 4 things:
> - a punctuation mark, twice
> - a syntax marker followed by one or more syntax_trees
> - a syntax marker followed by a word
> - a syntax tree
>
> Here is how I define those:
>
> marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
>                "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
>                "IN NP-SBJ S")
> punc = oneOf(", . ! ?")
>
> wordchars = printables.replace("(","").replace(")","")
>
> syntax_element << (
>    punc + punc |
>    marker + OneOrMore(Group(syntax_tree)) |
>    marker + Word(wordchars) |
>    syntax_tree )
>
> Note that we use '<<' operator to "inject" the definition of a
> syntax_element - we can't use '=' or we would get a different expression
> than the one we used to define syntax_tree.
>
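
The placeholder idea can be seen in isolation with a much smaller grammar.
This is only a sketch: the names item and group are made up for illustration,
and it uses the '<<=' operator, which newer pyparsing releases accept
alongside '<<'.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphas

LPAR, RPAR = map(Suppress, "()")

# Placeholder: 'item' is referenced now and defined a few lines later.
item = Forward()

# A group is a parenthesized run of items, which may themselves be groups.
group = Group(LPAR + ZeroOrMore(item) + RPAR)

# Fill in the placeholder in place. A plain 'item = ...' would bind the name
# to a new object and leave the Forward inside 'group' undefined.
item <<= Word(alphas) | group

nested = group.parseString("(a (b c) d)").asList()
print(nested)
```

Because 'group' holds a reference to the original Forward object, the
definition has to be injected into that same object rather than assigned to
the name.
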
> Now parse the string, and voila!  Same as before.
>
> Here is the entire script:
>
> from pyparsing import (nestedExpr, Suppress, oneOf, Forward, OneOrMore,
>                        Word, printables, Group)
>
> syntax_element = Forward()
> LPAR,RPAR = map(Suppress,"()")
> syntax_tree = LPAR + syntax_element + RPAR
>
> marker = oneOf("VBD ADJP VBN JJ DT PP NN UCP NP-PRD "
>                "NP NNS NNP CC NP-SBJ-1 CD VP -NONE- "
>                "IN NP-SBJ S")
> punc = oneOf(", . ! ?")
>
> wordchars = printables.replace("(","").replace(")","")
>
> syntax_element << (
>    punc + punc |
>    marker + OneOrMore(Group(syntax_tree)) |
>    marker + Word(wordchars) |
>    syntax_tree )
>
> results = syntax_tree.parseString(st_data)
> from pprint import pprint
> pprint(results.asList())
>
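
Once the tree is a nested list, pulling out each phrase with the words it
covers needs no parser at all. The sketch below uses only plain Python; the
helper names leaves and phrases are hypothetical, and it assumes every node
has the shape [label, child, ...] as in the printed output above.

```python
def leaves(node):
    """Return the words under a [label, child, ...] node, left to right."""
    label, *children = node
    words = []
    for child in children:
        if isinstance(child, list):
            words.extend(leaves(child))   # descend into a sub-node
        else:
            words.append(child)           # a plain string is a word
    return words


def phrases(node):
    """Yield (label, words) for every node that contains sub-nodes."""
    label, *children = node
    if any(isinstance(child, list) for child in children):
        yield label, leaves(node)
        for child in children:
            if isinstance(child, list):
                yield from phrases(child)


# Small hand-written tree in the same shape as the parsed results.
tree = ['S',
        ['NP', ['DT', 'the'], ['NN', 'cat']],
        ['VP', ['VBD', 'sat']]]

found = list(phrases(tree))
for label, words in found:
    print(label, words)
```

With the nestedExpr output shown earlier, the 'S' node sits two list levels
down, so results.asList()[0][0] would be the node to pass in.
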
> -- Paul
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
>


Thank you so much Paul, Kent, and Hoftkamp.
I was asking what the right tools were, and I got two fully functional
scripts back, much more than I had expected.
I'm planning to use these scripts instead of the Perl one. I've also started
with PyParsing, as it seems to be a little easier to understand than PLY.
Thank you again,
-- 
لا أعرف مظلوما تواطأ الناس علي هضمه ولا زهدوا في إنصافه كالحقيقة.....محمد
الغزالي
"No victim has ever been more repressed and alienated than the truth"

Emad Soliman Nawfal
Indiana University, Bloomington