partial parsing?

Andrew Dalke dalke at acm.org
Mon Apr 24 10:58:55 CEST 2000


Dave Abrahams:
>You might think about looking at chart parsing technology, sometimes
>used in natural language processing.


What I have about chart parsing is from Grune&Jacobs' "Parsing
Techniques".  They say that it's the same as CYK, but then elaborate
that variations of CYK are identical to variations of chart parsing.
Now if I only recall how CYK works.  Well, that's why I printed
out that document.  :)

>You can use a chart parser to extract all parseable sub-regions from a
>sequence of tokens. Just turn off the grammar rules you're not interested
in
>and you should be all set (provided the grammar rules you care about don't
>generate false positives due to the lack of contextual information that
>would be neccessary for a complete parse).

I don't quite follow that.  I think there's too much "self similarity"
in what I have because it's a large number of records in a row.  Again,
I've got to read more.

>Just a caveat: in typical program parsers much more time is spent in
lexical
>analysis than in actual parsing. This tends to indicate that using YACC is
>probably about as well as you'll do for speed and trying to optimize away
>parsing time is misguided.


I know in perl that my full parser code was about an order of magnitude
slower than the parse-only-exactly-what-I-want parser.

For a record like:
ID   BPT1_BOVIN     STANDARD;      PRT;   100 AA.
AC   P00974;
[...]
SQ   SEQUENCE   100 AA;  10903 MW;  6A778A4AD763FB19 CRC64;
     MKMSRLCLSV ALLVLLGTLA ASTPGCDTSN QAKAQRPDFC LEPPYTGPCK ARIIRYFYNA
     KAGLCQTFVY GGCRAKRNNF KSAEDCMRTC GGAIGPWENL
//

where I'm interested in the first word of the ID line and the sequence
lines, I should be able to have a (fake mxTools-like) implementation
like:

while not eof:
 skip 5 characters
 read until ' '
 read until 'S' and send read text to callback "id"
 skip 46 characters (the start of the '[')
 while text != "    " and text != "//":
   read up to '\n'
 while text != "//":
   while text == ' ':
     skip the ' '
     read up to ' ' or '\n' and send read text to callback "sq"
 skip 3 characters (the // followed by the \n)

So it seems what I want is better performance in the lexer part, which
agrees with your point.

                    Andrew
                    dalke at acm.org






More information about the Python-list mailing list