dalke at acm.org
Mon Apr 24 10:58:55 CEST 2000
>You might think about looking at chart parsing technology, sometimes
>used in natural language processing.
What I have about chart parsing is from Grune&Jacobs' "Parsing
Techniques". They say that it's the same as CYK, but then elaborate
that variations of CYK are identical to variations of chart parsing.
Now if I only recall how CYK works. Well, that's why I printed
out that document. :)
>You can use a chart parser to extract all parseable sub-regions from a
>sequence of tokens. Just turn off the grammar rules you're not interested
>and you should be all set (provided the grammar rules you care about don't
>generate false positives due to the lack of contextual information that
>would be neccessary for a complete parse).
I don't quite follow that. I think there's too much "self similarity"
in what I have because it's a large number of records in a row. Again,
I've got to read more.
>Just a caveat: in typical program parsers much more time is spent in
>analysis than in actual parsing. This tends to indicate that using YACC is
>probably about as well as you'll do for speed and trying to optimize away
>parsing time is misguided.
I know in perl that my full parser code was about an order of magnitude
slower than the parse-only-exactly-what-I-want parser.
For a record like:
ID BPT1_BOVIN STANDARD; PRT; 100 AA.
SQ SEQUENCE 100 AA; 10903 MW; 6A778A4AD763FB19 CRC64;
MKMSRLCLSV ALLVLLGTLA ASTPGCDTSN QAKAQRPDFC LEPPYTGPCK ARIIRYFYNA
KAGLCQTFVY GGCRAKRNNF KSAEDCMRTC GGAIGPWENL
where I'm interested in the first word of the ID line and the sequence
lines, I should be able to have a (fake mxTools-like) implementation
while not eof:
skip 5 characters
read until ' '
read until 'S' and send read text to callback "id"
skip 46 characters (the start of the '[')
while text != " " and text != "//":
read up to '\n'
while text != "//":
while text == ' ':
skip the ' '
read up to ' ' or '\n' and send read text to callback "sq"
skip 3 characters (the // followed by the \n)
So it seems what I want is better performance in the lexer part, which
agrees with your point.
dalke at acm.org
More information about the Python-list