[Tutor] parsing--is this right?

Paul Tremblay phthenry@earthlink.net
Tue, 11 Jun 2002 14:55:47 -0400


On Tue, Jun 11, 2002 at 10:19:00AM -0700, Danny Yoo wrote:

> 
> Actually, the parse() function is meant to be standalone: it's not a
> method.  I'm only using Chunk()  to group the data together, and to make
> it easier to extract the command type later on.

Ah, I missed these lines in your original code:

    new_chunk = Chunk('text', tokens.pop(0))
    return new_chunk

new_chunk is your object. Yes, the parser by itself is not part
of the Chunk class. 

(Part of the confusion has to do with how indentation marks
different parts of code in python, a feature that first caused me
to go "yuck," but one which I now love. When reading someone
else's code, there is a tendency to miss what is nested.)

One of the confusing things for me is that in your code, a
function calls itself. This makes me really think. I doubt there
is another way to write this code, since the text being
tokenized has a nested structure. Let me repeat: this makes me
think! It is more complicated than the normal way I program
which is to get a chunk of text, send it to some subroutine,
get another chunk, send it to another subroutine. Nonetheless,
looking at the code, I see exactly how it works.
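
To convince myself, here is a minimal sketch of the recursive idea as I
understand it. It is not your code: the tokenize() helper, its regular
expression, and the use of plain nested lists in place of Chunk objects
are my own assumptions, chosen only to keep the example short.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: each brace is its own token, everything
    # else between braces is one text token.
    return [t for t in re.split(r'([{}])', text) if t]

def parse(tokens):
    """Consume tokens from the front of the list and return one item."""
    token = tokens.pop(0)
    if token == '{':
        group = []
        # Keep parsing children until the matching close brace shows up.
        while tokens and tokens[0] != '}':
            group.append(parse(tokens))   # the function calls itself here
        if tokens:
            tokens.pop(0)                 # discard the '}'
        return group
    return token                          # plain text: nothing nested

tokens = tokenize('{a{b}c}')
tree = parse(tokens)
```

Every call pops from the same list, so when an inner call returns, the
outer call simply carries on with whatever tokens are left. That shared
list is how the whole input gets read even though no single call loops
over all of it.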

>
> The rtf parser I wrote only recognizes two categories of tokens: the
> beginning of boundaries (brackets "{}"), and everything else.  Since it
> groups these tokens into those two categories, it doesn't have to worry
> about newlines.  That's why a lot of parsers are paired together with
> tokenizers, so that the parser can avoid thinking about content, and
> concentrate more on categories.
> 

Okay, so from here how would you change this text? I would want
it to look like this:

<footnote> <emph>an italicized word</emph> <emph>maybe another
italicized word</emph> text </footnote>


> 
> > (1)I don't understand how this method continues to read each item in the
> > list.


I understood how pop worked, but I didn't see how your code kept
using pop, because I didn't see that the routine called itself.

Last question: How do lexers (or parsers?) like plex or SPARK
fit into our model? In actuality, I simplified my problem, though
not by too much. A more representative sample of rtf code
looks like this:

1 \pard\plain This is normal text
2 \par
3 \pard\plain  
4 \pard\plain \s1\fi720 this text has an assigned style"\par


5 \pard\plain Now a footnote. As Schmoo said "Be 
6 great!"{\fs18\up6 \chftn {\footnote \pard\plain \s246 \fs20 
7 {\fs18\up6 \chftn }footnote at bottom of page}}\par

The above text uses the delimiter "\pard" to mark a paragraph
style. If the Word document has used a style, then the next bit
of information, the \s1 found in the fourth line, tells what
type of information the paragraph contains. For example, in
Microsoft Word, you might have called all paragraphs that indent
1 inch a "block" style. You might call all paragraphs that contain
quotes a "quote" style. Microsoft Word labels them \s1 and \s2,
or "style 1" and "style 2."

Hence, the TYPE now becomes \pard \s1. 

We need to know when this style ends and a new one begins. Rtf
says that a new style begins when you encounter another \pard
delimiter. There is an exception: if this \pard delimiter is
within a footnote, the previous style has not terminated.

So our model becomes more complicated. However, the plex lexer
would handle this simplified model easily. As you probably know
from using SPARK, you can set states in the lexer. Hence, once
the lexer is in the "footnote" state, it treats the delimiter
\pard totally differently than it would if it found it outside of
that state.
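
To make the state idea concrete, here is a rough sketch in plain Python.
It deliberately does not use plex or SPARK, whose APIs I won't try to
reproduce from memory. The TOKEN pattern, the function name, and the
'new-style' / 'ignored' labels are all assumptions of mine, and the
pattern covers only the handful of control words in my sample.

```python
import re

# Hypothetical token pattern: a brace, a control word such as \pard or
# \fs18, or a run of ordinary text.
TOKEN = re.compile(r'[{}]|\\[a-z]+-?\d*|[^{}\\]+')

def classify_pards(rtf):
    """Label each paragraph-reset control word by the state it is seen in."""
    depth = 0
    footnote_depth = None   # brace depth of the group holding the footnote
    events = []
    for match in TOKEN.finditer(rtf):
        token = match.group()
        if token == '{':
            depth += 1
        elif token == '}':
            if footnote_depth == depth:
                footnote_depth = None      # leaving the "footnote" state
            depth -= 1
        elif token == '\\footnote' and footnote_depth is None:
            footnote_depth = depth         # entering the "footnote" state
        elif token == '\\pard':
            inside = footnote_depth is not None
            events.append('ignored' if inside else 'new-style')
    return events

sample = (r'\pard\plain Now a footnote.{\fs18\up6 \chftn '
          r'{\footnote \pard\plain \s246 footnote text}}\par '
          r'\pard\plain next paragraph')
events = classify_pards(sample)
```

The only "state" here is footnote_depth: while it is set, a \pard does
not end the enclosing paragraph style, and the state is dropped when the
brace that opened the footnote group is closed.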

I'm not necessarily looking for a specific solution here; I am
just trying to learn about parsers.

Thanks for your help!

Paul

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************