Parsing

Phil Hunt philh at vision25.demon.co.uk
Tue May 4 10:59:56 EDT 1999


In article <372E4543.68A125EC at prescod.net>
           paul at prescod.net "Paul Prescod" writes:
> I am using Aycock's package to handle some parsing but I am having trouble
> because the language I am parsing is highly context sensitive. I don't
> have any trouble dealing with the context-sensitivity in the so-called
> "context free grammar" part of the package (the parser) but in the scanner
> it is killing me.
> 
> Let's pretend I am parsing a tagged (but non-SGML) language where there is
> an element "URL". Within "URL" elements, the characters < and > are
> illegal: they must be escaped as \< and \>.
> 
> Elsewhere they are not. Here is the grammar I would *like* to write
> (roughly):
> 
> Element ::= <URL> urlcontent </URL>
> urlcontent ::= ([^<>\/:]* ("\<"|"\>"|":"|"/"|"\\"))*
> Element ::= <NOT-A-URL> anychar* </NOT-A-URL>

I am currently writing a stream library for Python. It includes an
abstract PeekStream class, which defines an abstract peek() method
for looking ahead at what's in the stream without reading it, plus
several parsing methods implemented on top of peek() -- e.g.
isNextSkip(str), which, if the next characters in the input stream
are str, returns true and consumes them, and otherwise returns false
and leaves the stream untouched.
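
To make that concrete, here is a rough sketch of how the string-backed
variant might look. The class and method names are the ones described
above; the implementation details are my own guesses at the obvious way
to write them:

```python
class PeekString:
    """A PeekStream backed by an in-memory string (sketch)."""

    def __init__(self, s):
        self.s = s
        self.pos = 0

    def hasMoreChars(self):
        return self.pos < len(self.s)

    def peek(self, n=1):
        # Look at the next n characters without consuming them.
        return self.s[self.pos:self.pos + n]

    def isNextSkip(self, str):
        # If the next characters are str, consume them and return true;
        # otherwise return false and leave the position unchanged.
        if self.peek(len(str)) == str:
            self.pos = self.pos + len(str)
            return True
        return False

    def readToAfter(self, str):
        # Read up to and past str, returning the text before it.
        i = self.s.find(str, self.pos)
        if i < 0:
            raise ValueError('delimiter not found: ' + str)
        content = self.s[self.pos:i]
        self.pos = i + len(str)
        return content
```
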

Your language would be processed something like this:

# use one of these to set up the PeekStream, depending on whether
# input comes from a string or file
ps = PeekString(someString)
ps = PeekFile(open('filename'))

while ps.hasMoreChars():
   if ps.isNextSkip('<URL>'):
      content = ps.readToAfter('</URL>')
      processUrlContent(content)
      continue
   if ps.isNextSkip('<'):
      tagName = ps.readToAfter('>')
      contents = ps.readToAfter('</' + tagName + '>')
      processTag(tagName, contents)
      continue
   # ...stuff here to process if the next characters coming from the
   #    stream weren't '<URL>' or '<'...

> I could handle it if I could switch scanners mid-stream (for URL elements)
> but Aycock's scanner finishes up before the parser even gets under way!
> Should I scan and then parse (at a high level) and then rescan and reparse
> the URLs? Is there a package that allows me to mix the lexical and
> syntactic levels more?

My library is really aimed at lexical analysis rather than parsing,
although it could be used with a parser -- it is based on C++ code I
wrote for a yylex() function to use with yacc, after I got fed up
with trying to get lex to work.

-- 
Phil Hunt....philh at vision25.demon.co.uk
