[Types-sig] RE: [String-SIG] Python parser in Python?

Mon, 20 Dec 1999 17:18:51 -0500

I can only make time for one easy one, and ... lessee ... Paul wins!

[Tim]
> John Aycock's ... framework comes with a Python grammar.

[Paul Prescod]
> It depends on Python's built-in lexer:
>
> #
> #  Why would I write my own when GvR maintains this one?
> #
> import tokenize
>
> Doesn't that remove the possibility for new keywords?

I'm going to respond a little more than John did, because tokenize.py has a
funky API that takes some getting used to.  Run the attached, and things
will be clearer.  tokenize.py doesn't know about keywords per se; all
alphanumeric names (whether keyword or identifier) come back with the NAME
token type.  Deciding what's a keyword is a post-lexing decision (i.e.,
that's up to tokenize's caller).

So unless the Types-SIG decides to prototype syntax unreasonably different
from current Python's, the only likely way in which tokenize.py may need to
be altered is in extending its Operator regexp.  For example, the "->" in
the attached is tokenized as two distinct OP tokens, "-" and ">".  You can
easily live with that by defining a *grammar* production to recognize that
pair, but then you can't stop e.g. "-      >" from getting treated as "->"
too (tokenize suppresses intraline whitespace).  Good enough for a
prototype!  Note that "-" followed by ">" is never legit Python today.

Subtleties for tokenize newbies:  a NEWLINE token terminates a stmt.  An NL
token is produced for an *intra*-stmt newline (NL does not terminate a stmt;
you can usually ignore NL, and COMMENT, tokens).  Changes in nesting level
are signaled by INDENT and DEDENT tokens.  Watch out for files whose final
line is indented but doesn't end with \n (that's the only time you'll see a
sequence of DEDENT tokens not immediately preceded by a NEWLINE token); Mark
Hammond has no other kind of file <wink>.

I'll be back next year, if not next week.  Americans should leave cookies
and milk out for Santa and his reindeer; people in other countries should
set deadly traps for evil goat gods -- or whatever other foolishness they
believe in.

and-remember-that-whoever-writes-code-first-wins<wink>-l y'rs  - tim

import tokenize

class TokDemo:
    def __init__(self, file):
        self.f = file

    def run(self):
        tokenize.tokenize(self.f.readline, self.gobbler)

    def gobbler(self, ttype, token, (sline, scol), (eline, ecol), line):
        print tokenize.tok_name[ttype], `token`

example = """
    def rootlist(n: Int, r: Real) -> [Real]:
        decl var result: [Real]
        result = []
        decl var i: Int
        for i in range(n):
            result.append(i ** (1/r))
        return result
"""

import StringIO
d = TokDemo(StringIO.StringIO(example))
d.run()