[Python-Dev] A standard lexer?

Tim Peters tim_one@email.msn.com
Sun, 8 Oct 2000 06:32:36 -0400


Blast from the past!

[/F]
>         for phrase, action in lexicon:
>             p.append("(?:%s)(?P#%d)" % (phrase, len(p)))

[Tim]
> How about instead enhancing existing (?P<name>pattern) notation, to
> set a new match object attribute to name if & when pattern matches?
> Then arbitrary info associated with a named pattern can be gotten at
> via dicts via the pattern name, & the whole mess should be more
> readable.

[/F
 Sent: Sunday, July 02, 2000 6:35 PM]
> I just added "lastindex" and "lastgroup" attributes to the match object.
>
> "lastindex" is the integer index of the last matched capturing group,
> "lastgroup" the corresponding name (or None, if the group didn't have
> a name).  both attributes are None if no group was matched.
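
For anyone who hasn't played with these yet, a quick illustration -- a
toy pattern of mine, not /F's:

import re
_pat = re.compile(r"(?P<word>[A-Za-z]+)|(?P<num>\d+)|\s+")

m = _pat.match("123")
print("%s %s" % (m.lastindex, m.lastgroup))     # 2 num

m = _pat.match("   ")
print("%s %s" % (m.lastindex, m.lastgroup))     # None None -- the match
                                                # succeeded via the unnamed
                                                # \s+ branch, so no capturing
                                                # group was involved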

Reviewing this before 2.0 has been on my todo list for 3+ months, and I
finally got to it.  Good show!  I converted some of my by-hand scanners to
use lastgroup, and I like it a whole lot.  I know you understand why this is
Good, so here's a simple example of an "after" tokenizer for those who don't
(this one happens to tokenize REXX-like PARSE stmts):

import re
_token = re.compile(r"""
        (?P<space> \s+)
    |   (?P<var> [a-zA-Z_]\w*)
    |   (?P<dontcare> \.)
    |   (?P<number> \d+)
    |   (?P<punc> [-+=()])
    |   (?P<string> " [^"\\\n]* (?: \\. [^"\\\n]*)* "
        |           ' [^'\\\n]* (?: \\. [^'\\\n]*)* '
        )
""", re.VERBOSE).match
del re          # only the bound .match method is needed from here on

(T_SPACE,
 T_VAR,
 T_DONTCARE,
 T_NUMBER,
 T_PUNC,
 T_STRING,
 T_EOF,
)           = range(7)

# For debug output.
_enum2name = ["T_SPACE",
              "T_VAR",
              "T_DONTCARE",
              "T_NUMBER",
              "T_PUNC",
              "T_STRING",
              "T_EOF",
             ]

_group2action = {
    "space":    (T_SPACE, None),
    "var":      (T_VAR, None),
    "dontcare": (T_DONTCARE, None),
    "number":   (T_NUMBER, int),
    "punc":     (T_PUNC, None),
    "string":   (T_STRING, eval),
}

class ParseError(Exception):   # raised when no token pattern matches
    pass

def tokenize(s, tokeneater):
    i, n = 0, len(s)
    while i < n:
        m = _token(s, i)
        if not m:
            raise ParseError(s, i)
        group = m.lastgroup
        enum, action = _group2action[group]
        val = m.group(group)
        if action is not None:
            val = action(val)
        tokeneater(enum, val)
        i = m.end()
    tokeneater(T_EOF, None)
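
To see it run, here's a throwaway tokeneater (not part of the real module,
just for illustration):

def _show(enum, val):
    print("%-10s %r" % (_enum2name[enum], val))

tokenize('parse arg first "," rest .', _show)

which prints T_VAR 'parse', T_SPACE ' ', T_VAR 'arg', ..., T_STRING ',',
T_DONTCARE '.', and finally T_EOF None.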

The tokenize function here used to be a mass of if/elif stmts trying to
figure out which group had matched.  Now it's all table-driven:  easier to
write, reuse & maintain, and quicker to boot.  +1.
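
For contrast, the "before" shape was roughly this (reconstructed from memory
of the general pattern, not a paste of the old code):

        if m.group("space") is not None:
            tokeneater(T_SPACE, m.group("space"))
        elif m.group("var") is not None:
            tokeneater(T_VAR, m.group("var"))
        elif m.group("number") is not None:
            tokeneater(T_NUMBER, int(m.group("number")))
        elif m.group("string") is not None:
            tokeneater(T_STRING, eval(m.group("string")))
        # ... and so on, one branch per token class ...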

the-aged-may-be-slow-but-they-never-forget<wink>-ly y'rs  - tim