[Python-Dev] A standard lexer?
Tim Peters
tim_one@email.msn.com
Sun, 8 Oct 2000 06:32:36 -0400
Blast from the past!
[/F]
> for phrase, action in lexicon:
> p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
[Tim]
> How about instead enhancing existing (?P<name>pattern) notation, to
> set a new match object attribute to name if & when pattern matches?
> Then arbitrary info associated with a named pattern can be gotten at
> via dicts via the pattern name, & the whole mess should be more
> readable.
[/F
Sent: Sunday, July 02, 2000 6:35 PM]
> I just added "lastindex" and "lastgroup" attributes to the match object.
>
> "lastindex" is the integer index of the last matched capturing group,
> "lastgroup" the corresponding name (or None, if the group didn't have
> a name). Both attributes are None if no group was matched.
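(A quick illustration of how those two attributes behave, added here for readers who haven't tried them -- the pattern and test strings are mine, not from the thread:)

```python
import re

# Three alternatives: two named capturing groups, one unnamed.
pat = r"(?P<word>[a-z]+)|(?P<num>\d+)|(!)"

m = re.match(pat, "42")
print(m.lastindex)   # 2 -- <num> is the 2nd capturing group
print(m.lastgroup)   # 'num' -- its name

m = re.match(pat, "!")
print(m.lastindex)   # 3 -- the unnamed 3rd group matched...
print(m.lastgroup)   # None -- ...so there's no name to report
```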
Reviewing this before 2.0 has been on my todo list for 3+ months, and I
finally got to it. Good show! I converted some of my by-hand scanners to
use lastgroup, and like it a whole lot. I know you understand why this is
Good, so here's a simple example of an "after" tokenizer for those who don't
(this one happens to tokenize REXX-like PARSE stmts):
import re
_token = re.compile(r"""
      (?P<space>    \s+)
    | (?P<var>      [a-zA-Z_]\w*)
    | (?P<dontcare> \.)
    | (?P<number>   \d+)
    | (?P<punc>     [-+=()])
    | (?P<string>   " [^"\\\n]* (?: \\. [^"\\\n]*)* "
                  | ' [^'\\\n]* (?: \\. [^'\\\n]*)* '
      )
    """, re.VERBOSE).match
del re
(T_SPACE,
 T_VAR,
 T_DONTCARE,
 T_NUMBER,
 T_PUNC,
 T_STRING,
 T_EOF,
) = range(7)
# For debug output.
_enum2name = ["T_SPACE",
              "T_VAR",
              "T_DONTCARE",
              "T_NUMBER",
              "T_PUNC",
              "T_STRING",
              "T_EOF",
              ]
_group2action = {
    "space":    (T_SPACE,    None),
    "var":      (T_VAR,      None),
    "dontcare": (T_DONTCARE, None),
    "number":   (T_NUMBER,   int),
    "punc":     (T_PUNC,     None),
    "string":   (T_STRING,   eval),
}
class ParseError(Exception):
    # Minimal stand-in: raised when no token pattern matches at offset i.
    pass

def tokenize(s, tokeneater):
    i, n = 0, len(s)
    while i < n:
        m = _token(s, i)
        if not m:
            raise ParseError(s, i)
        group = m.lastgroup
        enum, action = _group2action[group]
        val = m.group(group)
        if action is not None:
            val = action(val)
        tokeneater(enum, val)
        i = m.end()
    tokeneater(T_EOF, None)
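To see the scheme run end-to-end without the rest of the PARSE machinery, here's the same idea boiled down to a standalone snippet; the names and the sample driver are simplified by me, but the technique (named alternatives + a lastgroup-driven dispatch table) is identical:

```python
import re

# Miniature of the scanner above: one verbose alternation of named
# groups, and a table mapping group name -> (token name, action).
_scan = re.compile(r"""
      (?P<space>  \s+)
    | (?P<var>    [a-zA-Z_]\w*)
    | (?P<number> \d+)
    | (?P<punc>   [-+=()])
    """, re.VERBOSE).match

_dispatch = {
    "space":  ("T_SPACE",  None),
    "var":    ("T_VAR",    None),
    "number": ("T_NUMBER", int),
    "punc":   ("T_PUNC",   None),
}

def scan(s, eater):
    i, n = 0, len(s)
    while i < n:
        m = _scan(s, i)
        if not m:
            raise ValueError("lex error in %r at %d" % (s, i))
        # lastgroup tells us which alternative fired -- no if/elif chain.
        name, action = _dispatch[m.lastgroup]
        val = m.group(m.lastgroup)
        if action is not None:
            val = action(val)
        eater(name, val)
        i = m.end()

tokens = []
def collect(name, val):
    if name != "T_SPACE":      # drop whitespace tokens
        tokens.append((name, val))

scan("x = x + 1", collect)
print(tokens)
# [('T_VAR', 'x'), ('T_PUNC', '='), ('T_VAR', 'x'),
#  ('T_PUNC', '+'), ('T_NUMBER', 1)]
```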
The tokenize function here used to be a mass of if/elif stmts trying to
figure out which group had matched. Now it's all table-driven: easier to
write, reuse & maintain, and quicker to boot. +1.
the-aged-may-be-slow-but-they-never-forget<wink>-ly y'rs - tim