[Paul Prescod]
As an aside: I would be pumped about getting a generic lexer into the Python distribution.
[Fredrik Lundh]
how about this quick and dirty proposal:
- add a new primitive to SRE: (?P#n), where n is a small integer. this primitive sets the match object's "index" variable to n when the engine stumbles upon it.
Note that the lack of "something like this" is one of the real barriers to speeding SPARK's lexing: the speed of a SPARK lexer now (well, last I looked into this) can be wildly dependent on the order in which you define your lexing methods, partly because there's no way to figure out which lexing method matched without iterating through all the groups to find the first that isn't None. The same kind of irritating iteration is needed in IDLE and pyclbr too (disguised as unrolled if/elif/elif/... chains), and in tokenize.py (there *really* disguised in a convoluted way, by doing more string tests on the matched substring to *infer* which of the regexp pattern chunks must have matched).

OTOH, arbitrary small integers are not Pythonic. Your example *generates* them in order to guarantee they're unique, which is a bad sign (it implies users can't do this safely by hand, and I believe that's the truth of it too):
for phrase, action in lexicon:
    p.append("(?:%s)(?P#%d)" % (phrase, len(p)))
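For concreteness, here's a minimal sketch of the group-scanning irritation described above (the lexicon and token names are illustrative, not taken from SPARK): when the alternatives are combined into one big pattern, finding out which one matched means walking the groups until you hit one that isn't None.

```python
import re

# Toy lexicon: (pattern, token-name) pairs -- illustrative only.
lexicon = [(r"\d+", "NUMBER"), (r"[a-zA-Z_]\w*", "NAME"), (r"\s+", "WS")]

# Combine into a single alternation; each alternative gets its own group.
combined = re.compile("|".join("(%s)" % phrase for phrase, _ in lexicon))

def which_matched(m):
    # The irritating scan: walk the groups to find the first non-None one.
    for i in range(1, len(lexicon) + 1):
        if m.group(i) is not None:
            return lexicon[i - 1][1]

m = combined.match("42 foo")
# which_matched(m) -> "NUMBER"
```

Every token matched forces a linear scan over all the alternatives, which is why the order of definitions matters so much.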
How about instead enhancing the existing (?P<name>pattern) notation, to set a new match object attribute to name if & when pattern matches? Then arbitrary info associated with a named pattern can be gotten at via dicts keyed by the pattern name, & the whole mess should be more readable.

On the third hand, I'm really loath to add more gimmicks to stinking regexps. But, on the fourth hand, no alternative yet has proven popular enough to move away from those suckers.

if-you-can't-get-a-new-car-at-least-tune-up-the-old-one-ly y'rs - tim
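A sketch of that named-pattern idea (token names again illustrative): with (?P<name>...) alternatives, the match object's lastgroup attribute, which today's re module does provide, names the alternative that matched, so per-name actions can live in an ordinary dict and no group scan is needed.

```python
import re

# Named alternatives: the group name itself carries the token identity.
lexicon = [("NUMBER", r"\d+"), ("NAME", r"[a-zA-Z_]\w*"), ("WS", r"\s+")]
combined = re.compile("|".join("(?P<%s>%s)" % pair for pair in lexicon))

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = combined.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character at %d" % pos)
        # lastgroup names the alternative that matched -- no scan over groups.
        yield m.lastgroup, m.group()
        pos = m.end()

# list(tokenize("x1 42")) -> [("NAME", "x1"), ("WS", " "), ("NUMBER", "42")]
```

Since the name comes straight off the match object, arbitrary per-token info (actions, precedences, whatever) can be looked up in a dict keyed by that name, which is exactly the readability win argued for above.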