[Python-Dev] Re: CML2 compiler slowness

Tim Peters tim.one@home.com
Mon, 12 Mar 2001 20:14:34 -0500


FYI, Fredrik's regexp engine also supports two undocumented match-object
attributes that could be used to speed SPARK lexing, especially when there
are many token types:  they give a direct index to the matching alternative
instead of making you do a linear search for it, and that can add up to a
major win.  Simple example below.

Python-Dev, this has been in there since 2.0 (1.6?  unsure).  I've been using
it happily all along.  If Fredrik is agreeable, I'd like to see this
documented for 2.1, i.e. made an officially supported part of Python's regexp
facilities.

-----Original Message-----
From: Tim Peters [mailto:tim.one@home.com]
Sent: Monday, March 12, 2001 6:37 PM
To: python-list@python.org
Subject: RE: Help with Regular Expressions

[Raymond Hettinger]
> Is there an idiom for how to use regular expressions for lexing?
>
> My attempt below is unsatisfactory because it has to filter the
> entire match group dictionary to find out which token caused
> the match. This approach isn't scalable because every token
> match will require a loop over all possible token types.
>
> I've fiddled with this one for hours and can't seem to find a
> direct way to get a group dictionary that contains only matches.

That's because there isn't a direct way; best you can do now is seek to order
your alternatives most-likely first (which is a good idea anyway, given the
way the engine works).

If you peek inside sre.py (2.0 or later), you'll find an undocumented class
Scanner that uses the undocumented .lastindex attribute of match objects.
Someday I hope this will be the basis for solving exactly the problem you're
facing.  There's also an undocumented .lastgroup attribute:

Python 2.1b1 (#11, Mar  2 2001, 11:23:29) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
IDLE 0.6 -- press F1 for help
>>> import re
>>> pat = re.compile(r"(?P<a>aa)|(?P<b>bb)")
>>> m = pat.search("baab")
>>> m.lastindex  # number of the group that matched
1
>>> m.lastgroup  # name of group that matched
'a'
>>> m = pat.search("ababba")
>>> m.lastindex
2
>>> m.lastgroup
'b'
>>>

They're not documented yet because we're not yet sure whether we want to make
them permanent parts of the language.  So feel free to play, but don't count
on them staying around forever.  If you like them, drop a note to the effbot
saying so.
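
A minimal sketch of how .lastgroup could drive a lexer, written for a
current Python rather than 2.1 (it uses a generator).  TOKEN_SPEC, MASTER
and tokenize() are names made up for illustration, not anything taken from
sre.py or SPARK:

```python
import re

# Hypothetical token table for illustration: (name, pattern) pairs.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[-+*/=()]"),
    ("SKIP", r"\s+"),
]

# One named group per token type, joined into a single alternation.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Yield (token_type, lexeme) pairs; raise ValueError on unlexable input."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(
                "unexpected character %r at position %d" % (text[pos], pos)
            )
        # Name of the alternative that matched: no loop over groupdict().
        kind = m.lastgroup
        if kind != "SKIP":
            yield kind, m.group()
        pos = m.end()
```

With this table, list(tokenize("x = 42")) gives
[('NAME', 'x'), ('OP', '='), ('NUMBER', '42')].  Finding the token type no
longer means looping over every group, though ordering the alternatives
most-likely first still helps the engine itself.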

for-more-docs-read-the-source-code-ly y'rs  - tim