[Python-ideas] Proposed convenience functions for re module
Talin
talin at acm.org
Wed Jul 22 18:23:25 CEST 2009
Steven D'Aprano wrote:
> Following the thread "Experiment: Adding "re" to string objects.", I
> would like to propose the addition of two convenience functions to the
> re module:
>
>
> def multimatch(s, *patterns):
> """Do a re.match on s using each pattern in patterns,
> returning the first one to succeed, or None if they all fail."""
> for pattern in patterns:
> m = re.match(pattern, s)
> if m: return m
>
There's a cute trick that you can use to do this that is much more
efficient than testing each regex expression individually:
combined_pattern = "|".join("(%s)" % p for p in patterns)
combined_re = re.compile(combined_pattern)
m = combined_re.match(string)
return m.lastindex
Basically, it combines all of the patterns into a single large regex,
where each pattern is converted into a capturing group. It then returns
match.lastindex, which is the index of the capturing group that matched.
This is very efficient because now all of the patterns are combined into
a single NFA which can prune possibilities very quickly.
This works for up to 99 patterns, which is the limit on the number of
capturing groups that a regex can have.
I use this technique in my Python-based lexer, Reflex:
http://pypi.python.org/pypi/reflex/0.1
Now, if we are talking about convenience functions, what I would really
like to see is a class that wraps a string that allows matches to be
done incrementally, where each successful match consumes the head of the
string, leaving the remainder of the string for the next match.
This can be done very efficiently since the regex functions all take a
start-index parameter. Essentially, the wrapper class would update the
start index each time a successful match was performed.
So something like:
stream = MatchStream(string)
while 1:
m = stream.match(combined_re)
# m is a match object
# Do something with m
Or even an iterator over matches (this assumes that you want to use the
same regex each time, which may not be the case for a parser):
stream = MatchStream(string)
for m in stream.match(combined_re):
# m is a match object
# Do something with m
-- Talin
More information about the Python-ideas
mailing list