
Steven D'Aprano wrote:
Following the thread "Experiment: Adding "re" to string objects.", I would like to propose the addition of two convenience functions to the re module:
def multimatch(s, *patterns): """Do a re.match on s using each pattern in patterns, returning the first one to succeed, or None if they all fail.""" for pattern in patterns: m = re.match(pattern, s) if m: return m
There's a cute trick that you can use to do this that is much more efficient than testing each regex expression individually: combined_pattern = "|".join("(%s)" % p for p in patterns) combined_re = re.compile(combined_pattern) m = combined_re.match(string) return m.lastindex Basically, it combines all of the patterns into a single large regex, where each pattern is converted into a capturing group. It then returns match.lastindex, which is the index of the capturing group that matched. This is very efficient because now all of the patterns are combined into a single NFA which can prune possibilities very quickly. This works for up to 99 patterns, which is the limit on the number of capturing groups that a regex can have. I use this technique in my Python-based lexer, Reflex: http://pypi.python.org/pypi/reflex/0.1 Now, if we are talking about convenience functions, what I would really like to see is a class that wraps a string that allows matches to be done incrementally, where each successful match consumes the head of the string, leaving the remainder of the string for the next match. This can be done very efficiently since the regex functions all take a start-index parameter. Essentially, the wrapper class would update the start index each time a successful match was performed. So something like: stream = MatchStream(string) while 1: m = stream.match(combined_re) # m is a match object # Do something with m Or even an iterator over matches (this assumes that you want to use the same regex each time, which may not be the case for a parser): stream = MatchStream(string) for m in stream.match(combined_re): # m is a match object # Do something with m -- Talin