Proposed convenience functions for re module

Following the thread "Experiment: Adding "re" to string objects.", I would like to propose the addition of two convenience functions to the re module:

    def multimatch(s, *patterns):
        """Do a re.match on s using each pattern in patterns,
        returning the first one to succeed, or None if they all fail."""
        for pattern in patterns:
            m = re.match(pattern, s)
            if m:
                return m

    def multisearch(s, *patterns):
        """Do a re.search on s using each pattern in patterns,
        returning the first one to succeed, or None if they all fail."""
        for pattern in patterns:
            m = re.search(pattern, s)
            if m:
                return m

The rationale is to make the following idiom easier:

    m = re.match(pattern1, s)
    if not m:
        m = re.match(pattern2, s)
    if not m:
        m = re.match(pattern3, s)
    if not m:
        m = re.match(pattern4, s)
    if m:
        m.group()

which will become:

    m = re.multimatch(s, pattern1, pattern2, pattern3, pattern4)
    if m:
        m.group()

Is there any support for, or are there any objections to, this proposal? Any comments?

-- Steven D'Aprano

Steven D'Aprano wrote:
One of the needs I've run across is to enable the program user (possibly a non-programmer) to do logical searches on data. It would be nice if the search patterns specified by the program user could be used directly by the functions.

Search functions of this type would take patterns that are more like what you would use for Google or Yahoo searches instead of the more complex language re requires.

Ron

On Wed, 22 Jul 2009 06:02:54 pm Ron Adam wrote:
I'm not sure if I understand this correctly. Perhaps you could give an example or two?

Also, please don't overload my simple little proposal with a multitude of new functionality. My proposal is only meant to be a lightweight convenience function. Additional functionality probably belongs as a different function, maybe even a different module.

-- Steven D'Aprano

Steven D'Aprano wrote:
Yes, it would be a different module and not added directly to the re module. While you are thinking of simplifying re for programmers, I'm thinking of simplified searches for users. A different target and purpose. I think your functions would make this idea easier to do.

It would be nice if we could do simple logical searches such as:

    [word1 word2]              ; get results with either word1 or word2
    [+word1 +word2]            ; get results with both word1 and word2
    [word1 -word2]             ; get results with word1 and not with word2
    ["word one" "word two"]    ; use quotes to search for phrases

And possibly use '*' and '?' as simple wild cards, but keep it easy to use and simple. More complex searches should use the re module directly.

This would act as a filter for lists and would be suitable for adding a *simple* user search capability to many scripts and applications.

An example would be to enhance pydoc's search of the summary lines. Currently if you type "modules key", if the key is multiple words, it only searches on the first word. You cannot do searches on multiple words or exclude results with certain words.

While we could allow regular expression input to work, for many applications it is overkill and it is too complex for many users. For example, I would not like to try to teach my parents all the subtleties of regular expressions when they are struggling to understand a lot more basic things. They don't want to learn how to program computers, they just want to get a recipe that has [+chicken +"tomato sauce" -onions].

Ron
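To make the search syntax above concrete, here is a minimal sketch of such a filter. It is only an illustration under stated assumptions: the names parse_query, matches and simple_filter are hypothetical, matching is a case-insensitive substring test, the '*' and '?' wild cards are left out, and shlex.split is used to handle the quoted phrases (so a query containing an unbalanced quote or apostrophe would raise ValueError).

```python
import shlex


def parse_query(query):
    """Split a query into (required, excluded, optional) lowercase terms."""
    required, excluded, optional = [], [], []
    # shlex.split keeps a quoted phrase such as +"tomato sauce" as one token.
    for token in shlex.split(query):
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, excluded, optional


def matches(text, query):
    """Return True if text satisfies the query (substring matching)."""
    required, excluded, optional = parse_query(query)
    text = text.lower()
    if any(term in text for term in excluded):
        return False
    if not all(term in text for term in required):
        return False
    # Bare terms are OR-ed; with no bare terms the +/- terms alone decide.
    return not optional or any(term in text for term in optional)


def simple_filter(items, query):
    """Keep only the items that satisfy the query."""
    return [item for item in items if matches(item, query)]


recipes = [
    "Chicken in tomato sauce",
    "Chicken with tomato sauce and onions",
    "Beef stew",
]
wanted = simple_filter(recipes, '+chicken +"tomato sauce" -onions')
```

With the sample list, only the first recipe survives the [+chicken +"tomato sauce" -onions] query. Substring matching is the simplest possible choice here; a word-boundary test would be a natural refinement.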

On Thu, Jul 23, 2009, Ron Adam wrote:
This sounds like a *great* addition to PyPI.... ;-)

(That is, something like this is unlikely to make it into Python unless there's code that has seen uptake in the community.)

-- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

"The volume of a pizza of thickness 'a' and radius 'z' is given by pi*z*z*a"

Steven D'Aprano wrote:
There's a cute trick that you can use to do this that is much more efficient than testing each regex expression individually:

    combined_pattern = "|".join("(%s)" % p for p in patterns)
    combined_re = re.compile(combined_pattern)
    m = combined_re.match(string)
    return m.lastindex

Basically, it combines all of the patterns into a single large regex, where each pattern is converted into a capturing group. It then returns match.lastindex, which is the index of the capturing group that matched.

This is very efficient because now all of the patterns are combined into a single NFA which can prune possibilities very quickly.

This works for up to 99 patterns, which is the limit on the number of capturing groups that a regex can have.

I use this technique in my Python-based lexer, Reflex:

http://pypi.python.org/pypi/reflex/0.1

Now, if we are talking about convenience functions, what I would really like to see is a class that wraps a string that allows matches to be done incrementally, where each successful match consumes the head of the string, leaving the remainder of the string for the next match.

This can be done very efficiently since the regex functions all take a start-index parameter. Essentially, the wrapper class would update the start index each time a successful match was performed.

So something like:

    stream = MatchStream(string)
    while 1:
        m = stream.match(combined_re)
        # m is a match object
        # Do something with m

Or even an iterator over matches (this assumes that you want to use the same regex each time, which may not be the case for a parser):

    stream = MatchStream(string)
    for m in stream.match(combined_re):
        # m is a match object
        # Do something with m

-- Talin
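A rough sketch of how the two ideas above could fit together. MatchStream does not exist in the standard library; the class below, its at_end method and the tokenize helper are hypothetical names invented for illustration. The sketch assumes the individual patterns contain no capturing groups of their own (otherwise lastindex no longer maps to a pattern number), and it treats a zero-length match as a failure so that the stream always makes progress.

```python
import re


class MatchStream:
    """Hypothetical wrapper: each successful match consumes the head of the string."""

    def __init__(self, string):
        self.string = string
        self.pos = 0  # start index for the next match

    def match(self, pattern):
        """Match a compiled pattern at the current position and advance on success."""
        m = pattern.match(self.string, self.pos)
        if m and m.end() > self.pos:
            self.pos = m.end()
            return m
        return None

    def at_end(self):
        return self.pos >= len(self.string)


patterns = [r"\d+", r"[A-Za-z_]\w*", r"\s+"]
# One capturing group per pattern, so m.lastindex is the 1-based pattern number.
combined_re = re.compile("|".join("(%s)" % p for p in patterns))


def tokenize(text):
    """Return (pattern_number, matched_text) pairs for the whole input."""
    stream = MatchStream(text)
    tokens = []
    while not stream.at_end():
        m = stream.match(combined_re)
        if m is None:
            raise ValueError("no pattern matches at position %d" % stream.pos)
        tokens.append((m.lastindex, m.group()))
    return tokens


result = tokenize("x1 42")
```

For the input "x1 42" this yields pattern 2 for "x1", pattern 3 for the space and pattern 1 for "42". The only library feature relied on is the optional pos argument of the compiled pattern's match method.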

Talin schrieb:
You might be interested in the undocumented re.Scanner class :)

Georg

-- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
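For readers who have not come across it, a small sketch of how re.Scanner can be used. Since the class is undocumented, treat this as an illustration of its current behaviour rather than a stable API; the token names "INT" and "NAME" are made up for the example.

```python
import re

# Each lexicon entry is (pattern, action). The action is called with the
# scanner and the matched text; an action of None discards the match.
scanner = re.Scanner([
    (r"\d+", lambda s, tok: ("INT", int(tok))),
    (r"[A-Za-z_]\w*", lambda s, tok: ("NAME", tok)),
    (r"\s+", None),
])

# scan() returns the list of action results plus the unconsumed remainder.
tokens, remainder = scanner.scan("foo 42 bar")
partial, leftover = scanner.scan("foo ?")
```

Scanning stops at the first position where no pattern matches, and whatever is left is returned as the remainder instead of raising an error.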

On 2009-07-22 19:49, Greg Ewing wrote:
http://mail.python.org/pipermail/python-dev/2003-April/035075.html

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

2009/7/23 Robert Kern <robert.kern@gmail.com>:
Question: Is there any reason (other than lack of time) why it's undocumented? I'd be willing to write some documentation, but only if it would stand a chance of being accepted - this isn't an itch of mine, so I don't want to spend ages arguing over whether the class should be documented.

The source code says "experimental stuff (see python-dev discussions for details)". I've not searched the python-dev archives yet, but it seems to me that it'll never be anything other than experimental if people don't know it's there and try it out...

Paul.

On 2009-07-23 11:50, Paul Moore wrote:
http://mail.python.org/pipermail/python-dev/2003-April/035070.html

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On 2009-07-23 16:48, Gerald Britton wrote:
I'm pretty sure that no one has worked on it.

-- Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

2009/7/22 Steven D'Aprano <steve@pearwood.info>:
I don't like it very much because it would only work for uncompiled patterns. All functions in re have a RegexObject counterpart, but multisearch() and multimatch() obviously would not.

For the quoted example I'd usually try to create one regex that matches all four patterns, or use a loop:

    for pat in (pattern1, pattern2, pattern3, pattern4):
        m = re.match(pat, s)
        if m:
            m.group()
            break

-- mvh Björn

On Thu, 23 Jul 2009 04:29:31 am BJörn Lindqvist wrote:
That's incorrect -- they accept pre-compiled regexes as well as strings.
Apart from being in a function, my proposal (which you claim to dislike) is virtually identical to that code (which you say you use).

-- Steven D'Aprano
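On the point about pre-compiled regexes: the module-level re.match and re.search accept an already compiled pattern object as their first argument (as long as no extra flags are passed), so a function written as in the original proposal works with a mix of both. A short check, with the sample patterns chosen purely for illustration:

```python
import re


def multimatch(s, *patterns):
    """Return the first successful re.match of s against patterns, or None."""
    for pattern in patterns:
        # re.match accepts either a pattern string or a compiled pattern.
        m = re.match(pattern, s)
        if m:
            return m
    return None


word = re.compile(r"[a-z]+")  # pre-compiled pattern object

number_hit = multimatch("42 apples", word, r"\d+")
word_hit = multimatch("apples 42", word, r"\d+")
no_hit = multimatch("!!", word, r"\d+")
```

Here the first call falls through to the string pattern, the second succeeds on the compiled one, and the third returns None.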

On Wed, Jul 22, 2009 at 1:00 AM, Steven D'Aprano<steve@pearwood.info> wrote:
Steven, could you show some examples of real(ish)-world use-cases for one or both of these functions? Preferably including the code that might directly follow a multimatch or multisearch call.

It's probably because I haven't used regexes widely enough, but in all the potential examples I can come up with, either (1) the regexes are similar enough that they can be refactored into a single regex (e.g., just concatenated with '|'), or (2) they're distinct enough that each regex needs its own handling, so that the multimatch/multisearch would need to be followed by a multiway 'if/elif/else' anyway; in this case, it seems that little is gained.

-- Mark

On Thu, 23 Jul 2009 07:38:11 am Mark Dickinson wrote:
I'm afraid that I don't use regexes anywhere near enough to champion this proposal in the face of serious opposition, or even skepticism. If this isn't a simple enough "no-brainer", then I'm going to have to pass the baton onto somebody else (assuming anyone actually likes this idea). This idea came about from the thread started by Sean Reifschneider, proposing adding regexes to strings. I thought (and Sean seemed to agree) that these convenience functions would solve his primary use-case. So this proposal isn't scratching an itch I have.
These are both reasonable approaches. This proposal isn't supposed to solve every multiple-regex-handling problem.

So far support for this has been lukewarm. If anyone really likes this idea, please speak up, otherwise I'll let it drop.

-- Steven D'Aprano

22-07-2009, 02:00 Steven D'Aprano <steve@pearwood.info>:
It sounds nice. But why not to use simply:

    m = re.match(s, '|'.join(pattern1, pattern2, pattern3, pattern4))

And if we want the feature anyway, I'd prefer MRAB's:
***

But if we are talking about convenience functions in the re module, it would IMHO be very nice to have functions such as:

    import collections
    import re

    def matchgrouping(pattern, string, flags=0, default=None):
        """Do a re.match on string using pattern, returning dict
        containing groups which can be retrieved by index or by name."""
        match = re.match(pattern, string, flags)
        # Missing keys (and every key when nothing matched) yield default.
        groups = collections.defaultdict(lambda: default)
        if match:
            # Number groups from 1 so that groups[1] equals match.group(1).
            groups.update(enumerate(match.groups(default), 1))
            groups.update(match.groupdict(default))
        return groups

(Plus the analogous function for searching, plus two analogous methods of RegexObject instances.)

* Then e.g. -- instead of:

    m = re.search(pattern, s)
    if m:
        first_group = m.group(1)
        surname = m.group('surname')
    else:
        first_group = None
        surname = None

-- we could write simply:

    m = re.matchgrouping(pattern, s)
    first_group = m[1]
    surname = m['surname']

* And e.g. -- instead of:

    withip = log_re.match(logline)
    if withip and withip.group('ip_addr'):
        iplog.append(logline)

-- we could write simply:

    if log_re.matchgrouping(logline)['ip_addr']:
        iplog.append(logline)

What do you think about it?

*j

-- Jan Kaliszewski (zuo) <zuo@chopin.edu.pl>

Jan Kaliszewski <zuo@chopin.edu.pl> wrote:
It sounds nice. But why not to use simply:
m = re.match(s, '|'.join(pattern1, pattern2, pattern3, pattern4))
Sorry, I meant of course:

    m = re.match('|'.join([pattern1, pattern2, pattern3, pattern4]), s)

***

I meant: "...returning collections.defaultdict..." (as you can see in the code following).

Regards,
*j

-- Jan Kaliszewski <zuo@chopin.edu.pl>
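One thing worth keeping in mind when replacing the proposed functions with a single '|'-joined regex: for re.match the two approaches pick the same alternative, but for re.search they can differ. A loop over patterns prefers the earliest pattern, while a joined regex prefers the earliest position in the string. A small sketch of the difference, with the sample patterns and string chosen for illustration and each alternative wrapped in a non-capturing group so that group numbering is unaffected:

```python
import re


def multisearch(s, *patterns):
    """Return the first successful re.search of s against patterns, or None."""
    for pattern in patterns:
        m = re.search(pattern, s)
        if m:
            return m
    return None


patterns = [r"a", r"b"]
# str.join takes a single iterable of strings, hence the generator expression.
combined = re.compile("|".join("(?:%s)" % p for p in patterns))

s = "xx b a"
by_pattern_order = multisearch(s, *patterns)  # tries r"a" over the whole string first
by_position = combined.search(s)              # finds the leftmost match of either
```

In this example the loop returns the "a" near the end of the string, whereas the joined regex returns the "b" that occurs earlier. Which behaviour is wanted depends on the use case.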

participants (14)
- Aahz
- BJörn Lindqvist
- Gabriel Genellina
- Georg Brandl
- Gerald Britton
- Greg Ewing
- Jan Kaliszewski
- Mark Dickinson
- MRAB
- Paul Moore
- Robert Kern
- Ron Adam
- Steven D'Aprano
- Talin