Allow regex group name redefinitions
Hello everybody,
I've got a suggestion for the developers of the standard re module: please consider allowing match group name redefinitions, especially in alternatives. While you may not see the point at first glance, let me try to justify it with a real-world example from my practice:
Imagine a company that uses certain codes for its products and product components (unfortunately, I'm not at liberty to disclose their true nature, but bear with me). Let's say that they may have the following form:
r"(?P<type>AB|C|D)[- ](?P<prefix>[A-Z])?(?P<number>\d+)(?P<postfix>[A-Z])?"
So far so good. But now imagine that one particular type of code has a slightly different syntax:
r"(?P<type>E)[- ](?P<prefix>[A-Za-z])?(?P<number>\d+)[- ](?P<postfix>[A-Za-z])?"
As you can see, the prefix and postfix may be lowercase for code type E and, moreover, a space or dash is required before the postfix. If I merged the definitions, I'd have to allow that syntax for the AB, C and D code types as well, but that would be incorrect and would require post-match checks.
Ideally, I'd like to be able to define the regex as an alternative:
r"(?P<type>AB|C|D)[- ](?P<prefix>[A-Z])?(?P<number>\d+)(?P<postfix>[A-Z])?|(?P<type>E)[- ](?P<prefix>[A-Za-z])?(?P<number>\d+)[- ](?P<postfix>[A-Za-z])?"
I can't, of course: compiling that regex fails with re.error: redefinition of group name.
But is that really a problem, especially in such alternatives? If you imagine the regex as an FSA, the code type branches into completely independent sub-trees of the automaton's state transitions. There's no problem with efficiency; the regex might look a bit complex, but the matching is perfectly efficient, definitely more so than matching multiple expressions. Redefining match group names is IMO technically perfectly possible; note that in such alternatives, re-assignments won't really happen. And finally, even if they did happen, what's the problem with that? It might be a logical error in the regex definition, of course, but that's generally the programmer's lookout...
So what do you think? If match group name redefinition were allowed, I could just match a single regex, get the match group dict and read out the parsed parts of the codes by name, nice and easy. Currently, my two choices are:
1/ Use uniquely named groups, which requires a post-match consolidation of the group names, or
2/ Match multiple regular expressions, which is unnecessary.
Therefore, I ask you to reconsider issuing the error, which IMO is redundant and unnecessarily limits a justified use case. Also note that doing so won't break any old code: anything that worked before will continue to work with unchanged semantics, so such a change would be perfectly safe.
Thanks,
Best Regards
vasek
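For readers who want to try this out, here is a minimal sketch of the situation described in the message: the combined pattern is rejected by the standard re module, and the two workarounds the author lists still produce the same result. The helper names (compiles, parse_code_unique_names, parse_code_sequential) and the _1/_2 group suffixes are hypothetical and chosen only for illustration. Using fullmatch is also an assumption, since the message does not say how the codes are anchored.

```python
import re

GENERIC = r"(?P<type>AB|C|D)[- ](?P<prefix>[A-Z])?(?P<number>\d+)(?P<postfix>[A-Z])?"
TYPE_E = r"(?P<type>E)[- ](?P<prefix>[A-Za-z])?(?P<number>\d+)[- ](?P<postfix>[A-Za-z])?"


def compiles(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# Workaround 1: unique group names per alternative, consolidated after matching.
UNIQUE_NAMES = re.compile(
    r"(?P<type_1>AB|C|D)[- ](?P<prefix_1>[A-Z])?(?P<number_1>\d+)(?P<postfix_1>[A-Z])?"
    r"|(?P<type_2>E)[- ](?P<prefix_2>[A-Za-z])?(?P<number_2>\d+)[- ](?P<postfix_2>[A-Za-z])?"
)


def parse_code_unique_names(text):
    match = UNIQUE_NAMES.fullmatch(text)
    if match is None:
        return None
    merged = {}
    for name, value in match.groupdict().items():
        base = name.rsplit("_", 1)[0]  # "prefix_2" -> "prefix"
        # Groups of the alternative that did not match are all None,
        # so a non-None value always wins.
        if value is not None or base not in merged:
            merged[base] = value
    return merged


# Workaround 2: keep the shared names, but try the patterns one by one.
SEPARATE = [re.compile(GENERIC), re.compile(TYPE_E)]


def parse_code_sequential(text):
    for pattern in SEPARATE:
        match = pattern.fullmatch(text)
        if match is not None:
            return match.groupdict()
    return None


if __name__ == "__main__":
    print(compiles(GENERIC + "|" + TYPE_E))  # the combined pattern is rejected
    print(parse_code_unique_names("AB-X12Y"))
    print(parse_code_sequential("E a7-b"))
```

Both helpers return a plain dict keyed by type, prefix, number and postfix, which is what a single pattern with repeated group names would give directly.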
On 2 Oct 2021, at 10:27, vencik@razdva.cz wrote:
Hello everybody,
I've got a suggestion for the developers of the standard re module: please consider allowing match group name redefinitions, especially in alternatives. While you may not see the point at first glance, let me try to justify it with a real-world example from my practice:
Imagine a company that uses certain codes for its products and product components (unfortunately, I'm not at liberty to disclose their true nature, but bear with me). Let's say that they may have the following form:
r"(?P<type>AB|C|D)[- ](?P<prefix>[A-Z])?(?P<number>\d+)(?P<postfix>[A-Z])?"
So far so good. But now imagine that one particular type of code has a slightly different syntax:
r"(?P<type>E)[- ](?P<prefix>[A-Za-z])?(?P<number>\d+)[- ](?P<postfix>[A-Za-z])?"
As you can see, the prefix and postfix may be lowercase for code type E and, moreover, a space or dash is required before the postfix. If I merged the definitions, I'd have to allow that syntax for the AB, C and D code types as well, but that would be incorrect and would require post-match checks.
Ideally, I'd like to be able to define the regex as an alternative:
r"(?P<type>AB|C|D)[- ](?P<prefix>[A-Z])?(?P<number>\d+)(?P<postfix>[A-Z])?|(?P<type>E)[- ](?P<prefix>[A-Za-z])?(?P<number>\d+)[- ](?P<postfix>[A-Za-z])?"
I can't, of course: compiling that regex fails with re.error: redefinition of group name.
But is that really a problem, especially in such alternatives? If you imagine the regex as an FSA, the code type branches into completely independent sub-trees of the automaton's state transitions. There's no problem with efficiency; the regex might look a bit complex, but the matching is perfectly efficient, definitely more so than matching multiple expressions. Redefining match group names is IMO technically perfectly possible; note that in such alternatives, re-assignments won't really happen. And finally, even if they did happen, what's the problem with that? It might be a logical error in the regex definition, of course, but that's generally the programmer's lookout...
So what do you think? If match group name redefinition were allowed, I could just match a single regex, get the match group dict and read out the parsed parts of the codes by name, nice and easy. Currently, my two choices are:
1/ Use uniquely named groups, which requires a post-match consolidation of the group names, or
2/ Match multiple regular expressions, which is unnecessary.
Therefore, I ask you to reconsider issuing the error, which IMO is redundant and unnecessarily limits a justified use case. Also note that doing so won't break any old code: anything that worked before will continue to work with unchanged semantics, so such a change would be perfectly safe.
Faced with this problem, I would write a parser for the product codes that understands the syntax and breaks each code into pieces that make sense. I would not use regex in the parser.
Barry
Thanks,
Best Regards
vasek
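A regex-free parser along the lines Barry suggests might look like the following sketch. The function name parse_product_code and the dict it returns are hypothetical; the grammar is only what can be read off the two patterns in the original message, and digits are restricted to ASCII here (unlike \d, which also matches other Unicode digits).

```python
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LETTERS = UPPER + UPPER.lower()
DIGITS = "0123456789"
SEPARATORS = "- "


def parse_product_code(text):
    """Split a product code into its parts, or return None if it is invalid."""
    # 1. Code type: one of the known type markers at the start.
    for code_type in ("AB", "C", "D", "E"):
        if text.startswith(code_type):
            break
    else:
        return None
    pos = len(code_type)
    is_type_e = code_type == "E"
    # Type E also allows lowercase prefix and postfix letters.
    allowed_letters = LETTERS if is_type_e else UPPER

    # 2. Mandatory separator (dash or space) after the type.
    if pos >= len(text) or text[pos] not in SEPARATORS:
        return None
    pos += 1

    # 3. Optional single-letter prefix.
    prefix = None
    if pos < len(text) and text[pos] in allowed_letters:
        prefix = text[pos]
        pos += 1

    # 4. Mandatory number: one or more digits.
    start = pos
    while pos < len(text) and text[pos] in DIGITS:
        pos += 1
    if pos == start:
        return None
    number = text[start:pos]

    # 5. Type E requires a separator before the (optional) postfix.
    if is_type_e:
        if pos >= len(text) or text[pos] not in SEPARATORS:
            return None
        pos += 1

    # 6. Optional single-letter postfix.
    postfix = None
    if pos < len(text) and text[pos] in allowed_letters:
        postfix = text[pos]
        pos += 1

    # 7. Nothing may follow the code.
    if pos != len(text):
        return None
    return {"type": code_type, "prefix": prefix, "number": number, "postfix": postfix}


if __name__ == "__main__":
    print(parse_product_code("AB-X12Y"))
    print(parse_product_code("E a7-b"))
    print(parse_product_code("C-a12"))  # lowercase prefix is not valid for type C
```

The per-type differences end up as ordinary branches on is_type_e, which is the kind of post-match checking the original message wanted to avoid in the regex, but here it sits next to the rest of the syntax rules.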
participants (3)
- Barry Scott
- MRAB
- vencik@razdva.cz