Add regex pattern literal p""

We can use this literal to represent a compiled pattern, for example:
p"(?i)[a-z]".findall("a1B2c3") ['a', 'B', 'c']
rp'\W+'.split('Words, words, words.') ['Words', 'words', 'words', '']
This would allow the peephole optimizer to store the compiled pattern in the .pyc file, giving a performance optimization analogous to the existing replacement of constant sets by frozensets in .pyc files. An issue like [1] could then be solved cleanly.

[1] Optimize base64.b16decode to use compiled regex:
    https://bugs.python.org/issue35559

Two shortcomings:

1. It elevates a class in a module (re.Pattern) to language level, which does not sound very natural and makes Python look like Perl.
2. We couldn't use the regex module as a drop-in replacement: import regex as re

IMHO, I would like to see the regex module adopted into the stdlib after cutting off its "full case-folding" and "fuzzy matching" features.

Related links:

[2] Chris Angelico conceived of "compiled regexes be stored in .pyc file" in March 2013:
    https://mail.python.org/pipermail/python-ideas/2013-March/020043.html
[3] Ken Hilton conceived of "Give regex operations more sugar" in June 2018:
    https://mail.python.org/pipermail/python-ideas/2018-June/051395.html
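
As a rough illustration (the desugaring below is an assumption, not part of the proposal), the literal would behave like an re.compile() call whose result could additionally be frozen into the .pyc instead of being rebuilt on every import:

    import re

    # Hypothetical desugaring: p"(?i)[a-z]" acts like a pattern compiled
    # once, ideally at bytecode-compilation time, not at import time.
    _p = re.compile(r"(?i)[a-z]")     # stand-in for p"(?i)[a-z]"
    print(_p.findall("a1B2c3"))       # ['a', 'B', 'c']
    print(re.compile(r"\W+").split("Words, words, words."))
    # ['Words', 'words', 'words', '']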

On Thu, Dec 27, 2018 at 10:49 PM Ma Lin <malincns@163.com> wrote:
Before discussing something specific like regex literal syntax, I would love to see a way to measure that sort of performance difference. Does anyone here have MacroPy experience or something and could mock something up that would precompile and save a regex? In theory, it would be possible to tag ANY value as "constant once evaluated" and have it saved in the pyc. It'd be good to know just how much benefit this precompilation actually grants.
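
Lacking a MacroPy mock-up, a rough stopgap measurement (the methodology here is an assumption of this sketch, not anything from the thread) is to time a fresh re.compile() per use against a pattern compiled once:

    import timeit

    setup = "import re; pat = re.compile(r'[^0-9A-F]')"
    # Note: re.compile() hits the module's internal cache on repeat calls,
    # so this mostly measures the cache-lookup overhead that a pattern
    # stored in the .pyc would avoid.
    per_call = timeit.timeit("re.compile(r'[^0-9A-F]').search('DEADBEEF')",
                             setup=setup, number=100_000)
    reused = timeit.timeit("pat.search('DEADBEEF')",
                           setup=setup, number=100_000)
    print(f"compile each call: {per_call:.3f}s, reuse compiled: {reused:.3f}s")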
Wow that's an old post of mine :) ChrisA

Ma Lin schrieb am 27.12.18 um 14:15:
That's from the external regex package, not the stdlib re module.
Look a little closer:
What this does, essentially, is to make the pickle loader pass the original regex pattern string into re.compile() to "unpickle" it. Meaning, it compiles the regex on the way in. Thus, there isn't much to gain from using (the current form of) regex pickling here. I'm not saying that this can't be changed, but personally, this is exactly what I would do if I was asked to make a compiled regex picklable. Everything else would probably get you into portability hell. Stefan
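
To see this behaviour directly, a minimal round-trip with the stdlib re module:

    import pickle
    import re

    pat = re.compile(r"\d+", re.ASCII)
    restored = pickle.loads(pickle.dumps(pat))
    # Only the pattern string and flags travelled through the pickle;
    # unpickling recompiled the pattern from them.
    assert restored.pattern == pat.pattern
    assert restored.flags == pat.flags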

Reply to Stefan Behnel and Chris Angelico. On 18-12-27 22:42, Stefan Behnel wrote:
Yes, the re module only pickles the pattern string and flags, so it's safe for cross-version pickle/unpickle. The re module's pickle code:

    def _pickle(p):
        return _compile, (p.pattern, p.flags)

    copyreg.pickle(Pattern, _pickle, _compile)

On 18-12-28 1:27, Chris Angelico wrote:
I have followed the regex module for a year; it does pickle the compiled data. This is its code:

    def _pickle(pattern):
        return _regex.compile, pattern._pickled_data

    _copy_reg.pickle(Pattern, _pickle)

    // in the _regex.c file
    self->pickled_data = Py_BuildValue("OnOOOOOnOnn", pattern, flags,
        code_list, groupindex, indexgroup, named_lists,
        named_list_indexes, req_offset, required_chars, req_flags,
        public_group_count);
    if (!self->pickled_data) {
        Py_DECREF(self);
        return NULL;
    }

On Fri, Dec 28, 2018 at 12:15 AM Ma Lin <malincns@163.com> wrote:
What Stefan pointed out regarding the stdlib's "re" module is also true of the third party "regex" - unpickling just compiles from the original string. Regarding pyc files, though, pickle is less significant than marshal. And both re.compile() and regex.compile() return unmarshallable objects. Fortunately, marshal doesn't need to produce cross-compatible files, so the portability issues don't apply. So, let's suppose that marshalling a compiled regex became possible. It would need to be (a) absolutely guaranteed to have the same effect as compiling the original text string, and (b) faster than compiling the original text string, otherwise it's useless. This is where testing would be needed: can it actually save any significant amount of time?
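
The "unmarshallable" part is easy to confirm today (a small check; the exact message may differ between versions):

    import marshal
    import re

    try:
        marshal.dumps(re.compile(r"\d+"))
    except ValueError as exc:
        # marshal only handles core types; compiled patterns are refused
        print(exc)   # e.g. "unmarshallable object"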
Wow that's an old post of mine
I searched on Google before posting this; I hope there is no omission.
You're absolutely fine :) I was amused to find that a post of mine from nearly six years ago should be the most notable on the subject, is all. Good work digging it up. ChrisA

There are some other advantages to this. For me the most interesting is that we could tell more easily from the code that something is a regex. For my mutation tester mutmut I have an experimental regex mutation system, but it just feels wrong to write hacky heuristics to guess whether a string is a regex. And it's complicated to look at too much context (although I'm working on ways to make that type of thing radically nicer to do). It would be much nicer if I could just know based on the AST node type. I guess the same goes for static analyzers. / Anders
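
For contrast, here is roughly the kind of heuristic such a tool must use today - a sketch that only catches the simplest re.compile(<string literal>) spelling and misses every other way a pattern can appear:

    import ast

    source = 'import re\nPAT = re.compile(r"[a-z]+")\ns = "[a-z]+"\n'

    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "compile"
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "re"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            print("regex literal:", node.args[0].value)   # [a-z]+

    # The bare string s = "[a-z]+" is indistinguishable from any other
    # string; a dedicated AST node type would remove the guesswork.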

Maybe this literal will encourage people to finish every task with regex, perhaps even lead to regex abuse; will this change Python's style? What's worse, people may mix both manners in the same project:

    one_line.split(',')
    ...
    p','.split(one_line)

Maybe it will break Python's style and reduce code readability; is this worth it?

On Fri, Dec 28, 2018 at 1:56 AM Ma Lin <malincns@163.com> wrote:
The bar for introducing a new type of literal should be very high. Do performance numbers show this change would have a large impact on a large number of libraries and programs? In my opinion, only if this change would make 50% of programs run 50% faster might it be worth discussing. The damage to readability, the burden of changing syntax, and the burden of yet another language feature for newcomers to learn are too high. Cheers, Yuval

On Mon, Dec 31, 2018 at 12:48:56AM -0800, Yuval Greenfield wrote:
In my opinion, only if this change would make 50% of programs run 50% faster then it might be worth discussing.
What if it were 100% of programs 25% faster? *wink* Generally speaking, we don't introduce new syntax as a speed optimization. The main reasons to introduce syntax are convenience and improving the expressiveness of code. That's why we usually prefer operators like + and == to functions add() and equal(). There's nothing a list comprehension can do that a for-loop can't, but list comps are often more expressive. And the class statement is just syntactic sugar for type(name, bases, dict), but much more convenient. In this specific case, I don't think that regex literals would add much expressiveness:

    regex = re.compile(r"...")
    regex = p("...")

is not that much different. -- Steve

True, but when the literal is put somewhere far from the compile() call it becomes a problem for static analysis. Conceptually a regex is not a string but an embedded foreign language. That's why I think this discussion is worth having. It would be nice to have a way to mark up foreign languages that carried some other advantages, so people would be incentivised to use it, but just a standardized way to mark them with comments would be fine too, I think. Maybe the discussion should be expanded to cover the general case of embedded foreign languages? SQL, HTML, CSS and (obviously) regex come to mind. One could also think of C for stuff like CFFI. / Anders

I am a full -1 on this idea -
Sorry for sounding over-reactive, but yes, this could make Python look like Perl. I think one full advantage of Python is exactly that regexps are treated fairly, with no special syntax: you call a function, or build an instance, and have the full regex power, and that is it. And you can plug in any third-party regex module, and it will work just like the one that is built into the language. This proposal at least keeps the quotes, so we don't end up like JavaScript, which has a "squeashy regexy" thing that can sneak in code, where you are never sure when it is run, or even whether it can be assigned to a variable at all. I am quite sure that if the matter is performance, a way to pickle or otherwise store pre-compiled regexes can be found without requiring special syntax. And a third shortcoming: flags can't be passed as parameters and have to be built into the regexps themselves, further complicating readability even for very simple regular expressions. Other than that, it would not be much different from the f"" strings thing, indeed. On Thu, 27 Dec 2018 at 09:49, Ma Lin <malincns@163.com> wrote:

On 18-12-28 22:54, Joao S. O. Bueno wrote:
I'm thinking: if people have to ask these questions in their mind when reading a piece of Python code:

1. "Is this Python code?"
2. "What's the purpose of this code?"
3. "How can I modify it if I want to ...?"

then maybe Python is on a doubtful path. There is an interesting question: will a p"" literal ruin the style of Python (or of other dynamic languages like Ruby)? Why would this happen?

For regular strings one can write "aaa" + "bbb", which also works for f-strings, r-strings, etc. In regular expressions, however, there is, e.g., parameter counting and references to numbered matches. How would that be dealt with in a compound p-string? Either it would have to be re-compiled or not; either way could lead to unexpected results:

    p"(\d)\1" + p"(\s)\1"

or

    p"^(\w)" + p"^(\d)"

Regular strings can be added, but the results of p-strings could not - well, they are not strings. This brings me to the point that the key difference is that f- and r-strings actually return strings, whereas a p-string would return a different kind of object. That would seem very confusing to novices - and also for the language standard as a whole. -Alexander

On Sat, Dec 29, 2018 at 04:29:32PM +1100, Alexander Heger wrote:
What does Perl do?
p"(\d)\1" + p"(\s)\1"
Since + is used for concatenation, that would obviously be the same as:

    p"(\d)\1(\s)\1"

Whether it gets done at compile-time or run-time depends on how smart the peephole optimiser is. If it is smart enough to recognise regex literals, it could fold the two strings together and regex-compile them at Python-compile time; otherwise it could be equivalent to:

    _t1 = re.compile(r"(\d)\1")            # compile-time
    _t2 = re.compile(r"(\s)\1")            # compile-time
    re.compile(_t1.pattern + _t2.pattern)  # run-time

Obviously that defeats the purpose of using a p"" pre-compiled regex object, but the answer to that is either:

1. Don't do that then; or
2. We had better make sure the peephole optimizer is smarter.

Or we just ban concatenation. "P-strings" aren't strings, even though they look like them.
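
For reference, banning concatenation would match the status quo: compiled patterns don't support + today (the exact error text varies by version):

    import re

    a = re.compile(r"(\d)\1")
    b = re.compile(r"(\s)\1")
    try:
        a + b
    except TypeError as exc:
        # e.g. unsupported operand type(s) for +: 're.Pattern' and 're.Pattern'
        print(exc)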
This brings me to the point that the key difference is that f- and r- strings actually return strings,
To be precise, f-"strings" are actually code that returns a string when executed at runtime; r-strings are literal syntax for strings.
Indeed. Perhaps something like \\regex\\ would be better, *if* this feature is desired. -- Steve

On Sat, Dec 29, 2018 at 12:30 AM Alexander Heger <python@2sn.net> wrote:
Isn't this a feature, not a bug, of encouraging literals to be specified as patterns: addition of patterns would raise an error (as is currently the case for addition of compiled patterns in the re and regex modules)? Currently, I find it easiest to use r-strings for patterns and call re.search() etc. without precompiling them, which means that I could accidentally concatenate two patterns together that would silently produce an unmatchable pattern. Using p-literals for most patterns would mean I have to be explicit in the exceptional case where I do want to assemble a pattern from multiple parts:

    FIRSTNAME = p"[A-Z][-A-Za-z']+"
    LASTNAME = p"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?"
    FULLNAME = FIRSTNAME + p' ' + LASTNAME  # error

    FIRSTNAME = r"[A-Z][-A-Za-z']+"
    LASTNAME = r"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?"
    FULLNAME = re.compile(FIRSTNAME + ' ' + LASTNAME)  # success

Another potential advantage is that an ill-formed p-literal (such as a mismatched parenthesis) would be caught immediately, rather than when it is first used. This could pay off, for example, if I am defining a data structure with a bunch of regexes that would get used for different input. (But there may be performance tradeoffs here.)
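
The early-failure behaviour can be previewed with an explicit compile; a p-literal would move the same error from first use to bytecode-compilation time (message text varies by version):

    import re

    try:
        re.compile(r"(unclosed")
    except re.error as exc:
        # e.g. "missing ), unterminated subpattern at position 0"
        print(exc)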
The b prefix produces a bytes literal. Is a bytes object a kind of string, more so than a regex pattern is? I could see an argument that bytes is a particular encoding of sequential character data, whereas a regex pattern represents a string *language*, i.e. an abstraction over string data. But...this distinction starts to feel very theoretical rather than practical. If novices are expected to read code with regular expressions in it, why would they have trouble understanding that the "p" prefix means "pattern"? As someone who works with text a lot, I think there's a decent practicality-beats-purity argument in favor of p-literals, which would make regex operations more easily accessible and prevent patterns from being mixed up with string data. A potential downside, though, is that it will be tempting to introduce flags as prefixes, too. Do we want to go down the road of pui"my Unicode-compatible case-insensitive pattern"? Nathan

I don't see a justification for baking REs into the syntax of Python. In the Python world, REs are just one tool in a toolbox containing a great many tools. What's more, it's a tool that should be used with considerable reluctance, because REs are essentially unreadable, so every time you use one you're creating a maintenance headache. This quality is quite the opposite of what one would expect from a core language feature. -- Greg

Well, it requires some experience to read REs. I have written many, and I still need to test even basic ones thoroughly to make sure they really do what they are supposed to do. And then there is the issue that there are many different implementations; what you have to escape, etc., varies between Python (raw and regular strings), emacs, grep, overleaf, ... Never mind; my main point is that they return an object that is qualitatively different from a string, for example in terms of concatenation. I also think it is too specialised, and time-critical constant REs can be stored in the module body, etc., if need be. I do that. But since this is the ideas mailing list, and taking this thread on an excursion: maybe an "addition" operator could be defined for REs, such that

    re.compile(s1 + s2) == re.compile(s1) + re.compile(s2)

with the restriction that s1 and s2 are strings that are each valid REs. Even that would leave questions about how to deal with compile flags; they should probably be treated the same as if they were embedded at the beginning of each string. -Alexander
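
A minimal sketch of such an addition operator as a wrapper class (the class name and the flag handling are assumptions of this sketch; proper per-part flag scoping would need (?flags:...) groups rather than a plain OR):

    import re

    class AddablePattern:
        """Hypothetical wrapper giving compiled REs a '+' operator."""

        def __init__(self, pattern, flags=0):
            self.pattern = pattern
            self.flags = flags
            self._compiled = re.compile(pattern, flags)

        def __add__(self, other):
            # Naive: concatenate the pattern strings and OR the flags.
            # Cruder than treating flags as embedded per part.
            return AddablePattern(self.pattern + other.pattern,
                                  self.flags | other.flags)

        def __getattr__(self, name):
            # Delegate search/match/findall/... to the compiled pattern.
            return getattr(self._compiled, name)

    combined = AddablePattern(r"[A-Z]") + AddablePattern(r"\d+")
    print(combined.pattern)        # [A-Z]\d+
    print(combined.match("A42"))   # <re.Match object; span=(0, 3), ...>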

I have a compromise idea; here are some points:

1. Create a built-in class `pattern_str`, a subclass of `str` dedicated to regex pattern strings.
2. Use p"" to represent `pattern_str`.

Some advantages:

1. Since it's a subclass of `str`, we can use it as a normal `str`.
2. IDEs/linters/compilers can identify it as a regex pattern, something like a type hint at the language level.
3. We can still store the compiled pattern in the .pyc file *quietly*.
4. It won't introduce Perl style into Python, which avoids abusing regex to some degree.

We still use regex in the old way:

    import re
    re.search(p"(?i)[a-z]", s)

But if re.search() finds that the pattern is a `pattern_str`, it loads the compiled pattern from the .pyc file directly.
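
The visible half of this compromise can be sketched in pure Python today (the .pyc side cannot; `pattern_str` and the dispatching search() below are illustrative stand-ins, not a real API):

    import re

    class pattern_str(str):
        """Marker subclass: a str that declares itself a regex pattern."""
        __slots__ = ()

    def search(pattern, string, flags=0):
        # Stand-in for re.search() dispatching on the marker type.
        if isinstance(pattern, pattern_str):
            # A real implementation would fetch the pattern pre-compiled
            # into the .pyc here; this sketch just falls through.
            pass
        return re.search(pattern, string, flags)

    print(search(pattern_str(r"(?i)[a-z]"), "a1B2c3"))   # matches 'a'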

On Thu, 27 Dec 2018 19:48:40 +0800 Ma Lin <malincns@163.com> wrote:
The simple solution to the perceived performance problem (not sure how much of a problem it is in real life) is to have a stdlib function that lazily-compiles a regex (*). Just like "re.compile", but lazy: you don't bear the cost of compiling when simply importing the module, but once the pattern is compiled, there is no overhead for looking up a global cache dict. No need for a dedicated literal. (*) Let's call it "re.pattern", for example. Regards Antoine.
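
A sketch of what such a lazy wrapper could look like (the behaviour described above, not an existing API; "re.pattern" is only Antoine's suggested name):

    import re

    class _LazyPattern:
        """Compile on first use; afterwards no cache lookup is needed."""

        __slots__ = ("_pattern", "_flags", "_compiled")

        def __init__(self, pattern, flags=0):
            self._pattern = pattern
            self._flags = flags
            self._compiled = None

        def __getattr__(self, name):
            # Only reached for search/match/findall/...; compile once here.
            if self._compiled is None:
                self._compiled = re.compile(self._pattern, self._flags)
            return getattr(self._compiled, name)

    WORD = _LazyPattern(r"\w+")    # no compile cost at import time
    print(WORD.findall("a b c"))   # compiled here, exactly once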

On 31.12.2018 12:23, Antoine Pitrou wrote:
No need for a new function :-) We already have re.search() and re.match(), which deal with compilation on the fly and caching. Perhaps the documentation should hint at this more explicitly... https://docs.python.org/3.7/library/re.html -- Marc-Andre Lemburg

Le 31/12/2018 à 12:31, M.-A. Lemburg a écrit :
The complaint is that the global cache is still too costly. See measurements in https://bugs.python.org/issue35559 Regards Antoine.

On 18-12-31 19:47, Antoine Pitrou wrote:
The complaint is that the global cache is still too costly. See measurements in https://bugs.python.org/issue35559
In this issue, using a global variable `_has_non_base16_digits` [1] gives a ~30% speedup. Is the re module's internal cache [2] really that bad? If we rewrote the re module's cache in C with a custom data structure, maybe we would get a small speedup.

[1] `_has_non_base16_digits` in PR 11287:
    https://github.com/python/cpython/pull/11287/files
[2] The re module's internal cache code:
    https://github.com/python/cpython/blob/master/Lib/re.py#L268-L295

    _cache = {}  # ordered!
    _MAXCACHE = 512

    def _compile(pattern, flags):
        # internal: compile pattern
        if isinstance(flags, RegexFlag):
            flags = flags.value
        try:
            return _cache[type(pattern), pattern, flags]
        except KeyError:
            pass
        ...

On 19-1-1 21:39, Stefan Behnel wrote:
Apply this patch:

     def _compile(pattern, flags):
         # internal: compile pattern
    -    if isinstance(flags, RegexFlag):
    -        flags = flags.value
    +    try:
    +        flags = int(flags)
    +    except:
    +        pass
         try:
             return _cache[type(pattern), pattern, flags]
         except KeyError:

Then run this benchmark on my Raspberry Pi 3B:

    import perf

    runner = perf.Runner()
    runner.timeit(name="compile_re",
                  stmt="re.compile(b'[^0-9A-F]')",
                  setup="import re")

Mean +- std dev: [a] 7.71 us +- 0.09 us -> [b] 6.74 us +- 0.10 us: 1.14x faster (-13%)

Looks great.

participants (14)
- Alexander Heger
- Anders Hovmöller
- Antoine Pitrou
- Antoine Pitrou
- Chris Angelico
- Greg Ewing
- Joao S. O. Bueno
- M.-A. Lemburg
- Ma Lin
- MRAB
- Nathan Schneider
- Stefan Behnel
- Steven D'Aprano
- Yuval Greenfield