Add regex pattern literal p""
We can use this literal to represent a compiled pattern, for example:
>>> p"(?i)[a-z]".findall("a1B2c3")
['a', 'B', 'c']
>>> compiled = p"(?<=abc)def"
>>> m = compiled.search('abcdef')
>>> m.group(0)
'def'
>>> rp'\W+'.split('Words, words, words.')
['Words', 'words', 'words', '']
This allows the peephole optimizer to store the compiled pattern in the .pyc file, giving a performance optimization similar to replacing a constant set with a frozenset in the .pyc file. Then issues like [1] could be solved perfectly.

[1] Optimize base64.b16decode to use compiled regex
    https://bugs.python.org/issue35559

Two shortcomings:

1. Elevating a class in a module (re.Pattern) to language level doesn't sound very natural. It makes Python look like Perl.

2. We can't use the regex module as a drop-in replacement: import regex as re. IMHO, I would like to see the regex module adopted into the stdlib after cutting off its "full case-folding" and "fuzzy matching" features.

Related links:

[2] Chris Angelico conceived of "compiled regexes be stored in .pyc file" in March 2013.
    https://mail.python.org/pipermail/python-ideas/2013-March/020043.html

[3] Ken Hilton conceived of "Give regex operations more sugar" in June 2018.
    https://mail.python.org/pipermail/python-ideas/2018-June/051395.html
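For comparison, everything in the examples above can be written with today's stdlib, minus the .pyc persistence. A minimal sketch of the closest existing idiom, compiling once at import time and reusing the Pattern objects:

```python
import re

# Compile once at import time; reuse the Pattern object everywhere.
# This is the moral equivalent of the proposed p"" literal, except that
# compilation still happens at import time rather than being loaded
# pre-compiled from the .pyc file.
LETTER = re.compile(r"(?i)[a-z]")
LOOKBEHIND = re.compile(r"(?<=abc)def")
NONWORD = re.compile(r"\W+")

print(LETTER.findall("a1B2c3"))                    # ['a', 'B', 'c']
print(LOOKBEHIND.search("abcdef").group(0))        # 'def'
print(NONWORD.split("Words, words, words."))       # ['Words', 'words', 'words', '']
```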
On Thu, Dec 27, 2018 at 10:49 PM Ma Lin wrote:
We can use this literal to represent a compiled pattern, for example:
>>> p"(?i)[a-z]".findall("a1B2c3")
['a', 'B', 'c']
>>> compiled = p"(?<=abc)def"
>>> m = compiled.search('abcdef')
>>> m.group(0)
'def'
>>> rp'\W+'.split('Words, words, words.')
['Words', 'words', 'words', '']
This allows peephole optimizer to store compiled pattern in .pyc file, we can get performance optimization like replacing constant set by frozenset in .pyc file.
Before discussing something specific like regex literal syntax, I would love to see a way to measure that sort of performance difference. Does anyone here have MacroPy experience or something and could mock something up that would precompile and save a regex?

In theory, it would be possible to tag ANY value as "constant once evaluated" and have it saved in the pyc. It'd be good to know just how much benefit this precompilation actually grants.
[2] Chris Angelico conceived of "compiled regexes be stored in .pyc file" in March 2013. [2] https://mail.python.org/pipermail/python-ideas/2013-March/020043.html
Wow that's an old post of mine :) ChrisA
It'd be good to know just how much benefit this precompilation actually grants.
As far as I know, Pattern objects in regex module can be pickled, don't know if it's useful.
>>> import pickle
>>> import regex
>>> p = regex.compile('[a-z]')
>>> b = pickle.dumps(p)
>>> p = pickle.loads(b)
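For a self-contained illustration with the stdlib re module (which registers its pickling through the same copyreg mechanism), the round-trip looks like this:

```python
import pickle
import re

p = re.compile('[a-z]')
b = pickle.dumps(p)   # serializes the pattern string and flags
q = pickle.loads(b)   # re-compiles the pattern on load

print(q.pattern)                   # '[a-z]'
print(q.search('A1b2').group(0))   # 'b'
```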
Wow that's an old post of mine

I searched on Google before posting this; I hope there is no omission.
Ma Lin wrote on 27.12.18 at 14:15:
It'd be good to know just how much benefit this precompilation actually grants.
As far as I know, Pattern objects in regex module can be pickled, don't know if it's useful.
>>> import pickle
>>> import regex
That's from the external regex package, not the stdlib re module.
>>> p = regex.compile('[a-z]')
>>> b = pickle.dumps(p)
>>> p = pickle.loads(b)
Look a little closer:
>>> import pickle, re
>>> p = re.compile("[abc]")
>>> pickle.dumps(p)
b'\x80\x03cre\n_compile\nq\x00X\x05\x00\x00\x00[abc]q\x01K \x86q\x02Rq\x03.'
What this does, essentially, is to make the pickle loader pass the original regex pattern string into re.compile() to "unpickle" it. Meaning, it compiles the regex on the way in. Thus, there isn't much to gain from using (the current form of) regex pickling here.

I'm not saying that this can't be changed, but personally, this is exactly what I would do if I was asked to make a compiled regex picklable. Everything else would probably get you into portability hell.

Stefan
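A quick way to see this for yourself: the pickled bytes contain the original pattern text and a reference to re._compile, not anything precompiled (re._compile is an internal helper, inspected here only for illustration):

```python
import pickle
import re

data = pickle.dumps(re.compile("[abc]"))

# The source pattern travels verbatim inside the pickle, and unpickling
# calls re._compile(pattern, flags) -- i.e. a full recompilation on load.
print(b"[abc]" in data)      # True
print(b"_compile" in data)   # True
```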
Reply to Stefan Behnel and Chris Angelico. On 18-12-27 22:42, Stefan Behnel wrote:
>>> import pickle, re
>>> p = re.compile("[abc]")
>>> pickle.dumps(p)
b'\x80\x03cre\n_compile\nq\x00X\x05\x00\x00\x00[abc]q\x01K \x86q\x02Rq\x03.'
What this does, essentially, is to make the pickle loader pass the original regex pattern string into re.compile() to "unpickle" it. Meaning, it compiles the regex on the way in. Thus, there isn't much to gain from using (the current form of) regex pickling here.
Yes, the re module only pickles the pattern string and flags; it's safe for cross-version pickle/unpickle. The re module's pickle code:

    def _pickle(p):
        return _compile, (p.pattern, p.flags)

    copyreg.pickle(Pattern, _pickle, _compile)

On 18-12-28 1:27, Chris Angelico wrote:
What Stefan pointed out regarding the stdlib's "re" module is also true of the third party "regex" - unpickling just compiles from the original string.
I had followed the regex module for a year; it does pickle the compiled data. This is its code:

    def _pickle(pattern):
        return _regex.compile, pattern._pickled_data

    _copy_reg.pickle(Pattern, _pickle)

And in the _regex.c file:

    self->pickled_data = Py_BuildValue("OnOOOOOnOnn", pattern, flags,
                                       code_list, groupindex, indexgroup,
                                       named_lists, named_list_indexes,
                                       req_offset, required_chars, req_flags,
                                       public_group_count);
    if (!self->pickled_data) {
        Py_DECREF(self);
        return NULL;
    }
On Fri, Dec 28, 2018 at 12:15 AM Ma Lin wrote:
It'd be good to know just how much benefit this precompilation actually grants.
As far as I know, Pattern objects in regex module can be pickled, don't know if it's useful.
>>> import pickle
>>> import regex
>>> p = regex.compile('[a-z]')
>>> b = pickle.dumps(p)
>>> p = pickle.loads(b)
What Stefan pointed out regarding the stdlib's "re" module is also true of the third party "regex" - unpickling just compiles from the original string.

Regarding pyc files, though, pickle is less significant than marshal. And both re.compile() and regex.compile() return unmarshallable objects. Fortunately, marshal doesn't need to produce cross-compatible files, so the portability issues don't apply.

So, let's suppose that marshalling a compiled regex became possible. It would need to be (a) absolutely guaranteed to have the same effect as compiling the original text string, and (b) faster than compiling the original text string, otherwise it's useless. This is where testing would be needed: can it actually save any significant amount of time?
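As a baseline for that testing: today marshal refuses compiled patterns outright, so any experiment has to start by teaching marshal about a new type. A minimal check with the stdlib as it stands:

```python
import marshal
import re

p = re.compile("[abc]")
try:
    # marshal only handles a fixed set of core types (code objects,
    # ints, strings, tuples, ...); Pattern is not one of them.
    marshal.dumps(p)
except ValueError as exc:
    print(exc)   # marshal raises ValueError for unsupported objects
```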
Wow that's an old post of mine

I searched on Google before post this, hope there is no omission.
You're absolutely fine :) I was amused to find that a post of mine from nearly six years ago should be the most notable on the subject, is all. Good work digging it up. ChrisA
We can use this literal to represent a compiled pattern, for example:
>>> p"(?i)[a-z]".findall("a1B2c3")
['a', 'B', 'c']
There are some other advantages to this. For me the most interesting is that we can know from code easier that something is a regex. For my mutation tester mutmut I have an experimental regex mutation system but it just feels wrong to write hacky heuristics to guess if a string is a regex. And it's complicated to look at too much context (although I'm working on ways to make that type of thing radically nicer to do). It would be much nicer if I could just know based on the AST node type. I guess the same goes for static analyzers. / Anders
On 2018-12-27 11:48, Ma Lin wrote: [snip]
2, We can't use regex module as a drop-in replacement: import regex as re IMHO, I would like to see regex module be adopted into stdlib after cutting off its "full case-folding" and "fuzzy matching" features.
I think that omitting full casefolding would be a bad idea; after all, strings (in Python 3) have a .casefold method.
On Thu, Dec 27, 2018 at 05:47:46PM +0000, MRAB wrote:
On 2018-12-27 11:48, Ma Lin wrote: [snip]
2, We can't use regex module as a drop-in replacement: import regex as re IMHO, I would like to see regex module be adopted into stdlib after cutting off its "full case-folding" and "fuzzy matching" features.
I think that omitting full casefolding would be a bad idea; after all, strings (in Python 3) have a .casefold method.
And I don't understand why omitting fuzzy matching is a good idea. If you don't want fuzzy matching, don't use it in your code. But why remove it? -- Steve
Maybe this literal will encourage people to finish tasks using regex, and even lead to abuse of regex. Will this change Python's style?

What's worse, people might use mixed manners in the same project:

    one_line.split(',')
    ...
    p','.split(one_line)

Maybe it will break Python's style and reduce code readability. Is this worth it?
On Fri, Dec 28, 2018 at 1:56 AM Ma Lin wrote:
Maybe this literal will encourage people to finish tasks using regex, even lead to abuse regex, will this change Python's style?
What's worse is, people using mixed manners in the same project:
    one_line.split(',')
    ...
    p','.split(one_line)
Maybe it will break the Python's style, reduce code readability, is this worth it?
The bar for introducing a new type of literal should be very high. Do performance numbers show this change would have a large impact for a large amount of libraries and programs? In my opinion, only if this change would make 50% of programs run 50% faster then it might be worth discussing. The damage to readability, burden of changing syntax and burden of yet another language feature for newcomers to learn is too high. Cheers, Yuval
On Mon, Dec 31, 2018 at 12:48:56AM -0800, Yuval Greenfield wrote:
In my opinion, only if this change would make 50% of programs run 50% faster then it might be worth discussing.
What if it were 100% of programs 25% faster? *wink*

Generally speaking, we don't introduce new syntax as a speed optimization. The main reasons to introduce syntax are convenience and improving the expressiveness of code. That's why we usually prefer to use operators like + and == instead of functions add() and equal(). There's nothing a list comprehension can do that a for-loop can't, but list comps are often more expressive. And the class statement is just syntactic sugar for type(name, bases, dict), but much more convenient.

In this specific case, I don't think that regex literals will add much expressiveness:

    regex = re.compile(r"...")
    regex = p("...")

is not that much different.

-- Steve
    regex = re.compile(r"...")
    regex = p("...")
is not that much different.
True, but when the literal is put somewhere far from the compile() call, it becomes a problem for static analysis. Conceptually a regex is not a string but an embedded foreign language. That's why I think this discussion is worth having.

It would be nice to have a way to mark up foreign languages that had some other advantages, so people would be incentivised to do it, but just a way to mark them with comments would be fine too, I think, if it's standardized. Maybe the discussion should be expanded to cover the general case of embedded foreign languages? SQL, HTML, CSS and (obviously) regex come to mind. One could also think of C for stuff like CFFI.

/ Anders
I am a full -1 on this idea -
Two shortcomings:
1, Elevating a class in a module (re.Pattern) to language level, this sounds not very natural. This makes Python looks like Perl.
2, We can't use regex module as a drop-in replacement: import regex as re IMHO, I would like to see regex module be adopted into stdlib after cutting off its "full case-folding" and "fuzzy matching" features.
Sorry for sounding over-reactive, but yes, this could make Python look like Perl.

I think one full advantage of Python is exactly that regexps are treated fairly, with no special syntax. You call a function, or build an instance, and have the regex power, and that is it. And you can just plug in any third-party regex module, and it will work just like the one that is built into the language.

This proposal at least keeps the ' " ' quotes - so we don't end up like Javascript, which has a "squeashy regexy" thing that can sneak into code, and you are never sure when it is run, or even if it can be assigned to a variable at all.

I am quite sure that if the matter is performance, a way to pickle, or somehow store pre-compiled regexes, can be found without requiring special syntax.

And a 3rd shortcoming - flags can't be passed as parameters, and have to be built into the regexp themselves, further complicating the readability even for very simple regular expressions.

Other than that, it would not be much different from the ' f" ' strings thing, indeed.
On Thu, 27 Dec 2018 at 09:49, Ma Lin wrote:
Related links:
[2] Chris Angelico conceived of "compiled regexes be stored in .pyc file" in March 2013. [2] https://mail.python.org/pipermail/python-ideas/2013-March/020043.html
[3] Ken Hilton conceived of "Give regex operations more sugar" in June 2018. [3] https://mail.python.org/pipermail/python-ideas/2018-June/051395.html
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
On 18-12-28 22:54, Joao S. O. Bueno wrote:
Sorry for sounding over-reactive, but yes, this could make Python look like Perl.

Yes, this may introduce Perl's style irreversibly; we need to be cautious about this.

I'm thinking: if people ask these questions in their mind when reading a piece of Python code:

1. "Is this Python code?"
2. "What's the purpose of this code?"
3. "How can I modify it if I want to ... ?"

then maybe Python is on a doubtful way.

There is an interesting question: will the literal p"" ruin Python's (or other dynamic languages' like Ruby's) style? Why would this happen?

And a 3rd shortcoming - flags can't be passed as parameters, and have to be built into the regexp themselves, further complicating the readability even for very simple regular expressions.

IMO this is an advantage: it's hard to omit flags when reading/copying a regex pattern.
For regular strings one can write

    "aaa" + "bbb"

which also works for f-strings, r-strings, etc. In regular expressions, however, there is, e.g., parameter counting and references to numbered matches. How would that be dealt with in a compound p-string? Either it would have to be re-compiled or not; either way could lead to unexpected results:

    p"(\d)\1" + p"(\s)\1"

or

    p"^(\w)" + p"^(\d)"

Regular strings can be added, but the results of p-strings could not - well, they are not strings.

This brings me to the point that the key difference is that f- and r-strings actually return strings, whereas a p-string would return a different kind of object. That would seem certainly very confusing to novices - and also for the language standard as a whole.

-Alexander
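The back-reference concern can be shown concretely with today's plain string concatenation: in the combined pattern, the \1 from the second half still refers to group 1 (the digit), not to the whitespace group it referred to in isolation:

```python
import re

# Standalone, r"(\s)\1" means "a whitespace character, twice".
# After concatenation, its \1 silently points at the digit group instead.
combined = re.compile(r"(\d)\1" + r"(\s)\1")

print(bool(combined.search("11 1")))   # True: the trailing \1 re-matches the digit
print(bool(combined.search("11  ")))   # False: doubled whitespace no longer matches
```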
On Sat, Dec 29, 2018 at 04:29:32PM +1100, Alexander Heger wrote:
for regular strings one can write
"aaa" + "bbb"
which also works for f-strings, r-strings, etc.; in regular expressions, there is, e.g., parameter counting and references to numbered matches. How would that be dealt with in a compound p-string? Either it would have to re-compiled or not, either way could lead to unexpected results
What does Perl do?
p"(\d)\1" + p"(\s)\1"
Since + is used for concatenation, then that would obviously be the same as:

    p"(\d)\1(\s)\1"

Whether it gets done at compile-time or run-time depends on how smart the keyhole optimiser is. If it is smart enough to recognise regex literals, it could fold the two strings together and regex-compile them at python-compile time, otherwise it could be equivalent to:

    _t1 = re.compile(r"(\d)\1")  # compile-time
    _t2 = re.compile(r"(\s)\1")  # compile-time
    re.compile(_t1.pattern + _t2.pattern)  # run-time

Obviously that defeats the purpose of using a p"" pre-compiled regex object, but the answer to that is either:

1. Don't do that then; or
2. We better make sure the keyhole optimizer is smarter.

Or we just ban concatenation. "P-strings" aren't strings, even though they look like them.
This brings me to the point that the key difference is that f- and r- strings actually return strings,
To be precise, f-"strings" are actually code that returns a string when executed at runtime; r-strings are literal syntax for strings.
whereas p- string would return a different kind of object. That would seem certainly very confusing to novices - and also for the language standard as a whole.
Indeed. Perhaps something like \\regex\\ would be better, *if* this feature is desired. -- Steve
Steven D'Aprano wrote:
    _t1 = re.compile(r"(\d)\1")  # compile-time
    _t2 = re.compile(r"(\s)\1")  # compile-time
    re.compile(_t1.pattern + _t2.pattern)  # run-time
It would be weird if p"(\d)\1" + p"(\s)\1" worked but re.compile(r"(\d)\1") + re.compile(r"(\s)\1") didn't. -- Greg
On Sat, Dec 29, 2018 at 12:30 AM Alexander Heger
for regular strings one can write
"aaa" + "bbb"
which also works for f-strings, r-strings, etc.; in regular expressions, there is, e.g., parameter counting and references to numbered matches. How would that be dealt with in a compound p-string? Either it would have to re-compiled or not, either way could lead to unexpected results
p"(\d)\1" + p"(\s)\1"
or
p"^(\w)" + p"^(\d)"
regular strings can be added, but the results of p-strings could not - well, they are not strings.
Isn't this a feature, not a bug, of encouraging literals to be specified as patterns: addition of patterns would raise an error (as is currently the case for addition of compiled patterns in the re and regex modules)?

Currently, I find it easiest to use r-strings for patterns and call re.search() etc. without precompiling them, which means that I could accidentally concatenate two patterns together that would silently produce an unmatchable pattern. Using p-literals for most patterns would mean I have to be explicit in the exceptional case where I do want to assemble a pattern from multiple parts:

    FIRSTNAME = p"[A-Z][-A-Za-z']+"
    LASTNAME = p"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?"
    FULLNAME = FIRSTNAME + p' ' + LASTNAME  # error

    FIRSTNAME = r"[A-Z][-A-Za-z']+"
    LASTNAME = r"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?"
    FULLNAME = re.compile(FIRSTNAME + ' ' + LASTNAME)  # success

Another potential advantage is that an ill-formed p-literal (such as a mismatched parenthesis) would be caught immediately, rather than when it is first used. This could pay off, for example, if I am defining a data structure with a bunch of regexes that would get used for different input. (But there may be performance tradeoffs here.)
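For reference, the r-string "success" case above runs today exactly as written (the name patterns are illustrative only):

```python
import re

FIRSTNAME = r"[A-Z][-A-Za-z']+"
LASTNAME = r"[-A-Za-z']([-A-Za-z' ]+[-A-Za-z'])?"
# Plain string concatenation, compiled once at the end:
FULLNAME = re.compile(FIRSTNAME + ' ' + LASTNAME)

m = FULLNAME.match("Ada Lovelace")
print(m.group(0))   # 'Ada Lovelace'
```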
This brings me to the point that the key difference is that f- and r- strings actually return strings, whereas p- string would return a different kind of object. That would seem certainly very confusing to novices - and also for the language standard as a whole.
The b prefix produces a bytes literal. Is a bytes object a kind of string, more so than a regex pattern is? I could see an argument that bytes is a particular encoding of sequential character data, whereas a regex pattern represents a string *language*, i.e. an abstraction over string data. But...this distinction starts to feel very theoretical rather than practical. If novices are expected to read code with regular expressions in it, why would they have trouble understanding that the "p" prefix means "pattern"? As someone who works with text a lot, I think there's a decent practicality-beats-purity argument in favor of p-literals, which would make regex operations more easily accessible and prevent patterns from being mixed up with string data. A potential downside, though, is that it will be tempting to introduce flags as prefixes, too. Do we want to go down the road of pui"my Unicode-compatible case-insensitive pattern"? Nathan
-Alexander
I don't see a justification for baking REs into the syntax of Python. In the Python world, REs are just one tool in a toolbox containing a great many tools. What's more, it's a tool that should be used with considerable reluctance, because REs are essentially unreadable, so every time you use one you're creating a maintenance headache. This quality is quite the opposite of what one would expect from a core language feature. -- Greg
What's more, it's a tool that should be used with considerable reluctance, because REs are essentially unreadable, so every time you use one you're creating a maintenance headache.
Well, it requires some experience to read REs. I have written many, and I still need to test thoroughly even many basic ones to check that they really do what they are supposed to do. And then there is the issue that there are many different implementations; what you have to escape, etc., varies between python (raw and regular strings), emacs, grep, overleaf, ...

Never mind, my main point is that they return an object that is qualitatively different from a string, for example, in terms of concatenation. I also think it is too specialised, and time-critical constant REs can be stored in the module body, etc., if need be. I do that.

But since this is the ideas mailing list, and taking this thread on an excursion, maybe an "addition" operator could be defined for REs, such that

    re.compile(s1 + s2) == re.compile(s1) + re.compile(s2)

with the restriction that s1 and s2 are each strings that are valid REs. Even that would leave questions about how to deal with compile flags; they probably should be treated the same as if they were embedded at the beginning of each string.

-Alexander
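That operator could be prototyped today without any syntax change. A sketch (the class name and the flags restriction are illustrative, not a proposed stdlib API):

```python
import re

class AddablePattern:
    """Wraps a compiled pattern and defines '+' as source concatenation."""

    def __init__(self, pattern, flags=0):
        self._compiled = re.compile(pattern, flags)

    def __add__(self, other):
        # Recompile the concatenated sources; insist the flags agree,
        # per the restriction discussed above.
        if self._compiled.flags != other._compiled.flags:
            raise ValueError("cannot add patterns with different flags")
        return AddablePattern(self._compiled.pattern + other._compiled.pattern,
                              self._compiled.flags)

    def search(self, string):
        return self._compiled.search(string)

combined = AddablePattern("[a-z]+") + AddablePattern("[0-9]+")
print(combined.search("__ab12__").group(0))   # 'ab12'
```

Group renumbering and back-references would still need real work in any serious version; this only shows where the operator could live.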
I have a compromise idea. Here are some points:

1. Create a built-in class `pattern_str`, a subclass of `str`, dedicated to regex pattern strings.
2. Use p"" to represent `pattern_str`.

Some advantages:

1. Since it's a subclass of `str`, we can use it as a normal `str`.
2. IDE/linter/compiler can identify it as a regex pattern, something like a type hint at the language level.
3. We can still store the compiled pattern in the .pyc file *quietly*.
4. It won't introduce Perl style into Python, which avoids abusing regex to some degree.

We would still use regex in the old way:

    import re
    re.search(p"(?i)[a-z]", s)

But if re.search() finds that the pattern is a `pattern_str`, it loads the compiled pattern from the .pyc file directly.
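The first point needs almost no code. A sketch of such a subclass (the .pyc integration in point 3 is the hard part and is not shown here):

```python
import re

class pattern_str(str):
    """A str subclass that marks its contents as a regex pattern.

    Tools could dispatch on this type; re.* functions accept it
    unchanged, because an instance still is a str.
    """
    __slots__ = ()

p = pattern_str(r"(?i)[a-z]")
print(isinstance(p, str))          # True
print(re.findall(p, "a1B2c3"))     # ['a', 'B', 'c']
```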
On Thu, 27 Dec 2018 19:48:40 +0800, Ma Lin wrote:
We can use this literal to represent a compiled pattern, for example:
>>> p"(?i)[a-z]".findall("a1B2c3")
['a', 'B', 'c']
>>> compiled = p"(?<=abc)def"
>>> m = compiled.search('abcdef')
>>> m.group(0)
'def'
>>> rp'\W+'.split('Words, words, words.')
['Words', 'words', 'words', '']
This allows peephole optimizer to store compiled pattern in .pyc file, we can get performance optimization like replacing constant set by frozenset in .pyc file.
Then such issue [1] can be solved perfectly. [1] Optimize base64.b16decode to use compiled regex [1] https://bugs.python.org/issue35559
The simple solution to the perceived performance problem (not sure how much of a problem it is in real life) is to have a stdlib function that lazily-compiles a regex (*). Just like "re.compile", but lazy: you don't bear the cost of compiling when simply importing the module, but once the pattern is compiled, there is no overhead for looking up a global cache dict. No need for a dedicated literal. (*) Let's call it "re.pattern", for example. Regards Antoine.
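Such a lazily-compiling wrapper can be sketched in a few lines (the name and behaviour here illustrate the idea; this is not an existing stdlib API):

```python
import re

class lazy_pattern:
    """Defers re.compile() until the pattern is first used."""
    __slots__ = ("_pattern", "_flags", "_compiled")

    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None   # no compilation cost at import time

    def __getattr__(self, name):
        # Called only for attributes not found normally (search, match,
        # findall, ...): compile on first use, then delegate.
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return getattr(self._compiled, name)

WORD = lazy_pattern(r"\w+")      # module import stays cheap
print(WORD.findall("a b c"))     # ['a', 'b', 'c']  (compiled here, once)
```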
On 31.12.2018 12:23, Antoine Pitrou wrote:
On Thu, 27 Dec 2018 19:48:40 +0800, Ma Lin wrote:

We can use this literal to represent a compiled pattern, for example:
>>> p"(?i)[a-z]".findall("a1B2c3")
['a', 'B', 'c']
>>> compiled = p"(?<=abc)def"
>>> m = compiled.search('abcdef')
>>> m.group(0)
'def'
>>> rp'\W+'.split('Words, words, words.')
['Words', 'words', 'words', '']
This allows peephole optimizer to store compiled pattern in .pyc file, we can get performance optimization like replacing constant set by frozenset in .pyc file.
Then such issue [1] can be solved perfectly. [1] Optimize base64.b16decode to use compiled regex [1] https://bugs.python.org/issue35559
The simple solution to the perceived performance problem (not sure how much of a problem it is in real life) is to have a stdlib function that lazily-compiles a regex (*). Just like "re.compile", but lazy: you don't bear the cost of compiling when simply importing the module, but once the pattern is compiled, there is no overhead for looking up a global cache dict.
No need for a dedicated literal.
(*) Let's call it "re.pattern", for example.
No need for a new function :-) We already have re.search() and re.match() which deal with compilation on-the-fly and caching. Perhaps the documentation should hint at this more explicitly... https://docs.python.org/3.7/library/re.html -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Dec 31 2018)
Python Projects, Coaching and Consulting ... http://www.egenix.com/ Python Database Interfaces ... http://products.egenix.com/ Plone/Zope Database Interfaces ... http://zope.egenix.com/
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
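The caching that re.search() and re.match() rely on can be observed directly (re._compile is internal, used here only to show that repeated lookups return the same cached object):

```python
import re

# Module-level functions compile on first use and cache the result.
print(re.search('[a-z]+', 'A1bc').group(0))   # 'bc'

# Subsequent lookups with the same pattern/flags hit the cache:
print(re._compile('[a-z]+', 0) is re._compile('[a-z]+', 0))   # True
```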
On 31/12/2018 at 12:31, M.-A. Lemburg wrote:
We already have re.search() and re.match() which deal with compilation on-the-fly and caching. Perhaps the documentation should hint at this more explicitly...
The complaint is that the global cache is still too costly. See measurements in https://bugs.python.org/issue35559 Regards Antoine.
On 18-12-31 19:47, Antoine Pitrou wrote:
The complaint is that the global cache is still too costly. See measurements in https://bugs.python.org/issue35559
In this issue, using a global variable `_has_non_base16_digits` [1] gives a 30% speedup. Is the re module's internal cache [2] really so bad?

If we rewrite the re module's cache in C and use a custom data structure, maybe we will get a small speedup.

[1] `_has_non_base16_digits` in PR 11287
    https://github.com/python/cpython/pull/11287/files

[2] The re module's internal cache code:
    https://github.com/python/cpython/blob/master/Lib/re.py#L268-L295

    _cache = {}  # ordered!
    _MAXCACHE = 512

    def _compile(pattern, flags):
        # internal: compile pattern
        if isinstance(flags, RegexFlag):
            flags = flags.value
        try:
            return _cache[type(pattern), pattern, flags]
        except KeyError:
            pass
        ...
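Before rewriting the cache in C, a cheap experiment is to put a C-accelerated functools.lru_cache in front of re.compile and compare (a sketch for measurement, not a proposed stdlib change):

```python
import functools
import re

@functools.lru_cache(maxsize=512)
def compile_cached(pattern, flags=0):
    """Like re's internal cache, but the lookup happens in lru_cache's
    C implementation instead of pure-Python dict handling."""
    return re.compile(pattern, flags)

a = compile_cached('[^0-9A-F]')
b = compile_cached('[^0-9A-F]')
print(a is b)   # True: the second call is a pure cache hit
```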
Ma Lin wrote on 31.12.18 at 14:02:
On 18-12-31 19:47, Antoine Pitrou wrote:
The complaint is that the global cache is still too costly. See measurements in https://bugs.python.org/issue35559
In this issue, using a global variable `_has_non_base16_digits` [1] will accelerate 30%. Is re module's internal cache [2] so bad?
If rewrite re module's cache with C and use a custom data structure, maybe we will get a small speedup.
[1] `_has_non_base16_digits` in PR11287 [1] https://github.com/python/cpython/pull/11287/files
[2] re module's internal cache code: [2] https://github.com/python/cpython/blob/master/Lib/re.py#L268-L295
    _cache = {}  # ordered!
    _MAXCACHE = 512

    def _compile(pattern, flags):
        # internal: compile pattern
        if isinstance(flags, RegexFlag):
            flags = flags.value
        try:
            return _cache[type(pattern), pattern, flags]
        except KeyError:
            pass
        ...
I wouldn't be surprised if the slowest part here was the isinstance() check. Maybe the RegexFlag class could implement "__hash__()" as "return hash(self.value)" ? Stefan
On 19-1-1 21:39, Stefan Behnel wrote:
I wouldn't be surprised if the slowest part here was the isinstance() check. Maybe the RegexFlag class could implement "__hash__()" as "return hash(self.value)" ?
Apply this patch:

     def _compile(pattern, flags):
         # internal: compile pattern
    -    if isinstance(flags, RegexFlag):
    -        flags = flags.value
    +    try:
    +        flags = int(flags)
    +    except:
    +        pass
         try:
             return _cache[type(pattern), pattern, flags]
         except KeyError:

Then run this benchmark on my Raspberry Pi 3B:

    import perf
    runner = perf.Runner()
    runner.timeit(name="compile_re",
                  stmt="re.compile(b'[^0-9A-F]')",
                  setup="import re")

Result:

    Mean +- std dev: [a] 7.71 us +- 0.09 us -> [b] 6.74 us +- 0.10 us: 1.14x faster (-13%)

Looks great.
participants (14)

- Alexander Heger
- Anders Hovmöller
- Antoine Pitrou
- Antoine Pitrou
- Chris Angelico
- Greg Ewing
- Joao S. O. Bueno
- M.-A. Lemburg
- Ma Lin
- MRAB
- Nathan Schneider
- Stefan Behnel
- Steven D'Aprano
- Yuval Greenfield