JITted regex engine from pypy
Hi

I was reading a bit about the regex module and I would like to present another approach to speeding up the re module for Python.

So, as a bit of background - pypy has a re compatible module. It's also JITted and it's also exportable as a C library (that is, a library you can call from C with a C API, not a python extension module). I wonder if it would be worth putting some work into it to make it a library that CPython can use.

On the minus side, the JIT only works on x86 and x86_64; on the plus side, since it's 100% API compatible, it can be used as a _xxx speedup module relatively easily.

Do people have opinions?

Cheers, fijal
On Sun, Jun 3, 2012 at 7:49 AM, Maciej Fijalkowski
Hi
I was reading a bit about the regex module and I would like to present some other solution into speeding up the re module for Python.
So, as a bit of background - pypy has a re compatible module. It's also JITted and it's also exportable as a C library (that is a library you can call from C with C API, not a python extension module). I wonder if it would be worth to put some work into it to make it a library that CPython can use.
On the minus side, the JIT only works on x86 and x86_64, on the plus side, since it's 100% API compatible, it can be used as a _xxx speedup module relatively easy.
Do people have opinions?
A few questions and comments about such an idea, from someone who hasn't used PyPy yet and doesn't understand the setup involved.

1) Would PyPy be required to build this as a C-compatible library, such that CPython could use it as an extension module? That is, would it make PyPy a required part of building CPython?

2) Are there benchmarks comparing the performance of this implementation to the existing re module and the proposed regex module?

3) How would the maintenance work? Where would the module live "officially"? Does CPython fork it or is it extracted from PyPy in a way it can be installed as an external dependency? How does CPython get changes upstream?

4) I may be remembering wrong, but I recall maintenance ease to be one of the justifications for the regex module. How would your proposal compare? Is a random developer looking to fix a bug in his way going to find this easier or more difficult to get his head around?

The idea is interesting.
Cheers, fijal
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/ironfroggy%40gmail.com
-- Read my blog! I depend on your acceptance of my opinion! I am interesting! http://techblog.ironfroggy.com/ Follow me if you're into that sort of thing: http://www.twitter.com/ironfroggy
On Sun, Jun 3, 2012 at 3:06 PM, Calvin Spealman
On Sun, Jun 3, 2012 at 7:49 AM, Maciej Fijalkowski wrote:
Hi
I was reading a bit about the regex module and I would like to present some other solution into speeding up the re module for Python.
So, as a bit of background - pypy has a re compatible module. It's also JITted and it's also exportable as a C library (that is a library you can call from C with C API, not a python extension module). I wonder if it would be worth to put some work into it to make it a library that CPython can use.
On the minus side, the JIT only works on x86 and x86_64, on the plus side, since it's 100% API compatible, it can be used as a _xxx speedup module relatively easy.
Do people have opinions?
A few questions and comments about such an idea, from someone who hasn't used PyPy yet and doesn't understand the setup involved.
1) Would PyPy be required to build this as a C-compatible library, such that CPython could use it as an extension module? That is, would it make PyPy a required part of building CPython?
It depends a bit on how we organize things. PyPy (as the pypy repository checkout, not the pypy interpreter) would be required to build the necessary C files (and as such also for maintenance, since the C files are not hand-editable), but pypy would not be required to compile the C files.
2) Are there benchmarks comparing the performance of this implementation to the existing re module and the proposed regex module?
I don't think so. It really is reasonably fast in a lot of cases and it can definitely be made faster in more cases. The main power comes from JITting - so you compile a piece of assembler per regex created. I doubt a C library can come close to this, approach-wise. Of course there will be cases and cases, but generally speaking the approach is superior. It would be cool if someone ran benchmarks to see how things look *right now*.
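For anyone who wants to try such a comparison, here is a minimal sketch using only `timeit` from the standard library. The pattern, the sample text, and the helper name `bench` are assumptions chosen for illustration, not anything from this thread. The third-party `regex` module is only timed if it happens to be installed, and a JIT-backed engine could be added to the `engines` dict in the same way once one is importable.

```python
import re
import timeit


def bench(engine, pattern, text, repeat=3, number=200):
    """Return the best observed time (in seconds) for one search call."""
    compiled = engine.compile(pattern)
    timings = timeit.repeat(lambda: compiled.search(text),
                            repeat=repeat, number=number)
    return min(timings) / number


# Illustrative workload: filler text with one match at the very end.
text = "lorem ipsum " * 200 + "user@example.com"
pattern = r"[a-z]+@[a-z]+\.[a-z]+"

# The stdlib engine is always available; anything else is optional.
engines = {"re": re}
try:
    import regex  # third-party module, only used if installed
    engines["regex"] = regex
except ImportError:
    pass

results = {name: bench(module, pattern, text)
           for name, module in engines.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} us per search")
```

A single pattern says very little, so a real comparison would need a spread of patterns and inputs; this only shows the shape of the measurement.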
3) How would the maintenance work? Where would the module live "officially"? Does CPython fork it or is it extracted from PyPy in a way it can be installed as an external dependency? How does CPython get changes upstream?
I would honestly hope it can be maintained as a part of pypy and then cpython would just use it. But those are just hopes.
4) I may be remembering wrong, but I recall maintenance ease to be one of the justifications for the regex module. How would your proposal compare? Is a random developer looking to fix a bug in his way going to find this easier or more difficult to get his head around?
I think it's relatively easy since it's python code after all, but what do I know. Someone has to have a look; it lives here - https://bitbucket.org/pypy/pypy/src/default/pypy/rlib/rsre

I would like people to form their own opinions on whether it's more or less maintenance effort. On our side, we'll maintain this particular part of the code anyway (so it's also easier because you leave it to others).

Cheers, fijal
The embedded (in both senses of the term) use cases for CPython pretty much kill the idea, I'm afraid. Those cases are also one of the biggest question marks over incorporating regex wholesale instead of incrementally updating the existing engine to achieve feature parity. Publishing such a JIT compiled module via PyPI would be great, though. Cheers, Nick. -- Sent from my phone, thus the relative brevity :)
On Sun, Jun 3, 2012 at 3:46 PM, Nick Coghlan
The embedded (in both senses of the term) use cases for CPython pretty much kill the idea, I'm afraid.
As I said it can (and should) definitely be optional.
Those cases are also one of the biggest question marks over incorporating regex wholesale instead of incrementally updating the existing engine to achieve feature parity.
Publishing such a JIT compiled module via PyPI would be great, though.
Cheers, Nick.
-- Sent from my phone, thus the relative brevity :)
On the minus side, the JIT only works on x86 and x86_64, on the plus side, since it's 100% API compatible, it can be used as a _xxx speedup module relatively easy.
Do people have opinions?
The main concern for re is not speed, but functionality. The Python re module needs to grow a number of features, and correct a number of bugs. So 100% compatible is actually not good enough. 95% compatible (with the features added and the bugs fixed) would be better. OTOH, sharing the re code with PyPy would be a desirable goal, as would be writing the re code in Python (although SRE already implements significant parts in Python). As a speedup module, it's uninteresting - we want to simplify maintenance, not complicate it. So this can only work if it replaces SRE. Regards, Martin
On Sun, Jun 3, 2012 at 5:21 PM, "Martin v. Löwis"
On the minus side, the JIT only works on x86 and x86_64, on the plus side, since it's 100% API compatible, it can be used as a _xxx speedup module relatively easy.
Do people have opinions?
The main concern for re is not speed, but functionality. The Python re module needs to grow a number of features, and correct a number of bugs. So 100% compatible is actually not good enough. 95% compatible (with the features added and the bugs fixed) would be better.
OTOH, sharing the re code with PyPy would be a desirable goal, as would be writing the re code in Python (although SRE already implements significant parts in Python).
We did not reimplement those parts in RPython, they're still in python (so the sre engine does not accept regex, but instead the lower-level description, etc. etc.)
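To illustrate the split between the Python-level front end and the matching engine, the sketch below asks CPython's own pattern parser for the intermediate form of a small regex. It relies on private, undocumented modules (`re._parser` on Python 3.11+, `sre_parse` on older versions), so treat the import paths as assumptions that may change between releases; it shows CPython's parser only, not PyPy's rsre code.

```python
import warnings

# Assumption: the parser is importable as the private `re._parser`
# (Python 3.11+) or as the older top-level `sre_parse` module.
try:
    from re import _parser as sre_parser
except ImportError:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        import sre_parse as sre_parser

# Parse a pattern string into a sequence of (opcode, argument) pairs.
parsed = sre_parser.parse(r"ab+")
for opcode, argument in parsed:
    print(opcode, argument)
```

The engine proper consumes a compiled form derived from this kind of structure rather than the pattern string itself.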
As a speedup module, it's uninteresting - we want to simplify maintenance, not complicate it. So this can only work if it replaces SRE.
Regards, Martin
Maciej Fijalkowski
On Sun, Jun 3, 2012 at 5:21 PM, "Martin v. Löwis" wrote:
On the minus side, the JIT only works on x86 and x86_64, on the plus side, since it's 100% API compatible, it can be used as a _xxx speedup module relatively easy.
Do people have opinions?
The main concern for re is not speed, but functionality. The Python re module needs to grow a number of features, and correct a number of bugs. So 100% compatible is actually not good enough. 95% compatible (with the features added and the bugs fixed) would be better.
From my point of view, for textual data reduction, MRAB's regex module now has substantial improvements that enable very different kinds of uses, such as "named lists" and "fuzzy" matching, which I don't believe occur (together) in any other RE library. It also keeps features it shares with the existing CPython "re" library, such as the ability to handle very large REs (which IronPython, for instance, is unable to handle, apparently due to its use of the standard .NET RE library), and it does so fairly efficiently.
Bill
OTOH, sharing the re code with PyPy would be a desirable goal, as would be writing the re code in Python (although SRE already implements significant parts in Python).
We did not reimplement those parts in RPython, they're still in python (so the sre engine does not accept regex, but instead the lower-level description, etc. etc.)
As a speedup module, it's uninteresting - we want to simplify maintenance, not complicate it. So this can only work if it replaces SRE.
Regards, Martin
2012/6/3 Maciej Fijalkowski
Hi
I was reading a bit about the regex module and I would like to present some other solution into speeding up the re module for Python.
IMO, the most important feature of the regex module is that it fixes long-standing bugs and includes long-requested features, especially with respect to Unicode. That it's faster is only a windfall.

-- Regards, Benjamin
participants (6)
- "Martin v. Löwis"
- Benjamin Peterson
- Bill Janssen
- Calvin Spealman
- Maciej Fijalkowski
- Nick Coghlan