Re: [pypy-dev] Adding a feature to re

In a message of Mon, 25 Aug 2014 03:20:55 -0400, Mike Kaplinskiy writes:
>Hey folks,
>
>One of the projects I'm working on in CPython is becoming a little CPU
>bound and I was hoping to use pypy. One problem though - one of the pieces
>uses the regex library (which claims to be CPython's re-next). Running
>regex through cpyext works, but is deadly slow.
>
>From reading the docs it seems like I have a few options:
> - rewrite all of regex in Python - seems like a bad idea
> - rewrite regex to be non-python specific & use cppyy or cffi to interface
>with it. I actually looked into this & unfortunately the CPython API seems
>quite deep in there.
> - get rid of the dependency somehow. What I'm missing are named lists
>(basically "L<a>", a=["1","2"] will match 1 or 2). Unfortunately creating
>one really long re string is out of the question - I have not seen
>compile() finish with that approach. Writing a custom DFA could be on the
>table, but I was hoping to avoid that error prone step.
> - somehow factor out the part using regex and keep using CPython for it.
> - add the missing functionality to pypy's re. This seems like the path of
>least resistance.
>
>I've started looking into the sre module and it looks like quite a few bits
>(parsing & compiling to byte code mostly) are reused from CPython. I would
>have to change some of those bits. My question is then - is there any hope
>of getting these changes upstream then? Do stdlib pieces have a "no touch"
>policy?
>
>Thanks,
>Mike.

Do you know about
https://pypi.python.org/pypi/regex

If I were you, I would try to get the behaviour you want put into the
new replacement version -- which would, of course, be easiest if you
contributed the code. Then we can see about having pypy do the same ...

Laura

The regex library I meant is that very one. Named lists are a feature there
but not in cpython's or pypy's re.

On Monday, August 25, 2014, Laura Creighton <lac@openend.se> wrote:
> Do you know about
> https://pypi.python.org/pypi/regex
>
> If I were you, I would try to get the behaviour you want put into the
> new replacement version -- which would, of course, be easiest if you
> contributed the code. Then we can see about having pypy do the same ...
>
> Laura

Hi Mike,

On 26 August 2014 04:36, Mike Kaplinskiy <mike.kaplinskiy@gmail.com> wrote:
> The regex library I meant is that very one. Named lists are a feature there
> but not in cpython's or pypy's re.

The regular expression library is a bit special inside PyPy: its core
engine has to be written as RPython code in order to benefit from a
regular-expression-aware JIT. (If we wrote it in pure Python, it would
be significantly slower.) This core is a bytecode interpreter (a
different one than Python's, obviously) in a module called "_sre" ---
same name as the corresponding C module in CPython. When Python code
does "import re", on either PyPy or CPython, it is also importing some
pure Python code for the re.compile() part; only the execution of the
compiled regular expressions is done by "_sre".

What would likely be the best approach would be to add new bytecodes
to the same core engine, for example to support the named lists. These
new bytecodes would never be produced by the pure Python parts of the
"re" module, so they wouldn't have any impact on that.

Then you can write or adapt a pure Python "regex" module. It would
compile regex-compatible extended regular expressions down to a format
that can be used by the same core engine --- using the extra bytecodes
as well.

If you end up supporting the complete "regex" syntax this way, then
we'd be happy to distribute it included inside PyPy, as a
pre-installed module (or, depending on how it turns out, as a separate
module that needs to be pip-installed --- but it looks saner to
include it with PyPy anyway, given that it depends on changes to
PyPy's own built-in "_sre" module).

A bientôt,

Armin.
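
As a rough illustration of the split described above (a pure Python compile step feeding a small bytecode engine, with one extra opcode for named lists), here is a toy sketch. Everything in it is an assumption made up for illustration: the opcode names, `compile_pattern` and `match_at` are hypothetical and are not the real "_sre" opcodes, PyPy's RPython engine, or the "regex" module's API.

```python
# Toy opcodes. These names are hypothetical, not the real _sre opcode set.
LITERAL, NAMED_LIST, SUCCESS = range(3)


def compile_pattern(tokens):
    """Pure Python "compiler": turn a token list into flat bytecode.

    A str token matches itself; a ("list", name) token matches any
    entry of the named list supplied at match time.
    """
    code = []
    for tok in tokens:
        if isinstance(tok, str):
            code += [LITERAL, tok]
        else:
            kind, name = tok
            assert kind == "list"
            code += [NAMED_LIST, name]
    code.append(SUCCESS)
    return code


def match_at(code, text, pos, lists, pc=0):
    """Toy "engine": return the end offset of a match at pos, or None."""
    op = code[pc]
    if op == SUCCESS:
        return pos
    arg = code[pc + 1]
    if op == LITERAL:
        if text.startswith(arg, pos):
            return match_at(code, text, pos + len(arg), lists, pc + 2)
        return None
    if op == NAMED_LIST:
        # The extra opcode: try longer entries first and backtrack
        # to the next entry if the rest of the pattern fails.
        for item in sorted(lists[arg], key=len, reverse=True):
            if text.startswith(item, pos):
                end = match_at(code, text, pos + len(item), lists, pc + 2)
                if end is not None:
                    return end
        return None
    raise ValueError("unknown opcode %r" % (op,))


code = compile_pattern(["id=", ("list", "a"), ";"])
lists = {"a": ["1", "2", "12"]}
```

With these definitions, `match_at(code, "id=12;", 0, lists)` walks the three instructions in order and the list is consulted only when the NAMED_LIST opcode runs, so the pattern itself stays short however many entries the list holds. A compiler that never emits NAMED_LIST (as the existing "re" front end would not) is unaffected by the opcode's presence.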

participants (3): Armin Rigo, Laura Creighton, Mike Kaplinskiy