On Sun, Jun 3, 2012 at 3:06 PM, Calvin Spealman <ironfroggy@gmail.com> wrote:
On Sun, Jun 3, 2012 at 7:49 AM, Maciej Fijalkowski <fijall@gmail.com> wrote:
> Hi
>
> I was reading a bit about the regex module and I would like to present
> another approach to speeding up the re module for Python.
>
> So, as a bit of background - pypy has an re-compatible module. It's also
> JITted and it's also exportable as a C library (that is, a library you can
> call from C with a C API, not a python extension module). I wonder if it would
> be worth putting some work into it to make it a library that CPython can use.
>
> On the minus side, the JIT only works on x86 and x86_64. On the plus side,
> since it's 100% API compatible, it can be used as a _xxx speedup module
> relatively easily.
>
> Do people have opinions?

A few questions and comments about such an idea, from someone who
hasn't used PyPy yet and doesn't understand the setup involved.

1) Would PyPy be required to build this as a C-compatible library,
such that CPython could use it as an extension module? That is, would
it make PyPy a required part of building CPython?

It depends a bit on how we organize things. PyPy (meaning a checkout of the pypy repository, not the pypy interpreter) would be required to generate the necessary C files, and therefore also for maintenance, since the C files are not hand-editable. But pypy would not be required to compile those C files.
 

2) Are there benchmarks comparing the performance of this
implementation to the existing re module and the proposed regex
module?

I don't think so. It really is reasonably fast in a lot of cases, and it can definitely be made faster in more of them. The main power comes from JITting: you compile a piece of assembler per regex created. I doubt a C library can come close to this, approach-wise. Of course there will be cases and cases, but generally speaking the approach is superior. It would be cool if someone ran the benchmarks to see how things look *right now*.
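
A minimal sketch of what such a benchmark could look like, assuming only the stdlib re and timeit modules. The bench() helper, the pattern and the sample text are made up for illustration, and the third-party regex module is only included if it happens to be installed. The rsre code is not importable from CPython today, so it does not appear here; a wrapper for it would simply be one more entry in the engines dict.

```python
import re
import timeit


def bench(module, pattern, text, number=100):
    """Best-of-3 time in seconds for `number` searches with a precompiled pattern."""
    compiled = module.compile(pattern)
    timer = timeit.Timer(lambda: compiled.search(text))
    return min(timer.repeat(repeat=3, number=number))


# Engines to compare. Only the stdlib re is guaranteed to be present;
# the third-party regex module is optional.
engines = {"re": re}
try:
    import regex
    engines["regex"] = regex
except ImportError:
    pass

# Hypothetical workload: find one address at the end of some filler text.
pattern = r"(\w+)@(\w+)\.com"
text = "lorem ipsum " * 500 + "someone@example.com"

results = {name: bench(mod, pattern, text) for name, mod in engines.items()}
for name, seconds in sorted(results.items()):
    print("%-6s %.6f s" % (name, seconds))
```

A single pattern says very little, so a real comparison would need a range of patterns and inputs, and for a JITted engine it would also have to account for the time spent compiling each regex.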
 

3) How would the maintenance work? Where would the module live
"officially"? Does CPython fork it or is it extracted from PyPy in a
way it can be installed as an external dependency? How does CPython
get changes upstream?

I would honestly hope it can be maintained as part of pypy, and cpython would then just use it. But those are just hopes.
 

4) I may be remembering wrong, but I recall maintenance ease to be one
of the justifications for the regex module. How would your proposal
compare? Is a random developer looking to fix a bug in his way going
to find this easier or more difficult to get his head around?

I think it's relatively easy, since it's python code after all, but what do I know. Someone has to have a look; it lives here: https://bitbucket.org/pypy/pypy/src/default/pypy/rlib/rsre I would like people to form their own opinions on whether it's more or less maintenance effort. On our side, we'll maintain this particular part of the code anyway (so it's also easier, because you can leave it to others).
 
Cheers,
fijal