[pypy-dev] Regular Expression in PyPy

Mon Jan 31 03:19:17 CET 2005

In the checkin message of r8719, hpk wrote:
> Ok, it seems we had a mix of python 2.4 and 2.3 moduels with
> slight changes and were borrowing _sre from CPython anyway.
> So let's back out our changes in pypy/lib and see how far we
> get (if nobody minds, Seo?)

But I mind! Let me explain this...

Just before the checkin (r8718), the following files related to
regular expressions were in pypy/lib: re.py, dumbre.py, plexre.py,
sre_adapt.py, sre_parse.py. You deleted re.py, sre_adapt.py and
sre_parse.py.

re.py contained (or was intended to contain) the backend-neutral
interface and utilities, like this:

def match(pattern, string, flags=0):
    return compile(pattern, flags).match(string)

compile() memoizes its results, just as CPython's does, and
re.escape() was copied from CPython. These are all backend-neutral.

re.py imported the Pattern class from the backend. Pattern's initializer
has the signature __init__(self, pattern, flags), and the class has the
methods of the C type _sre.SRE_Pattern, such as match and search.
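
To make the layering concrete, here is a minimal sketch of how such a
backend-neutral re.py could be arranged. It is not the deleted file:
the _cache name and the stand-in Pattern class (which just treats the
pattern as a literal prefix) are assumptions for illustration only. A
real backend would supply its own Pattern.

```python
# Hypothetical stand-in for a backend's Pattern class
# (not the real dumbre.py / plexre.py code).
class Pattern:
    def __init__(self, pattern, flags):
        self.pattern = pattern
        self.flags = flags

    def match(self, string):
        # Toy behaviour: treat the pattern as a literal prefix.
        return string.startswith(self.pattern)


_cache = {}  # assumed name: memo table keyed by (pattern, flags)


def compile(pattern, flags=0):
    # Return the cached Pattern if this (pattern, flags) pair was seen before.
    key = (pattern, flags)
    try:
        return _cache[key]
    except KeyError:
        compiled = _cache[key] = Pattern(pattern, flags)
        return compiled


def match(pattern, string, flags=0):
    # Backend-neutral: only relies on Pattern(pattern, flags).match(string).
    return compile(pattern, flags).match(string)
```

Swapping backends then only means changing which Pattern class is
imported; compile() and match() stay untouched.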

There were three backends (drafts of backends) providing this
Pattern class: dumbre.py, plexre.py and sre_adapt.py. Only sre_adapt.py
imports CPython's _sre; the others don't!

You can read dumbre.py and plexre.py now, since you didn't delete
them by mistake. :-)

See, "dumbre.py" is exactly what you're referring to as "see how far
we get". It exits as soon as a regular expression is used, and reports
what was used. And this was the default backend re.py imported. (see below)
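
A backend with that behaviour can be very small. The following is only
a sketch of the idea, not the contents of dumbre.py: the message text
and the exit code are assumptions.

```python
import sys


class Pattern:
    """Hypothetical reporting backend: never matches anything."""

    def __init__(self, pattern, flags):
        # Report which regular expression was requested, then stop the program.
        sys.stderr.write(
            "regular expression used: %r (flags=%r)\n" % (pattern, flags))
        raise SystemExit(1)
```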

"plexre.py" is more interesting, and it uses Plex to implement (for now)
match(), but returns a bool rather than a full Match object. This is
sufficient for all "if re.match(pattern, string):" tests, and this lets
pickle import and run, unmodified. (As I wrote in the log of r8636...)

pickle uses a regular expression only once, as follows:

__all__.extend([x for x in dir() if re.match("[A-Z][A-Z0-9_]+$",x)])
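
To see why a bool is enough here, the same filter can be run over a
hand-written list of names. The list is made up for illustration, and
the sketch uses the ordinary re module. The result of re.match() is
only tested for truth, never inspected.

```python
import re

# Made-up sample of module-level names (not pickle's actual namespace).
names = ["HIGHEST_PROTOCOL", "dump", "MARK", "_x", "A", "BINGET2"]

# Same pattern as in pickle: an uppercase letter followed by at least
# one more uppercase letter, digit or underscore.
exported = [x for x in names if re.match("[A-Z][A-Z0-9_]+$", x)]
```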

This also means PyPy interprets all of Plex just fine. Since Plex does
regular expression parsing, NFA->DFA transformation and the state machine
all in pure Python, this is actually quite cool.

"sre_adapt.py" is, as you see, cheating. It contains a single line:

Pattern = sre_compile.compile

And this does not work. Currently C types are not successfully faked,
so _sre.SRE_Pattern instances are created (well, see below) but all
method calls result in a long traceback. I asked why on IRC a long
time ago, and heard that it's because this C type lacks __dict__.

The same applies to _random.Random, I think. That's why we have the pure
Python random.py from 2.2.3 in the lib directory now.

And why is sre_parse.py patched? Because, without the patches, no
_sre.SRE_Pattern instances will be created at all, since PyPy fails to
interpret some parts of sre_parse.py. To reproduce the problem:

(with current PyPy)
>>>> re.compile('a*')
Traceback (application level):
(snip)
AttributeError: getwidth

Why is this? I wrote my reasoning in the log of this file... r3332 and
r3456. To show what's going on, I will give an example:

class A:
    def getwidth(self):
        print 'hahaha'
class B:
    def getwidth(self):
        print 'lalala'
class C:
    def __getitem__(self, index):
        return A()
    def __getslice__(self, start, stop):
        return B()
c = C()
c[0].getwidth()
c[0:1].getwidth()

CPython prints hahaha and then lalala. PyPy prints hahaha both times. And
sre_parse uses __getslice__... And this is the only place __getslice__ is
used in the entire Python standard library! (except for UserList and
UserString)

And __getslice__ has been deprecated since release 2.0, as officially
announced in Language Reference 3.3.6.
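
For comparison, here is a sketch of the same dispatch written without
__getslice__, assuming the container decides inside __getitem__ by
checking for a slice object. This is not how sre_parse.py is written;
it only illustrates the non-deprecated protocol. The classes mirror the
example above but return strings instead of printing.

```python
class A:
    def getwidth(self):
        return 'hahaha'


class B:
    def getwidth(self):
        return 'lalala'


class C:
    def __getitem__(self, index):
        # With no __getslice__ defined, c[0:1] arrives here as a slice object.
        if isinstance(index, slice):
            return B()
        return A()


c = C()
```

With this version c[0].getwidth() gives 'hahaha' and c[0:1].getwidth()
gives 'lalala', and only __getitem__ is involved.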

Okay, can I back out your deletion now? :-)

Seo Sanghyeon