[Tutor] Entity to UTF-8 [low level C details on regexmodule.c]
Paul Tremblay
phthenry@earthlink.net
Thu May 8 18:16:02 2003
Huh! Funny, but the books always tell you "use re.compile", a rule
which I have followed faithfully, even though I use at most 10
regular expressions in my script.
Looks like I can cut out some code!
Thanks
Paul
On Thu, May 08, 2003 at 12:15:46PM -0700, Danny Yoo wrote:
>
>
>
> On Wed, 30 Apr 2003, Paul Tremblay wrote:
>
> > You probably know this already, but I thought I'd offer it
> > anyway.
> >
> > Your code has the lines:
> >
> > patt = '&#([^;]+);'
> >
> > ustr = re.sub(patt, ToUTF8, ustr)
> >
> > I believe this is inefficient, because Python has to compile the regular
> > expression each time. This code should be more efficient:
> >
> > patt = re.compile(r'&#([^;]+);')
>
>
>
> Hi Paul,
>
>
> Actually, there's a very low level implementation detail that, in the
> common case, improves our situation here. The last time I checked,
> Python's regular expression engine does cache the last few regular
> expressions that we use via the functions sub(), match(), and search().
> So it might not be so necessary to do an re.compile() in his program.
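To make this concrete, here is a minimal sketch comparing the two styles.
It is written for a current Python 3 rather than the 2003 interpreter, and
to_char() is a hypothetical stand-in for the ToUTF8 callback, whose body
isn't shown in this thread. Both calls give the same result; the only
difference is who holds on to the compiled pattern.

```python
import re

PATT = r'&#([^;]+);'


def to_char(match):
    # Hypothetical stand-in for ToUTF8: turn a decimal character
    # reference such as "&#233;" into the character it names.
    return chr(int(match.group(1)))


text = 'caf&#233; &#8364;5'

# Module-level function: 're' compiles the pattern and caches it internally.
via_module = re.sub(PATT, to_char, text)

# Explicit compile: we keep the compiled pattern object ourselves.
compiled = re.compile(PATT)
via_compiled = compiled.sub(to_char, text)
```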
>
>
> Python's current regular expression engine, 're', uses the internal module
> 'sre' by default, and there's a section of 'sre' that defines a cache of
> regular expressions:
>
>
> ### sre.py
> _cache = {}
> _cache_repl = {}
> # some code cut
> _MAXCACHE = 100
> ###
>
>
> So up to 100 regular expressions processed by Python are automatically
> compiled and saved internally in the 're' module itself. When we reuse
> an old regular expression, Python can pick it out of the cache. This
> caching behavior is not something that we should really depend on, but
> it's good to know that it's there.
>
>
>
>
>
> [C code ahead]
>
> For the curious C programmers among us, in Python 1.5.2, this sort of
> caching was much more limited: the old regex engine only cached the very
> last regular expression! The relevant code is the update_cache() function
> in Modules/regexmodule.c:
>
>
> /******/
> static PyObject *cache_pat;
> static PyObject *cache_prog;
>
> static int
> update_cache(PyObject *pat)
> {
>     PyObject *tuple = Py_BuildValue("(O)", pat);
>     int status = 0;
>
>     if (!tuple)
>         return -1;
>
>     if (pat != cache_pat) {
>         Py_XDECREF(cache_pat);
>         cache_pat = NULL;
>         Py_XDECREF(cache_prog);
>         cache_prog = regex_compile((PyObject *)NULL, tuple);
>         if (cache_prog == NULL) {
>             status = -1;
>             goto finally;
>         }
>         cache_pat = pat;
>         Py_INCREF(cache_pat);
>     }
>   finally:
>     Py_DECREF(tuple);
>     return status;
> }
> /******/
>
>
> Notice that there are some static variables here that act as memory
> between calls. The idea of update_cache() is this: on every call to a
> regular expression matching function, Python uses update_cache() to check
> whether it can reuse the work it did on the very last regex call. If the
> very last regular expression we used is the same object as the one we're
> using now (the code compares pointers), we reuse that regex object
> without recompiling the expression.
>
>
> Sorry about diving into C code like this! It's just that I thought that
> this optimization detail was cute: it covers the common case when we're
> only dealing with a single regular expression repeatedly in a loop.
>
>
> But even so, it apparently made more sense in later versions of Python to
> yank the cache out of the C code entirely, and to maintain it externally
> in the 'sre' Python module.
>
>
>
>
>
> Good luck to you!
>
>
--
************************
*Paul Tremblay *
*phthenry@earthlink.net*
************************