[Tutor] Entity to UTF-8 [low level C details on regexmodule.c]

Thu May 8 18:16:02 2003

Huh! Funny, but the books always tell you "use re.compile", a rule
which I have followed faithfully, even though I only use at most 10
regular expressions in my script.

Looks like I can cut out some code!

Thanks

Paul

On Thu, May 08, 2003 at 12:15:46PM -0700, Danny Yoo wrote:
> 
> 
> 
> On Wed, 30 Apr 2003, Paul Tremblay wrote:
> 
> > You probably know this already, but I thought I'd offer it
> > anyway.
> >
> > Your code has the lines:
> >
> > patt = '&#([^;]+);'
> >
> > ustr = re.sub(patt, ToUTF8, ustr)
> >
> > I believe this is inefficient, because Python has to compile the regular
> > expression each time.  This code should be more efficient:
> >
> > patt = re.compile(r'&#([^;]+);')
> 
> 
> 
> Hi Paul,
> 
> 
> Actually, there's a very low level implementation detail that, in the
> common case, improves our situation here.  The last time I checked,
> Python's regular expression engine does cache the last few regular
> expressions that we use via the functions sub(), match(), and search().
> So it might not be so necessary to do an re.compile() in his program.
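
For comparison, here is a minimal sketch of the two styles side by side.
It assumes Python 3 and decimal entities only; to_char() is a hypothetical
stand-in for the ToUTF8 function, which is not shown in this thread.

```python
import re

PATTERN = r'&#([^;]+);'


def to_char(match):
    """Hypothetical stand-in for ToUTF8: decimal entity -> character."""
    return chr(int(match.group(1)))


text = 'caf&#233; &#8364;5'

# Module-level function: 're' looks the pattern string up in its own cache.
via_module = re.sub(PATTERN, to_char, text)

# Explicitly compiled pattern: the compiled object is reused directly.
compiled = re.compile(PATTERN)
via_compiled = compiled.sub(to_char, text)

print(via_module == via_compiled)
```

Both calls produce the same string; the only difference is who holds on to
the compiled pattern.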
> 
> 
> Python's current regular expression engine, 're', uses the internal module
> 'sre' by default, and there's a section of 'sre' that defines a cache of
> regular expressions:
> 
> 
> ### sre.py
> _cache = {}
> _cache_repl = {}
>                            # some code cut
> _MAXCACHE = 100
> ###
> 
> 
> So up to 100 regular expressions passed to these functions are
> automatically compiled and saved internally in the 're' module itself.
> When we reuse a regular expression that is still in the cache, Python can
> pick out the compiled version instead of compiling it again.  This caching
> behavior is not something that we should really depend on, but it's good
> to know that it's there.
> 
> 
> 
> 
> 
> [C code ahead]
> 
> For the curious C programmers among us, in Python 1.5.2, this sort of
> caching was much more limited: the old regex engine only cached the very
> last regular expression!  We can look at the relevant function in
> Modules/regexmodule.c, in the update_cache() function:
> 
> 
> /******/
> static PyObject *cache_pat;
> static PyObject *cache_prog;
> 
> static int
> update_cache(PyObject *pat)
> {
>         PyObject *tuple = Py_BuildValue("(O)", pat);
>         int status = 0;
> 
>         if (!tuple)
>                 return -1;
> 
>         if (pat != cache_pat) {
>                 Py_XDECREF(cache_pat);
>                 cache_pat = NULL;
>                 Py_XDECREF(cache_prog);
>                 cache_prog = regex_compile((PyObject *)NULL, tuple);
>                 if (cache_prog == NULL) {
>                         status = -1;
>                         goto finally;
>                 }
>                 cache_pat = pat;
>                 Py_INCREF(cache_pat);
>         }
>   finally:
>         Py_DECREF(tuple);
>         return status;
> }
> /******/
> 
> 
> Notice that there are some static variables here that keep state between
> calls.  The idea of update_cache is this: on every call to a regular
> expression matching function, Python uses update_cache() to check to see
> if it can reuse work that it's done on the very last regex call.  If the
> very last regular expression we used is the same object as the one we're
> using now (the code compares pointers), we reuse that regex object without
> recompiling the expression.
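
For readers who don't speak C, the same idea can be sketched in Python 3.
This is only an illustration of the single-entry cache, not a translation
of the real module: search() and compile_count are hypothetical additions,
the latter just to make the recompiles visible.

```python
import re

_cache_pat = None    # last pattern object seen (cache_pat in the C code)
_cache_prog = None   # its compiled form (cache_prog in the C code)
compile_count = 0    # hypothetical counter, only here for demonstration


def update_cache(pat):
    """Recompile only when pat is not the very last pattern object used."""
    global _cache_pat, _cache_prog, compile_count
    if pat is not _cache_pat:  # identity test, like the C pointer comparison
        _cache_pat = None
        _cache_prog = None
        _cache_prog = re.compile(pat)  # on error the cache is left empty
        _cache_pat = pat
        compile_count += 1
    return _cache_prog


def search(pat, text):
    """Hypothetical matching function that goes through the cache."""
    return update_cache(pat).search(text)


patt = r'&#([^;]+);'
for line in ['&#65;', 'no entity', '&#66;']:
    search(patt, line)     # compiled once, then reused inside the loop

search(r'\d+', '42')       # a different pattern evicts the cached one
search(patt, '&#67;')      # so the first pattern must be compiled again

print(compile_count)       # 3
```

The loop is the common case the cache was built for; alternating between
two patterns defeats it, since every call then triggers a recompile.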
> 
> 
> Sorry about diving into C code like this!  It's just that I thought that
> this optimization detail was cute: it covers the common case when we're
> only dealing with a single regular expression repeatedly in a loop.
> 
> 
> But even so, it apparently made more sense in later versions of Python to
> yank the cache out of the C code entirely, and to maintain it externally
> in the 'sre' Python module.
> 
> 
> 
> 
> 
> Good luck to you!
> 
> 

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************