[Python-Dev] New regex module for 3.2?

Georg Brandl g.brandl at gmx.net
Thu Jul 22 13:34:35 CEST 2010


On 13.07.2010 15:35, Antoine Pitrou wrote:
> On Tue, 13 Jul 2010 15:20:23 +0100
> Michael Foord <fuzzyman at voidspace.org.uk> wrote:
>> On 13/07/2010 15:17, Reid Kleckner wrote:
>> > On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlan<ncoghlan at gmail.com>  wrote:
>> >    
>> >> MRAB's module offers a superset of re's features rather than a subset
>> >> though, so once it has had more of a chance to bake on PyPI it may be
>> >> worth another look.
>> >>      
>> > I feel like the new module is designed to replace the current re
>> > module, and shouldn't need to spend time in PyPI.  A faster regex
>> > library isn't going to motivate users to add external dependencies to
>> > their projects.
>> >
>> >    
>> If the backwards compatibility issues can be addressed and MRAB is 
>> willing to remain as maintainer then the advantages seem well worth it 
>> to me.
> 
> To me as well. The code needs a full review before integrating, though.

FWIW, I've now run the Pygments test suite (Pygments has about 2500 regular
expressions that are exercised there) and only had two problems:

* Scoped flags: A few lexers use (?s) and similar flags at the end of
  the expression, which has no effect in regex currently.

* POSIX character classes: One regex used a class '[][:xyz]', so the [:
  was seen as the start of a POSIX character class.  I'm not sure how
  common this is, as most people seem to escape brackets in character
  classes.  Also, it gives a clear error on regex.compile(), not
  "mysterious" failures.
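
To illustrate the second point, here is a small sketch that uses only the
stdlib re module (not the new regex module).  The variable names are made up
for illustration.  In re, '[][:xyz]' is read as the set of literal characters
']', '[', ':', 'x', 'y' and 'z'; writing the brackets with explicit escapes
should give the same set without a bare '[:' sequence in the pattern.

```python
import re

# Unescaped form, as found in the lexer: in re, a ']' directly after the
# opening '[' is a literal member of the set.
unescaped = re.compile('[][:xyz]')

# Escaped form: the same characters, with both brackets escaped so the
# pattern contains no bare '[:' sequence.
escaped = re.compile(r'[\]\[:xyz]')

# Check each candidate character individually; 'a' is a control that
# should not be a member of either set.
candidates = '][:xyza'
matched_unescaped = [c for c in candidates if unescaped.fullmatch(c)]
matched_escaped = [c for c in candidates if escaped.fullmatch(c)]

print(matched_unescaped == matched_escaped)
```

If both lists agree, the escaped spelling is a drop-in replacement as far as
re is concerned; whether regex accepts it the same way would still need to be
checked against that module.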

Timings (seconds to run the test suite):

       run 1   run 2   run 3
re     26.689  26.015  26.008
regex  26.066  25.797  25.865

So at first it looked as if there was no difference in performance for this
use case (compiling a lot of regexes and matching most of them only a few
times).  However, the size of the regex cache turns out to matter a great
deal here: re._MAXCACHE defaults to 100, while regex._MAXCACHE is 1024.
When I set re._MAXCACHE to 1024 before running the test suite, I get times
of around 18 (!) seconds for re.
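
A minimal sketch of that adjustment follows.  It relies on re._MAXCACHE,
which is a private, undocumented attribute: its default differs between
Python versions and it may change or disappear.  The helper name
set_re_cache_limit is hypothetical, not part of re.

```python
import re


def set_re_cache_limit(limit):
    """Set re's compiled-pattern cache limit and return the old value.

    Assumption: the cache limit lives in the private attribute
    re._MAXCACHE.  This is an implementation detail, not public API.
    """
    previous = getattr(re, "_MAXCACHE", None)
    re._MAXCACHE = limit
    return previous


# Raise the limit before running code that compiles many distinct
# patterns through the module-level functions (re.match, re.search, ...).
previous_limit = set_re_cache_limit(1024)
print(previous_limit, re._MAXCACHE)
```

Code that keeps its own compiled pattern objects (the result of
re.compile()) and calls their methods directly does not depend on this
cache for those patterns.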

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.
