[issue2636] Regexp 2.7 (modifications to current re 2.2.2)

Jeffrey C. Jacobs report at bugs.python.org
Tue Mar 10 13:00:48 CET 2009


Jeffrey C. Jacobs <timehorse at users.sourceforge.net> added the comment:

Okay, as I said, Atomic Grouping, etc., off a recent 2.6 is already 
available and I can do any cleanups requested to those already 
mentioned, I just don't want to start any new items at the moment.  As 
it is, we are still over a year from any of this seeing the light of day 
as it's not going to be merged until we start 2.7 / 3.1 alpha.

Fortunately, I think Matthew here DOES have a lot of potential to have 
everything wrapped up by then, but I think to summarize everyone's 
concern, we really would like to be able to examine each change 
incrementally, rather than as a whole.  So, for the purposes of this, I 
would recommend that you, Matthew, make a version of your new engine 
WITHOUT any Atomic Group, variable length look behind / ahead 
assertions, reverse string scanning, positional, negated or scoped 
inline flags, group key indexing or any other feature described in the 
various issues, and that we then evaluate purely on the merits of the 
engine itself whether it is worth moving to that engine, and having made 
that decision officially move all work to that design if warranted.  
Personally, I'd like to see that 'pure' engine for myself and maybe we 
can all develop an appropriate benchmark suite to test it fairly against 
the existing engine.  I also think we should consider things like 
presentation (are all lines terminated by column 80), number of 
comments, and general readability.  IMHO, the current code is conformant 
in the line length, but VERY deficient WRT comments and readability, the 
later of which it sacrifices for speed (as well as being retrofitted for 
iteration rather than recursion).  I'm no fan of switch-case, but I 
found that by turning the various case statements into bite-sized 
functions and adding many, MANY comments, the code became MUCH more 
readable at the minor cost of speed.  As I think speed trumps 
readability (though not blindly), I abandoned my work on the engines, 
but do feel that if we are going to keep the old engine, I should try 
and adapt my comments to the old framework to make the current code a 
bit easier to understand since the framework is more or less the same 
code as in the existing engine, just re-arranged.

I think all of the things you've added to your engine, Matthew, can, 
with varying levels of difficulty be implemented in the existing Regexp 
Engine, though I'm not suggesting that we start that effort.  Simply, 
let's evaluate fairly whether your engine is worth the switch over.  
Personally, I think the engine has some potential -- though not much 
better than current WRT readability -- but we've only heard anecdotal 
evidence of it's superior speed.  Even if the engine isn't faster, 
developing speed benchmarks that fairly gage any potential new engine 
would be handy for the next person to have a great idea for a rewrite, 
so perhaps while you peruse the stripped down version of your engine, 
the rest of us can work on modifying regex_tests.py, test_re.py and 
re_tests.py in Lib/test specifically for the purpose of benchmarking.

If we can focus on just these two issues ('pure' engine and fair 
benchmarks) I think I can devote some time to the later as I've dealt a 
lot with benchmarking (WRT the compiler-cache) and test cases and hope 
to be a bit more active here.

----------
message_count: 69.0 -> 70.0

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________


More information about the Python-bugs-list mailing list