[New-bugs-announce] [issue2636] Regexp 2.6 (modifications to current re 2.2.2)
Jeffrey C. Jacobs
report at bugs.python.org
Tue Apr 15 13:57:52 CEST 2008
New submission from Jeffrey C. Jacobs <timehorse at users.sourceforge.net>:
I am working on adding features to the current Regexp implementation,
which is now set to 2.2.2. These features are to bring the Regexp code
closer in line with Perl 5.10 as well as add a few python-specific
niceties and potential speed-ups and clean-ups.
I will be posting regular patch updates to this thread when major
milestones have been reached, with a description of the feature(s) added.
Currently, the list of proposed changes is (in no particular order):
1) Fix issue 433030 (http://bugs.python.org/issue433030) by
adding support for Atomic Grouping and Possessive Quantifiers
2) Make named matches direct attributes of the match object; i.e.
instead of m.group('foo'), one will be able to write simply m.foo.
3) (maybe) Make Match objects subscriptable, such that m[n] is
equivalent to m.group(n), and allow slicing.
4) Implement Perl-style back-references including relative back-references.
5) Add a well-formed, python-specific comment modifier, e.g. (?P#...);
the difference between (?P#...) and Perl/Python's (?#...) is that the
former will allow nested parentheses as well as parenthetical escaping,
permitting patterns of the form '(?P# Evaluate (the following) expression,
3\) using some other technique)'. The (?P#...) form will interpret this
entire expression as a comment, whereas with (?#...) the comment ends at
the first ')' and everything from ' expression...' onward would be
considered part of the pattern to match.
(?P#...) will necessarily be slower than (?#...) and so should only be
used if richer commenting style is required but the verbose mode is not
6) Add official support for fast, non-repeating capture groups with the
Template option. Template is unofficially supported and disables all
repeat operators (*, + and ?). This would mainly consist of documenting
7) Modify the re compiled expression cache to better handle the
thrashing condition. Currently, when regular expressions are compiled,
the result is cached so that if the same expression is compiled again,
it is retrieved from the cache and no extra work has to be done. This
cache supports up to 100 entries. Once the 100th entry is reached, the
cache is cleared and a new compile must occur. The danger, albeit
rare, is that one may compile the 100th expression, which clears the
cache, and then have to recompile an expression that was compiled only
3 expressions ago, repeating the same work. By modifying this logic slightly, it
is possible to establish an arbitrary counter that gives a time stamp to
each compiled entry and instead of clearing the entire cache when it
reaches capacity, only eliminate the oldest half of the cache, keeping
the half that is more recent. This should limit the possibility of
thrashing to cases where a very large number of Regular Expressions are
continually recompiled. In addition to this, I will update the limit to
256 entries, meaning that the 128 most recent are kept.
8) Emacs/Perl style character classes, e.g. [:alphanum:]. For instance,
:alphanum: would not include the '_' in the character class.
9) C-Engine speed-ups. I am commenting and cleaning up the _sre.c Regexp
engine to make it flow more linearly, rather than with all the current
gotos, and replacing the switch-case statements with lookup tables, which
in tests have been shown to be faster. This will also include adding many
more comments to the C code in order to make it easier for future
developers to follow. These changes are subject to testing and some
modifications may not be included in the final release if they are shown
to be slower than the existing code. Also, a number of Macros are being
eliminated where appropriate.
10) Export any values not already shared between the Python code and
the C code, e.g. the default Maximum Repeat count (65536); this will
allow those constants to be changed in 1 central place.
11) Various other Perl 5.10 conformance modifications, TBD.
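To make item 7) concrete, here is a minimal sketch of the "drop the oldest half" idea in pure Python. It is not the patch itself: the class name HalfEvictingCache and its methods are hypothetical, and refreshing an entry's stamp on every cache hit is an assumption about what "more recent" should mean.

```python
import re


class HalfEvictingCache:
    """Hypothetical compile cache that evicts the oldest half when full."""

    def __init__(self, max_size=256):
        self.max_size = max_size
        self._entries = {}  # (pattern, flags) -> (stamp, compiled pattern)
        self._counter = 0   # arbitrary counter used as a time stamp

    def _next_stamp(self):
        self._counter += 1
        return self._counter

    def _evict_oldest_half(self):
        # Sort keys from oldest to newest stamp and drop the first half.
        by_age = sorted(self._entries, key=lambda k: self._entries[k][0])
        for key in by_age[:len(by_age) // 2]:
            del self._entries[key]

    def compile(self, pattern, flags=0):
        key = (pattern, flags)
        entry = self._entries.get(key)
        if entry is not None:
            compiled = entry[1]
        else:
            if len(self._entries) >= self.max_size:
                self._evict_oldest_half()
            compiled = re.compile(pattern, flags)
        # Assumption: a hit also refreshes the stamp, so recently used
        # expressions survive the next eviction.
        self._entries[key] = (self._next_stamp(), compiled)
        return compiled

    def __len__(self):
        return len(self._entries)

    def __contains__(self, key):
        return key in self._entries
```

With max_size=256, the compile that finds the cache full removes the 128 oldest entries and keeps the 128 most recent before adding the new one. Sorting only happens on eviction, so the common cache-hit path stays a dictionary lookup plus a counter increment.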
More items may come and suggestions are welcome.
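For discussion, the behaviour proposed in items 2) and 3) can be approximated today with a thin pure-Python wrapper. This is only a sketch of the intended semantics, not the planned C-level change: the MatchView class is hypothetical, and the slicing rule (slice over group 0 followed by the numbered groups) is an assumption, since the slicing semantics are still open.

```python
import re


class MatchView:
    """Hypothetical wrapper giving a match object attribute and index access."""

    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._match.group(name)  # m.foo -> m.group('foo')
        except IndexError:
            # group() raises IndexError for an unknown group name.
            raise AttributeError(name)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Assumption: slices run over (group 0, group 1, ..., group n).
            all_groups = (self._match.group(0),) + self._match.groups()
            return all_groups[index]
        return self._match.group(index)  # m[n] -> m.group(n)


m = MatchView(re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2008-04"))
print(m.year, m.month, m[0], m[1:])
```

One design question this exposes: a named group whose name collides with an existing attribute or method of the match object would be shadowed by that attribute, so m.group('name') would still be needed in such cases.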
Currently, I have code which implements 5) and 7), have done some work
on 10) and am almost done with 9). When 9) is complete, I will work on 1), some
of which, such as parsing, is already done, then probably 8) and 4)
because they should not require too much work -- 4) is parser-only
AFAICT. Then, I will attempt 2) and 3), though those will require
changes at the C-Code level. Then I will investigate what additional
elements of 11) I can easily implement. Finally, I will write
documentation for all of these features, including 6).
In a few days, I will provide a patch with my interim results and will
post updated patches as milestones are reached.
components: Library (Lib)
title: Regexp 2.6 (modifications to current re 2.2.2)
type: feature request
versions: Python 2.6
Tracker <report at bugs.python.org>