[PYTHON DOC-SIG] New regex module (long)

Robin Friedrich friedric@phoenix.net
Sat, 19 Oct 1996 11:27:57 -0500

Ka-Ping Yee wrote in reply to email from me:
> Hi, Robin.  Look out -- this is going to be long.
> > I believe, as do others, that the Python regex module
> > needs a serious upgrade.
> Yup.  I agree.
> I mentioned to Guido a little while ago that a "text processing SIG"
> would be a good idea, especially as the DOC-SIG starts to use and need
> text-processing features more extensively.
> > Ping's patches have helped me on occasion but I
> > can't code with them in mind if I want my stuff to be used by others. So
> > I will propose a development SIG at the workshop to come up with a plug
> > replacement for regexmodule.c such that Guido will be tempted to drop it
> > in by 1.5 (2.0 failing that).
> Actually, i've already been working on exactly such a module for a
> little while now (ever since making those patches) -- i'm trying to
> do a complete rewrite of a regular expression module.  I haven't told
> anybody about it yet, because i wanted to get something working and
> to actually run some tests and speed comparisons to see if it's any
> good.  It isn't complete, and i'm still learning as i go (this is the
> first time i've tried to code regex functionality), but it is coming
> along reasonably well and i have lots of ideas that i am trying to
> incorporate.
> But, since we want the same thing in the end, i might as well just stop
> keeping secrets and spill all the beans to you in the hope that i can
> at least make a useful contribution of concepts.  This letter will try
> to be a summary of all the ideas i've come up with so far.
> My main goals (which seem to fit quite nicely with yours) are:
>   (a) no global state (complete thread safety)
>   (b) almost-perfect compatibility with the existing regex module
>   (c) speed
>   (d) extra syntax features
> (The ordering is not intended to express priority.)
> I would like this module to supplant both regexmodule.c and pregex
> to a large degree, although it cannot entirely replace pregex for
> the reasons explained in (b).
> (a) no global state
> There is an apparent contradiction between (a) and the ability to choose
> variant syntaxes.  To deal with this, i've come up with a scheme where
> the most commonly-used syntax tables (currently perl, emacs, egrep,
> grep, awk, posix) are held static and referenced, but if the user
> requests to
> change individual flags, a new syntax table is created and kept within
> the compiled regex structure.
> (My term for "compiled regex" is "pat", for "pattern".  I now use "prog"
> to mean just the program-bytecode part of the "pat" structure; a "pat"
> will contain other things like the syntax table and optimization data
> like a fastmap, etc.)
> (b) almost-perfect compatibility with the existing regex module
> "Almost-perfect" here means as close as possible, but i intend to fix
> obvious bugs in the implementation.  For example, as i pointed out on
> the Python list a little while ago, backreference matching does not
> backtrack correctly.  I think this should be fixed.  However, i feel
> that the new module should be made as compatible as possible in other
> respects, for if it isn't a "drop-in" replacement, it probably won't
> make it into Python version N.
> > 1) Should this be based on or incorporate the pregexmodule that's out
> > there? Or start from scratch with a non-POSIX approach?
> For me, this is a compatibility issue.  As Jeffrey Friedl explained on
> the Python list, there are two quite different ways to select which
> match to return.  There is the POSIX way -- the longest overall match --
> and there is the way that the current regex module (along with Perl,
> Tcl, and others) does it, which is summarized in the Tcl manual by:
>     1.  If a regular expression could match two different parts of an
>         input string, then it will match the one that begins earliest.
>     2.  If a regular expression contains | operators then the leftmost
>         matching subexpression is chosen.
>     3.  In *, +, and ? constructs, longer matches are chosen in
>         preference to shorter ones.
>     4.  In sequences of expression components, the components are
>         considered from left to right.
> in order of precedence.  I feel that this latter is the way to go.
> Although it is not POSIX compliant, it is the way programmers are
> used to dealing with regular expressions in most scripting languages
> (including current Python) and is somewhat more predictable than the
> POSIX leftmost-longest rule.  Because the two schemes can yield such
> different results, i think it's necessary to do things the latter way
> to maintain compatibility with the existing regex module.
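The practical difference between the two rules shows up even on tiny patterns. A sketch, assuming an engine that follows the Tcl/Perl-style rules above (illustrated with Python-style calls):

```python
import re  # assumed: an engine following the leftmost-alternative rule

# Rule 2: the leftmost alternative that can match is chosen, so "a|ab"
# matches only "a" -- even though "ab" is a longer overall match.
m = re.match(r"a|ab", "ab")
# A POSIX leftmost-longest engine would return "ab" here instead.
```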
> (c) speed!
> > 2) I feel performance vs. Perl is an issue.  How far should we
> > strive in terms of speed?
> I agree that performance versus Perl is an important issue.  Although
> we are not at war, it is true that Python and Perl can be considered
> "competing" languages, and a big way to find new Python users is to
> show Perl users what they can gain without having to admit a great
> deal that they will lose.  I don't like to be ashamed of Python.  So,
> how far should we strive?
> Pretty far, i think.  My current timings show that the Python 1.4b3
> regex module handles "ordinary" (i.e. non-pathological; what i consider
> to be commonly-used kinds) regular expressions approximately 6 times
> slower than Perl does.  This is a serious hit for programmers (CGI among
> others), and so serious that i am almost afraid to tell anyone for fear
> it will make them quit Python.  Worse, i found that pregex is 2.5 times
> slower than *that*, which amounts to some 15 times slower than Perl.
> I don't intend to offend the implementors of these modules, but i do
> feel that a speed improvement is needed.  And because of the room for
> improvement, people will get more excited if we say regexes are e.g.
> "3 times faster now" rather than "only half as fast as Perl's". :)
> How can this be accomplished?  I haven't looked much at the Perl source.
> But here are some of my own thoughts on speed/efficiency issues:
>     1.  A survey of a few scripts led me to discover that by far the
>         majority of regular expressions actually represent *very*
>         *simple* things and often don't require the full power of a
>         regular expression (well, according to my survey, anyway).
>         This means that there are lots of opportunities for optimization
>         everywhere.  I think it makes sense to target the most common
>         idioms in regular expressions and make them as fast as possible
>         (even at the expense of expressions that fall outside these
>         special cases).
>         My way of doing this is by introducing a whole host of extra
>         opcodes for very specific purposes, to reduce cycles through
>         the bytecode interpreter and give us more chances to run loops
>         in C.  For example, possibly the single most common idiom in
>         Perl regular expressions is "\s*" (slurp whitespace).  There
>         is almost never any need for backtracking because the next
>         thing in the regex is almost always non-whitespace.  So this
>         can be replaced with a single opcode which makes a C routine
>         skip all available whitespace, fast.
>         This particular optimization method can be generalized: any
>         character class followed by a maximal-matching operator can
>         be converted into a single loop if the set of possible matches
>         following it does not intersect that character class.
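The special-opcode idea can be sketched in a few lines of illustrative Python (the real thing would be a tight C loop inside the bytecode interpreter; `skip_run` is a hypothetical name, not an existing API):

```python
def skip_run(text, pos, charset):
    """One 'opcode': advance pos past a maximal run of characters in
    charset -- the loop that e.g. "\\s*" would compile into.
    Committing without backtracking is safe whenever whatever follows
    in the pattern cannot itself match a character in charset."""
    n = len(text)
    while pos < n and text[pos] in charset:
        pos += 1
    return pos
```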
>     2.  With the similar goal of keeping execution in C as much as
>         possible, we can also incorporate extra regex features.
>         This will be dealt with below, but i wanted to point out that
>         it is also an important speed issue.  A single new feature,
>         which activates a compact and optimized C routine, can be
>         tremendously faster than the equivalent, complicated regular
> expression that would otherwise have had to be written
> without the feature.
>     3.  One of the biggest things that bothers me to no end about
>         Python is the immutability of strings.  It cripples you.
>         You can't pass around large files.  You can't cut up and
>         append long strings.  Every time you try, you have to wait
>         for the entire string to get duplicated.  This is awful,
>         and it means quadratic times for linear algorithms.
>         So, one of my ideas has been to build regex functionality
>         into a new "mutable string" class which is capable of faster
>         appends and slices.  For programs which chew through files
>         and want to assemble the results piece by piece, i think this
>         could make a huge difference.

I think a new mutable string type could be very powerful as well.
Maybe exploring synergy with the new array module could bear fruit?
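The quadratic behaviour is easy to see: each `+` on an immutable string copies everything built so far. A sketch of the usual workaround (accumulate pieces in a list, join once at the end):

```python
def build_quadratic(chunks):
    out = ""
    for piece in chunks:
        out = out + piece   # copies all of out every time: O(n**2) total
    return out

def build_linear(chunks):
    pieces = []
    for piece in chunks:
        pieces.append(piece)   # amortized O(1) per piece
    return "".join(pieces)     # one O(n) pass over everything
```

A true mutable string type would give the second shape without the final join step.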
>     4.  The regsub module does exactly that: chews through a string
>         and assembles the results piece by piece.  Because of the
>         immutability of strings, it's quadratic (arrghh!) which makes
> it extremely slow for large strings.  (And global substitution on
> *big* strings is one of the abilities in which Perl excels.)
>         Using mutable strings in regsub would speed it up a lot, but i
>         think the way to go is to do regsub in C once and for all.
> (d) i.  extra regex syntax features
> I have a few ideas on this front, too.  One of the most common
> comments i hear with regard to regular expressions goes something
> like, "yeah, regular expressions are pretty powerful, but they
> can't do everything.  For example, you can't do balanced matching
> in a regular expression."
> Well, why not?  Even something as simple as a new metacharacter:
>     \=xy    immediately match an "x" and skip until the balancing "y"
> commonly used, perhaps, as \=() or \=[] or \={}, would be useful.
> A single use of this in a regular expression (implemented in a quick
> C loop) would easily beat any kind of scan/count/slice loop in Python
> -- or even Perl -- hands down.
Yes, another good idea, but a lot of thought will be needed on this one.

> (As a side note, i now realize that '=' isn't a very good choice since
> it is a non-word character and it is nice to rely on all non-word
> characters being literal when they are backslashed.  So we should pick
> an appropriate letter of the alphabet, say -- \m() for "matching"?)
> I added \l (letter) and \h (hex digit) to my regex patch because i
> find them commonly used and useful, too.  They may be worth considering.
> The interpretation of class metacharacters like \w, \s, etc.
> within [bracketed] classes is also a great timesaver and should be
> on the list.
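The balanced-matching primitive proposed above could be prototyped as a simple depth-counting scan (illustrative Python; `match_balanced` is a hypothetical helper -- the real feature would live inside the regex engine as a single opcode):

```python
def match_balanced(text, pos, open_ch, close_ch):
    """Sketch of the proposed \\m() metacharacter: require open_ch at
    pos, then scan forward to the balancing close_ch, honouring
    nesting.  Returns the index just past the balancing close_ch,
    or -1 if there is no match."""
    if pos >= len(text) or text[pos] != open_ch:
        return -1
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == open_ch:
            depth += 1
        elif text[i] == close_ch:
            depth -= 1
            if depth == 0:
                return i + 1
    return -1
```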
> > 3) What are the perl5 features which might come in handy?
> Minimal matching with ??, +?, and *? was the most-immediately-useful
> and easiest-to-implement feature, so i did that in my regex patch.
> This is a prime candidate.  Also very good, as Tom Christiansen
> recently pointed out, are the negative look-ahead assertions (?!...).
> Perl purists may complain that we are copying or stealing their syntax,
> but i think there is no reason why we should confuse people who already
> know a perfectly good way of doing these things by making up our own.

Tom also mentioned that Larry closely examined Guido's object system
when implementing the perl5 object support, so I don't feel guilty at
all if we borrow as much good regex stuff as possible from perl5.
We will ignore any spurious Perl-luddite comments.
(Just tell them imitation is the sincerest form of flattery and leave
it at that. ;-)
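For reference, a sketch of the two Perl 5 features just mentioned, assuming a regex engine that adopts Perl's syntax for them (`re` here stands in for whichever module ends up exposing the engine):

```python
import re  # stands in for whichever module ends up exposing the engine

# Minimal matching: *? prefers the shortest match, so the group stops
# at the first ">" instead of swallowing "first> and <second".
m = re.search(r"<(.*?)>", "<first> and <second>")

# Negative look-ahead: match "foo" only where it is NOT followed by
# "bar", without consuming the characters looked at.
hits = re.findall(r"foo(?!bar)", "foobar foobaz food")
```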

> (d) ii. extra syntax features on strings in general
> Despite possible cries from Python users about it "looking like Perl"
> or "looking like Tcl", i think Python needs string interpolation.
> Although "foo %(spam)s bar" % vars() may be adequate, it's just too
> clumsy for people who are doing text processing.  It may increase
> the amount of typing you have to do by 30-50%, and also increase the
> opportunities for mistakes.
> I'm currently thinking of a string interpolation scheme which is
> slightly more powerful than Perl's.  Python's consistency lets me
> describe it in just two rules.  For conciseness, i'll describe
> the syntax using regular expressions with the above extensions, and
> include rxb-like expressions for readability:
>     1.  $\w+(\m[]|\m()|\.\w+|)*
>         '$' + some(wordchars) + some(
>             either(balanced('[]'), balanced('()'),
>                    '.' + some(wordchars)))
>     2.  $\m{}
>         '$' + balanced('{}')
Dollar signs send shivers down my spine, but carry on.

> Either of the above things, when found, is replaced with its
> value when evaluated as a Python expression and then converted
> to a string with 'str' (not 'repr').  In all cases when you
> want to use a literal "[", "(", or "." immediately following
> an interpolation, you can just protect the interpolation with {}
> and nothing changes because it's already an expression.  To get
> a '$' you use '$$'.  These rules make it possible to write:
>     a1. "Here is a $string."
>     a2. "Here is a $module.member."
>     a3. "Here is an $object.member."
>     a4. "Here is a $functioncall(with, arguments)."
>     a5. "Here is an ${arbitrary + expression}."
>     a6. "Here is an $array[3] member."
>     a7. "Here is a $dictionary['foo'] member."
> This is, i think, about as compact as you can get and in my opinion
> very lucid.  The following are some comparisons, with typing length
> differences in parentheses (always relative to a1-a7 above).  The
> Perl equivalents are:
>     b1. "Here is a $string."  (+0)
>     b2. "Here is a $module::member."  (+1)
>     b3. "Here is an $object->{member}."  (+3)
>     b4. "Here is a " . &functioncall($with, $arguments) . "."  (+10)
>     b5. "Here is an " . ($arbitrary + $expression) . "."  (+9)
>     b6. "Here is an $array[3] member."  (+0)
>     b7. "Here is a $dictionary{foo} member."  (-2)
> In Perl 5 you can also do
>     c4. "Here is an @{[&functioncall($with, $arguments)]}."  (+8)
>     c5. "Here is an ${[$arbitrary + $expression]}."  (+4)
> Finally, the current-Python equivalents, for comparison, are:
>     d1. "Here is a " + string + "."  (+7)
>     e1. "Here is a %s." % string  (+4)
>     f1. "Here is a %(string)s." % vars()  (+12)
>     d2. "Here is a " + str(module.member) + "." (+12)
>     e2. "Here is a %s." % module.member  (+4)
>     e3. "Here is an %s." % object.member  (+4)
>     e4. "Here is a %s." % functioncall(with, arguments)  (+4)
>     e5. "Here is an %s." % (arbitrary + expression)  (+4)
>     e6. "Here is an %s member." % array[3]  (+4)
>     e7. "Here is a %s member." % dictionary['foo']  (+4)
> Clearly, typing is not the be-all and end-all of a programming
> language!  But since the above data were easy to collect, i
> thought i'd present them.  You can see that my proposed syntax
> for interpolation wins almost every time (and the design of
> Python should take more credit for that than i do).  Moreover,
> i feel that it is a lot cleaner and more readable than either
> of the current alternatives in Python or Perl.  (I am ignoring
> Tcl on purpose, as it is even far clumsier at this.)

I think this will be the most controversial of your proposals,
but well worth the debate. I'll include it as a definite topic
at the workshop.
> So, given the syntax, what is the implementation?
> Trying to force this into anything internal would just not
> work -- at least not initially.  I don't intend to change
> any Python syntax; i was thinking of putting this functionality
> into an "interpolation" module (just in Python to begin with,
> but later in a compiled C module).
> What i had in mind was a class "Itpl" which, much like the
> rxb.Pattern class, holds only a string, but gives it special
> meaning just by virtue of its type.  Itpl.__repr__ would
> actually perform the interpolation.  So a casual user could:
>     print Itpl("Here is a $string.")
> Or the Itpl module could export a "printpl" function as shorthand:
>     printpl("Here is a $string.")
> Why encapsulate this behaviour in a class?  It means we can
> delay the interpolation from happening until the __repr__
> is actually requested.  When __repr__ is called, it should
> evaluate the expressions in its caller's local and global
> namespaces.
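A minimal sketch of such a class in plain Python (implemented on __str__ here, and handling only the simple $name and ${expression} forms -- the attribute, call, and index cases from rule 1 are omitted; names are illustrative, not a finished module):

```python
import re
import sys

class Itpl:
    """Toy interpolation: $name, ${expression}, and $$ for a literal $."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # Evaluate in the *caller's* local and global namespaces,
        # as the proposal describes.
        frame = sys._getframe(1)

        def substitute(match):
            expr = match.group(1)
            if expr is None:
                expr = match.group(2)
            return str(eval(expr, frame.f_globals, frame.f_locals))

        text = self.text.replace("$$", "\0")   # protect literal dollars
        text = re.sub(r"\$\{([^}]*)\}|\$(\w+)", substitute, text)
        return text.replace("\0", "$")
```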
> This is relevant to our current discussion because it means
> an Itpl instance could be passed as the replacement argument
> to a regsub, which makes it possible to evaluate real
> expressions for each match to yield the substitution string.
> (The regsub function would then simply define names in the
> Itpl's global name space to represent matching subgroups.)
> This gives us Perl's /e option for evaluating expressions
> during regex substitutions, which is very powerful functionality
> that definitely should be available in Python.  Additionally,
> regsub should probably also accept a Python function object as
> the replacement argument, which it calls on each match.

I agree (although in the function-object case performance will
suffer, so it shouldn't be relied on too heavily).
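The function-as-replacement idea, sketched assuming a substitution API that accepts a callable and invokes it once per match (this is the Perl /e behaviour in Python clothing):

```python
import re  # assumes a substitution API that accepts a callable

def double(match):
    # Called once per match; the return value is spliced into the result.
    return str(int(match.group(0)) * 2)

result = re.sub(r"\d+", double, "3 apples and 5 pears")
# -> "6 apples and 10 pears"
```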
> -- Anyway, so there are all of my ideas i've been collecting
> so far on improving text processing in Python.  Combining
> everything in d(i) and d(ii) would give Python a great deal of
> text-processing power, and just might even place it in a
> position to seriously compete with Perl for a larger part of
> this application area.
> > 4) Who would be the best people to work on this?
> > 5) How much work do you see for such an effort?
> A lot of work, probably?  I don't know exactly.  These questions
> are probably the hardest to answer.
> I wanted to keep working on this without committing to it
> in public because i'm not very sure about how much time i'll
> be able to devote to it.  I think this project would have a
> great deal of potential, though, and i'm very interested.
> Actually, i already feel better about it now that i've let
> go with all these ideas and contributed something.
> Okay, well there it is.  I've been at this letter a *lot* longer
> than i'd planned, so i'd better go eat something.  Let me know
> what you think of these ideas, and also if you have issues in
> mind that i've missed.  (Even if you don't have time to respond to
> everything, i'd rather hear your thoughts in pieces.)
> Thanks!
> Ping
>                                  Developer, Alias|Wavefront Inc. (Tokyo)
> http://www.lfw.org/math/ brings math to the Web as easy as <se>?pi?</se>

Ok guys, what do you think? I'll be recruiting for contributions of
ideas and coding on the newsgroup as well as at the workshop. I foresee
a major thread in c.l.p, and then we'll probably move it to a TEXT-SIG.
I know I'm tired of having to answer people about why we should use
Python when Perl is so much better at text processing. I'd love to
remove that thorn from Python's side!  Python has too much going for it
to be dismissed with one easy criticism.

-Robin Friedrich

DOC-SIG  - SIG for the Python Documentation Project

send messages to: doc-sig@python.org
administrivia to: doc-sig-request@python.org