[Python-Dev] Where to report performance of re? [was: (no subject)]

Fri Dec 29 05:25:05 EST 2017

Franklin? Lee writes:
 > On Tue, Dec 26, 2017 at 2:01 AM, Yogev Hendel <yogev at intsights.com> wrote:
 > >
 > > I don't know if this is the right place to put this,
 > > but I've found the following lines of code results in an incredibly long
 > > processing time.

 > (I think the correct place is python-list. python-dev is primarily for
 > the developers of Python itself.

I read this as a report of a performance problem, perhaps a
regression.  IMO YMMV, Python-list would be appropriate only if the
reporter was already aware that some regular expressions are expected
to be disastrously non-performant on some target strings, and wanted
help dealing with that.  That doesn't seem to be the case (and your
response also presumes ignorance of the performance properties of re).

 > Bug reports and feature requests belong in
 > https://bugs.python.org/ (where your post could also have gone).)

If you really meant that, you should have avoided rewarding him with a
wordy explanation. ;-)  N.B. I've bookmarked yours, it's excellent!
Thank you very much!

I think even in this case an experienced community member probably
should report on the tracker, though.  If all they got was, "WONTFIX:
that's how the algorithm works" there, *then* they go to python-list.

 > (assuming P != NP): Perl allows (introduced?) backreferences, which

I've been using them in Emacsen since 1987 or so.

 > are NP-hard[1]. Perl also added some other features which complicate
 > things, but backreferences are enough.
 > 
 > The user-level solution is to understand how regexes are executed, and
 > to work around it.

This is the Pythonic approach<0.5 wink/>, in the sense that we haven't
gone to the trouble of trying to improve the algorithm yet.

 > 2. In theory, we can use the textbook algorithm when possible, and the
 > backtracking algorithm when necessary. However, the textbook version
 > won't necessarily be faster, and may take more time to create, so
 > there's a tradeoff here.

There may be a question of the difficulty of maintenance, as well.
The linear time algorithms are somewhat more delicate, and we're still
occasionally discussing what semantics a regular expression should
have (null matches), let alone whether they're implemented correctly.