Regular Expressions, Speed, Python, and NFA
malik.a.rumi at gmail.com
Fri Apr 14 11:12:00 EDT 2017
I am running some tests using the site regex101 to figure out the correct regexs to use for a project. I was surprised at how slow it was, constantly needing to increase the timeouts. I went Googling for a reason, and solution, and found Russ Cox’s article from 2007: https://swtch.com/~rsc/regexp/regexp1.html . I couldn’t understand why, if this was even remotely correct, we don’t use NFA in Python, which led me here:
https://groups.google.com/forum/#!msg/comp.lang.python/L1ZFI_R2hAo/C12Nf3patWIJ;context-place=forum/comp.lang.python where all of these issues were addressed. Unfortunately, this is also from 2007.
BTW, John Machin in one of his replies cites Navarro’s paper, but that link is broken. Navarro’s work can now be found at http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.21.3112&rep=rep1&type=pdf But be forewarned, it is 68 pages of dense reading. I am not a computer science major. I am not new to Python, but I don’t think I’m qualified to take on the idea of creating a new NFA module for Python.
>Getting back to the "It would be nice ..." bit: yes, it would be nice
>to have even more smarts in re, but who's going to do it? It's not a
>"rainy Sunday afternoon" job :-)
>Well, just as an idea, there is a portable C library for this at
>http://laurikari.net/tre/ released under LGPL. If one is willing to
>give up PCRE extensions for speed, it might be worth the work to
>wrap this library using SWIG.
(BTW, this link is also old. TRE is now at https://github.com/laurikari/tre/ )
I am not a computer science major. I am not new to Python, but I don’t think I’m qualified to take on the idea of creating a new NFA module for Python. Nor am I entirely sure I want to try something new (to me) like TRE.
Most threads related to this topic are older than 2007. I did find this https://groups.google.com/forum/#!searchin/comp.lang.python/regex$20speed%7Csort:relevance/comp.lang.python/O7rUwVoD2t0/NYAQM0mUX7sJ from 2011 but I did not do an exhaustive search.
The bottom line is I wanted to know if anything has changed since 2007, and if there is a) any hope for improving regex speeds in Python, or b) some 3rd party module/library that is already out there and solves this problem? Or should I just take this advice?
>The cheap way in terms of programmer time is to pipe out to grep or
>awk on this one.
More information about the Python-list