Regular Expressions, Speed, Python, and NFA
rgaddi at highlandtechnology.invalid
Fri Apr 14 13:21:33 EDT 2017
On 04/14/2017 08:12 AM, Malik Rumi wrote:
> I am running some tests using the site regex101 to figure out the correct regexs to use for a project. I was surprised at how slow it was, constantly needing to increase the timeouts. I went Googling for a reason, and solution, and found Russ Cox’s article from 2007: https://swtch.com/~rsc/regexp/regexp1.html . I couldn’t understand why, if this was even remotely correct, we don’t use NFA in Python, which led me here:
> https://groups.google.com/forum/#!msg/comp.lang.python/L1ZFI_R2hAo/C12Nf3patWIJ;context-place=forum/comp.lang.python where all of these issues were addressed. Unfortunately, this is also from 2007.
> BTW, John Machin in one of his replies cites Navarro’s paper, but that link is broken. Navarro’s work can now be found at http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.21.3112&rep=rep1&type=pdf But be forewarned, it is 68 pages of dense reading. I am not a computer science major. I am not new to Python, but I don’t think I’m qualified to take on the idea of creating a new NFA module for Python.
>> Getting back to the "It would be nice ..." bit: yes, it would be nice
>> to have even more smarts in re, but who's going to do it? It's not a
>> "rainy Sunday afternoon" job :-)
>> Well, just as an idea, there is a portable C library for this at
>> http://laurikari.net/tre/ released under LGPL. If one is willing to
>> give up PCRE extensions for speed, it might be worth the work to
>> wrap this library using SWIG.
>> Kirk Sluder
> (BTW, this link is also old. TRE is now at https://github.com/laurikari/tre/ )
> I am not a computer science major. I am not new to Python, but I don’t think I’m qualified to take on the idea of creating a new NFA module for Python. Nor am I entirely sure I want to try something new (to me) like TRE.
> Most threads related to this topic are older than 2007. I did find this https://groups.google.com/forum/#!searchin/comp.lang.python/regex$20speed%7Csort:relevance/comp.lang.python/O7rUwVoD2t0/NYAQM0mUX7sJ from 2011 but I did not do an exhaustive search.
> The bottom line is I wanted to know if anything has changed since 2007, and if there is a) any hope for improving regex speeds in Python, or b) some 3rd party module/library that is already out there and solves this problem? Or should I just take this advice?
>> The cheap way in terms of programmer time is to pipe out to grep or
>> awk on this one.
>> Kirk Sluder
I'll also throw in the obligatory quote from Jamie Zawinsky, "Some
people, when confronted with a problem, think 'I know, I'll use regular
expressions.' Now they have two problems."
It's not that regexes are the wrong tool for any job; I personally use
them all the time. But they're, tautologically, the wrong tool for any
job that can be done better with a different one. In Python, you've got
"in", .startswith, .endswith, and .split to handle simple parsing tasks.
On the other end, you've got lxml and the like to handle complex tasks
that provably cannot be done with regexes at all, let alone efficiently.
This leaves them in a fairly bounded middle ground, where is my task so
complex that it warrants something as difficult to read as a regex, but
still simple enough to be solved by one. And that ground is definitely
occupied. But not vast. So is it worth the time to try to write a more
efficient regex parser for Python? Yours if you want it to be, but not
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
More information about the Python-list