re.search much slower than grep on some regular expressions

John Machin sjmachin at lexicon.net
Wed Jul 9 10:48:54 CEST 2008


On Jul 9, 2:01 am, Kris Kennaway <k... at FreeBSD.org> wrote:
> samwyse wrote:
> > On Jul 4, 6:43 am, Henning_Thornblad <Henning.Thornb... at gmail.com>
> > wrote:
> >> What can be the cause of the large difference between re.search and
> >> grep?
>
> >> While doing a simple grep:
> >> grep '[^ "=]*/' input                  (input contains 156.000 a in
> >> one row)
> >> doesn't even take a second.
>
> >> Is this a bug in python?
>
> > You might want to look at Plex.
> > > http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
>
> > "Another advantage of Plex is that it compiles all of the regular
> > expressions into a single DFA. Once that's done, the input can be
> > processed in a time proportional to the number of characters to be
> > scanned, and independent of the number or complexity of the regular
> > expressions. Python's existing regular expression matchers do not have
> > this property. "
>
> > I haven't tested this, but I think it would do what you want:
>
> > from Plex import *
> > lexicon = Lexicon([
> >     (Rep(AnyBut(' "='))+Str('/'),  TEXT),
> >     (AnyBut('\n'), IGNORE),
> > ])
> > filename = "my_file.txt"
> > f = open(filename, "r")
> > scanner = Scanner(lexicon, f, filename)
> > while 1:
> >     token = scanner.read()
> >     print token
> >     if token[0] is None:
> >         break
>
> Hmm, unfortunately it's still orders of magnitude slower than grep in my
> own application that involves matching lots of strings and regexps
> against large files (I killed it after 400 seconds, compared to 1.5 for
> grep), and that's leaving aside the much longer compilation time (over a
> minute).  If the matching was fast then I could possibly pickle the
> lexer though (but it's not).
>

Can you give us some examples of the kinds of patterns that you are
using in practice and that are slow with Python's re module? How large
is "large"? What kind of text are you matching against?
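
For comparison, here is a rough sketch of how a case like this can be
timed, using the pattern from Henning's original post on a line that
contains no '/'. The function name time_search and the line lengths are
my own choices for illustration, and I'd only expect the time to grow
much faster than the line length, not any particular figures:

```python
import re
from timeit import default_timer

# Pattern from the original post: a run of characters other than
# space, double quote or '=', followed by a slash.
PATTERN = re.compile(r'[^ "=]*/')


def time_search(n):
    """Return (match, seconds) for searching a line of n 'a' characters."""
    line = "a" * n  # no '/' anywhere, so the search has to fail
    start = default_timer()
    match = PATTERN.search(line)
    return match, default_timer() - start


if __name__ == "__main__":
    # Hypothetical sizes, kept small so the run finishes quickly.
    for n in (1000, 2000, 4000):
        match, seconds = time_search(n)
        print("%d chars: match=%r, %.4f s" % (n, match, seconds))
```

Timings for your real patterns and data, gathered the same way, would
make it much easier to see where the time is going.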

Instead of grep, you might like to try nrgrep: search for "nrgrep
Navarro Raffinot". There is a PDF paper about it on Citeseer (if it's
up), and a PostScript paper and the C source can be found from Gonzalo
Navarro's home page.

Cheers,
John


