Why is re.search() so much faster than re.sub() when there are no matches?

Wed May 16 01:46:10 EDT 2001

[News]
> I don't understand why re.sub() is so slow if no substitutions are done:
> The first loop in Active Python build 203 on Windows 2000 takes
> 1.26 seconds and the second loop takes 49.3 seconds.

Bizarre.  0.74 vs 7.2 seconds for me (Win98SE -- the king of high-performance
operating systems <wink>).

> That's a huge difference.

Well, most of it's your doing, but the rest isn't.  Read on.

> I would have thought that sub() must do a regular expression search()
> internally to see if there is anything to substitute,

Yes.

> and don't see why I can make it 39 times faster

Me neither.

> by explicitly doing the search first instead of letting re.sub() do it.

But that's not what you did below:

> import re
> line = "fsfsaf sf saf sdafsfsadf sadfdsafsadfdsafsf fdsf sf sd f s f " \
>    "sf saf safsfffff sdfsadf  f  sadf sa"
> pattern = re.compile(r"\bword\b")
> for i in range(1,100000):
>     if pattern.search("line"):

Note that you're not searching the variable line here, you're searching the
4-character string "line".  So of course this loop is going to run much
faster than the next one (it's searching a much smaller string).

>         line = pattern.sub("new word", line)
> for i in range(1,100000):
>     line = pattern.sub("new word", line)

The rest of it has a deeper explanation:  pattern.search is implemented in C,
but pattern.sub is still implemented in Python.  Once you repair your first
loop to search the same string the second loop works on, the difference
remaining is due to the overhead of making 100,000 Python-level function
calls.