Which is faster... find or re?

Hans Nowak wurmy at earthlink.net
Sat Aug 3 03:39:50 CEST 2002


Fearless Freep wrote:

> I'm doing some parsing on HTML files and looking for particular tags.
> 
> First off given a single line that I want to find a string in, would
> it be quicker to do
> 
> if string.find(line, searchString) > -1:
>     #process line
> 
> or 
> 
> result = re.compile (searchString).match(line)
> if result:

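string.find is usually faster than a regular expression when you're looking 
for a fixed substring. You shouldn't really use regexen unless you're looking 
for a pattern rather than a substring. Note also that match only tries to 
match at the start of the line; to get the same result as string.find you'd 
need search instead.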
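
To make the difference concrete, here is a minimal sketch that checks one line 
both ways and times them. The sample line and search string are made up for 
illustration, and it uses the str.find method because the string.find function 
no longer exists in Python 3. Treat the timings as machine-dependent; the point 
is only to show how you might measure it yourself.

```python
import re
import timeit

line = '<td><a href="index.html">Home</a></td>'   # hypothetical sample line
search_string = "<a href"                          # hypothetical search string

# Plain substring search: returns the index of the hit, or -1 if absent.
found_by_find = line.find(search_string) > -1

# Regex version; re.escape makes the string match literally.
pattern = re.compile(re.escape(search_string))
found_by_re = pattern.search(line) is not None     # scans the whole line
matched_at_start = pattern.match(line) is not None  # only tries position 0

# Time both approaches; absolute numbers depend on machine and Python version.
t_find = timeit.timeit(lambda: line.find(search_string) > -1, number=100000)
t_re = timeit.timeit(lambda: pattern.search(line) is not None, number=100000)
print("find: %.4fs  re: %.4fs" % (t_find, t_re))
```

Here search agrees with find, while match reports nothing because the line 
does not begin with the search string.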

> Now, expanding the question, which would probably be quicker.
> 
> for line in file.readlines():
>    if string.find (....
> 
> or 
> 
> fileContents = file.read()
> searchResults = re.compile (searchString).search(fileContents).
> 
> and then looping over searchResults

I don't think these two code snippets do the same thing, BTW. The first loops 
over all lines, and whenever it finds a certain string, it does something. The 
second searches all the data for a certain string and returns at most the first 
occurrence, not the others. You probably want re.findall here.
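
A small sketch of the difference, using made-up data and a made-up pattern: 
search hands back one match object (or None), while findall returns a list 
covering every occurrence.

```python
import re

data = "<p>one</p>\n<p>two</p>\n<p>three</p>\n"   # hypothetical file contents
pattern = re.compile(r"<p>(.*?)</p>")              # hypothetical pattern

first = pattern.search(data)     # a single match object, or None
every = pattern.findall(data)    # list with the captured group of each match

print(first.group(1))
print(every)
```

So "looping over searchResults" only makes sense with the findall result.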

I think that reading the whole file and then searching the bulk is faster, 
although I don't have any hard data or benchmarks to prove it. You might want 
to write a little benchmark yourself to see which one is faster. My bet is that

   data = f.read()
   results = re.findall(pattern, data)

is faster. I guess you'd have to use the re module here since the string module 
doesn't have a findall or something similar. Or use:

   x = string.find(data, s)
   while x > -1:
       ...do something...
       # resume just past the previous hit, or the loop never ends
       x = string.find(data, s, x + 1)
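
The same loop can be wrapped up as a small generator. This is only a sketch: 
find_all is a hypothetical helper name, the data is invented, and it uses the 
str.find method (Python 3 dropped the string.find function). The important 
part is the start offset, which makes each search resume after the previous 
hit.

```python
import re

def find_all(data, s):
    """Yield the index of every occurrence of s in data (overlaps included)."""
    x = data.find(s)
    while x > -1:
        yield x
        # Resume one character past the previous hit.
        x = data.find(s, x + 1)

data = "<b>bold</b> and <b>more</b>"      # hypothetical input
positions = list(find_all(data, "<b>"))

# For comparison, the same positions via the re module.
re_positions = [m.start() for m in re.finditer(re.escape("<b>"), data)]

print(positions)
print(re_positions)
```

Stepping by x + 1 also reports overlapping occurrences; step by x + len(s) 
instead if you only want non-overlapping ones, which is what the re functions 
give you.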

If you do use the regex, don't forget to compile it once, outside the loop, 
before using it. That's much faster than passing the pattern string on every 
call.
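
Here is a rough sketch of the "little benchmark" idea, with the regex compiled 
once up front. The data, the pattern, and the function names are all invented 
for illustration, and the numbers will vary with machine and Python version. 
Current versions of the re module also cache recently used patterns, so the 
gain from compiling yourself may be smaller than it used to be.

```python
import re
import timeit

# Hypothetical test data: 10,000 short lines, every 100th one holding an img tag.
lines = ["<p>filler %d</p>\n" % i if i % 100 else '<img src="%d.png">\n' % i
         for i in range(10000)]
data = "".join(lines)

tag = re.compile(r"<img\b[^>]*>")   # compiled once, outside any loop

def line_by_line():
    # Substring test on each line in turn.
    return [line for line in lines if line.find("<img") > -1]

def bulk():
    # One regex pass over the whole contents.
    return tag.findall(data)

n_lines = len(line_by_line())
n_bulk = len(bulk())

print("line by line:", timeit.timeit(line_by_line, number=20))
print("bulk findall:", timeit.timeit(bulk, number=20))
```

Both approaches should report the same number of hits; which one wins on time 
is what the benchmark is there to tell you.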

HTH,

-- 
Hans (base64.decodestring('d3VybXlAZWFydGhsaW5rLm5ldA=='))
# decode for email address ;-)
The Pythonic Quarter:: http://www.awaretek.com/nowak/



