Search for a string in binary files

François Pinard pinard at
Tue Jul 22 23:07:06 CEST 2003


> One last question: does grep actually open files when it searches them?

I have not looked at the `grep' sources for a good while and might not
remember correctly, so read me with caution.  `grep' might try to `mmap' a
file when the file (and the underlying system) allows it, and there is
system overhead associated with that call, just as there is with `open'.

> And, would it be more efficient (faster) to just call grep from python to
> do the searching?

No doubt in my mind that running `grep' _instead_ of Python is more
efficient.  However, if Python is already started, doing the work from
within Python is more efficient than launching an external program such as
`grep', as there is non-negligible system overhead in doing so.  (Yet for
only a few files, launching `grep' is fast enough that the user would not
notice anyway.)
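To make the two options concrete, here is a minimal sketch, assuming a
POSIX system with `grep' on the PATH.  The function names are mine, chosen
for illustration, and nothing here has been timed.

```python
import subprocess


def contains_in_python(path, needle: bytes) -> bool:
    """Search inside the running interpreter: read the file, test for needle."""
    with open(path, "rb") as f:
        return needle in f.read()


def contains_via_grep(path, needle: bytes) -> bool:
    """Launch an external `grep' and use its exit status (0 means a match).

    Assumptions: POSIX `grep' on the PATH, and a needle without NUL bytes or
    newlines (a NUL cannot be passed as an argument, and -F would treat a
    newline as a separator between several patterns).
    """
    result = subprocess.run(
        ["grep", "-q", "-F", "-e", needle, "--", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode > 1:
        raise RuntimeError("grep failed with status %d" % result.returncode)
    return result.returncode == 0
```

Each call to `contains_via_grep' pays for starting a process, which is the
overhead mentioned above; `contains_in_python' pays only for the read.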

Still, there are special cases, unusual in practice, when `grep' might be
faster despite the overhead of calling it.  When the file is long enough,
and the string to be searched for meets some special conditions, the
Boyer-Moore algorithm might progressively beat the probably simpler search
technique used within `string.find'.  Yet if Python's `string.find' relies
on `strstr' in GNU `libc', it might be quite fast already.  The
implementation of such basic routines in `libc' has varied over time; at
one point at least, they were extremely well implemented for speed,
cleverly using bits of assembler here and there.  For `strstr' in
particular, there was once some good code from Stephen van den Berg.  I do
not know what `libc' uses nowadays, nor if Python takes advantage of it.
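For readers curious about where the gain comes from, here is a sketch of
the Horspool simplification of Boyer-Moore (not the full algorithm, and not
a claim about what `grep' or `libc' actually does).  The function name is
hypothetical.  In pure Python this will be slower than the built-in `find';
it only illustrates how a long needle lets the search skip ahead.

```python
def horspool_find(haystack: bytes, needle: bytes) -> int:
    """Return the lowest index of needle in haystack, or -1 (like bytes.find)."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Bad-character table: how far the window may slide, given the byte
    # currently aligned with the last position of the needle.  Bytes that
    # do not occur in needle[:-1] allow a full slide of m positions.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Slide by the table entry for the last byte of the current window.
        pos += shift[haystack[pos + m - 1]]
    return -1
```

The longer the needle and the fewer of the file's bytes it contains, the
larger the typical slide, which is why the advantage grows with file size.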

Finally, for huge files, proper reading in Python has to be done in chunks,
and the string to be searched for may happen to span chunks.  Doing it
properly might require more care than one might think at first.  But in
practice, on average, for reasonable files, staying in Python wins.
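One way to handle a needle that spans chunks is to carry the last
len(needle) - 1 bytes of each buffer over to the next read.  A minimal
sketch follows; the function name and the default chunk size are arbitrary
choices of mine.

```python
def find_in_file(path, needle: bytes, chunk_size: int = 1 << 20) -> int:
    """Return the byte offset of the first occurrence of needle, or -1."""
    if not needle:
        return 0
    keep = len(needle) - 1   # bytes to carry over between chunks
    tail = b""               # carried-over end of the previous buffer
    offset = 0               # file offset of the first byte of `tail'
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return -1
            buf = tail + chunk
            i = buf.find(needle)
            if i != -1:
                return offset + i
            # The carried tail is shorter than the needle, so it can never
            # hold a complete match by itself; only a spanning one.
            tail = buf[-keep:] if keep else b""
            offset += len(buf) - len(tail)
```

Only one chunk plus a short tail is held in memory at a time, whatever the
size of the file.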

François Pinard
