Scanning a file

Sat Oct 29 15:15:46 EDT 2005

"Mike Meyer" <mwm at mired.org> wrote in message 
news:864q70evci.fsf at bhuda.mired.org...
> "Paul Watson" <pwatson at redlinepy.com> writes:
...
> Did you do timings on it vs. mmap? Having to copy the data multiple
> times to deal with the overlap - thanks to strings being immutable -
> would seem to be a lose, and makes me wonder how it could be faster
> than mmap in general.

The only thing copied for each block is a string one byte shorter than 
the search string.
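
For readers following along, here is a minimal sketch of that kind of 
block-wise scan. It is not the scanfile.py timed in this post; the names 
count_matches, count_in_blocks, and block_size are made up for 
illustration, it assumes Python 3 bytes objects, and it assumes that 
overlapping matches should be counted. Each block is searched in place, 
and only a short "seam" around the block boundary is copied.

```python
import io


def count_matches(buf, pattern):
    """Count occurrences of pattern in buf, including overlapping ones."""
    count = 0
    pos = buf.find(pattern)
    while pos != -1:
        count += 1
        pos = buf.find(pattern, pos + 1)
    return count


def count_in_blocks(stream, pattern, block_size=1 << 20):
    """Count pattern in a binary stream, reading block_size bytes at a time.

    Hypothetical helper: the original scanfile.py is not shown in the post.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    keep = len(pattern) - 1  # bytes carried over between blocks
    tail = b""               # last `keep` bytes of the data read so far
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Matches that lie entirely inside this block.
        total += count_matches(block, pattern)
        if keep:
            # Matches that straddle the boundary with earlier data. The seam
            # has at most `keep` bytes from each side, so any match found
            # in it must cross the boundary and is not counted twice.
            total += count_matches(tail + block[:keep], pattern)
            if len(block) >= keep:
                tail = block[-keep:]
            else:
                tail = (tail + block)[-keep:]
    return total
```

Usage would look something like count_in_blocks(open("t.dat", "rb"), 
b"needle"), where the search string is whatever you are scanning for.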

I did not do due diligence with respect to timings.  Here is a small 
dataset (16 MiB), scanned first with sequential reads and then using mmap.

$ ls -lgG t.dat
-rw-r--r--  1 16777216 Oct 28 16:32 t.dat
$ time  ./scanfile.py
1048576
    0.80s real     0.64s user     0.15s system
$ time  ./scanfilemmap.py
1048576
   20.33s real     6.09s user    14.24s system

With a larger file, the system time skyrockets. I assume that is due to 
the paging mechanism in the OS.  This is Cygwin on Windows XP.

$ ls -lgG t2.dat
-rw-r--r--  1 268435456 Oct 28 16:33 t2.dat
$ time  ./scanfile.py
16777216
   28.85s real    16.37s user     0.93s system
$ time  ./scanfilemmap.py
16777216
  323.45s real    94.45s user   227.74s system
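
For comparison, an mmap-based count might look roughly like the sketch 
below. Again, this is not the scanfilemmap.py timed above; the name 
count_with_mmap is hypothetical, it assumes Python 3 (where mmap objects 
work as context managers), and it counts overlapping matches. The whole 
file is mapped read-only and searched with the mmap object's own find 
method, so no block bookkeeping is needed.

```python
import mmap


def count_with_mmap(path, pattern):
    """Count occurrences of pattern in the file at path using mmap.

    Hypothetical helper: the original scanfilemmap.py is not shown.
    Note that mapping an empty file raises an error on some platforms.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    total = 0
    with open(path, "rb") as f:
        # Length 0 maps the entire file; ACCESS_READ keeps it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos = m.find(pattern)
            while pos != -1:
                total += 1
                pos = m.find(pattern, pos + 1)
    return total
```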