regex over files

Thu Apr 28 01:34:51 EDT 2005

On Wed, 27 Apr 2005 21:39:45 -0500, Skip Montanaro <skip at pobox.com> wrote:

>
>    Robin> I implemented a simple scanning algorithm in two ways. First buffered scan 
>    Robin> tscan0.py; second mmapped scan tscan1.py.
>
>    ...
>
>    Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
>    Robin> len=139583265 w=103 time=110.91
>
>    Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
>    Robin> len=139583265 w=103 time=140.53
>
>I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
>some systems files opened for reading are mmap'd under the covers.  I'm sure
>it's highly platform-dependent.  (My results on MacOSX - see below - are
>somewhat better.)
>
>Let me return to your original problem though, doing regex operations on
>files.  I modified your two scripts slightly:
>
>tscan0.py:
>
>    import sys, time, re
>    fn = sys.argv[1]
>    f=open(fn,'rb')
>    n=0
>    t0 = time.time()
>    while 1:
>         buf = f.read(4096)
>         if not buf: break
>         for i in re.split("XXXXX", buf):
To be fairer, I think you'd want to hoist the regex compilation out of the loop.
But also to be fairer, the buffered version should include the overhead of
splitting correctly across buffer boundaries, at least for the simple-case regex
in my example. Or is there a you-goofed post for me still sitting in the usenet
forwarding queues somewhere? ;-)

>             n += 1
>    t1 = time.time()
>
>    print "n=%d time=%.2f" % (n, (t1-t0))
>
>tscan1.py:
>
>    import sys, time, mmap, os, re
>    fn = sys.argv[1]
>    fh=os.open(fn,os.O_RDONLY)
>    s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
>    t0 = time.time()
>    n = 0
>    for mat in re.split("XXXXX", s):
>        n += 1
>    t1 = time.time()
>
>    print "n=%d time=%.2f" % (n, (t1-t0))
>
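For anyone trying this on Python 3, a minimal sketch of the mmap scan follows. It assumes a bytes pattern (an mmap object exposes bytes, not str) and counts matches with finditer instead of building the split list; count_pieces_mmap is a hypothetical name, not something from the scripts above.

```python
import mmap
import re

SEP = re.compile(rb"XXXXX")  # bytes pattern, since mmap exposes bytes


def count_pieces_mmap(path):
    """Count the pieces a split on SEP would yield, without building the list."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; note that an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Number of pieces is number of separators plus one.
            return sum(1 for _ in SEP.finditer(mm)) + 1
```

For a literal separator like this one, the count should agree with len(re.split(...)) over the whole file contents.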
>The mmap version is almost obviously correct, assuming what we want to do is
>split the file on "XXXXX".  The buffered read version is almost certainly
>incorrect, given our understanding that corner cases lurk at buffer
>boundaries.
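
The boundary problem can be shown on a tiny input. This is only an illustration in Python 3 with made-up data and a hypothetical helper, naive_chunk_count, that splits each chunk on its own the way the buffered script does.

```python
import re


def naive_chunk_count(data, chunk_size, sep=b"XXXXX"):
    """Split each fixed-size chunk independently and add up the pieces."""
    n = 0
    for start in range(0, len(data), chunk_size):
        n += len(re.split(sep, data[start:start + chunk_size]))
    return n


data = b"aaXXXXXbb"  # one separator, so two pieces
whole = len(re.split(b"XXXXX", data))            # 2
chunked = naive_chunk_count(data, chunk_size=4)  # 3
```

With 4-byte chunks the separator straddles a boundary and is never seen, and every chunk still contributes at least one piece, so the chunked count drifts away from the true one. That may be why the buffered n in the runs quoted in this thread comes out higher than the mmap n.
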
>I took the file from Bengt Richter's example and replicated it a bunch of
>times to get a 122MB file.  I then ran the above two programs against it:
>
>    % python tscan1.py splitX
>    n=2112001 time=8.88
>    % python tscan0.py splitX
>    n=2139845 time=10.26
>
>So the mmap'd version is within 15% of the performance of the buffered read
(that is, the buffered read with the regex recompilation in the loop ;-)
>version and we don't have to solve the problem of any corner cases (note the
>different values of n).  I'm happy to take the extra runtime in exchange for
>simpler code.
>
Agreed. Hm, I wonder whether the OS notices sequential page faults and schedules
speculative read-ahead. Hm2, I wonder whether you could just touch bytes ahead of
the scan from another, coordinated thread to trigger that if it isn't happening
already ;-) Not worth it for 15% though ;-)
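
For what "splitting correctly" in the buffered version might cost, here is a minimal sketch in Python 3. It assumes a fixed literal separator (a variable-length or greedy pattern could match differently at a boundary) and uses a hypothetical function name, count_pieces_buffered. The pattern is compiled once outside the loop, though the re module caches compiled patterns, so the saving is probably a cache lookup per call and not a full recompile.

```python
import io
import re

SEP = re.compile(rb"XXXXX")  # compiled once, outside the read loop


def count_pieces_buffered(f, chunk_size=4096):
    """Count pieces split on SEP, carrying the unfinished tail across reads."""
    n = 0
    carry = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        pieces = SEP.split(carry + chunk)
        carry = pieces.pop()  # last piece may continue into the next read
        n += len(pieces)      # only pieces closed by a separator are final
    return n + 1              # whatever is left in carry is the last piece


demo = io.BytesIO(b"abXXXXXcdXXXXXef")
demo_count = count_pieces_buffered(demo, chunk_size=4)
```

The carry grows without bound if the file contains no separators at all, so this trades the boundary bug for a worst-case memory cost.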

Regards,
Bengt Richter