File Read Cache - How to purge?

Signal mugi at pinesalad.net
Tue Aug 21 00:06:43 EDT 2007


As part of a larger script that will read through all the files on a
given drive, I was playing around with reading files and wanted to
see whether there is an optimal read size on my system.

What I noticed is that the file being read is "cached" for subsequent
reads. Based on some testing, the caching appears to be done by the
underlying OS (Windows in this case), but I have a few questions.

Here's a code sample:
-------------------------------------------------------------------------
import os, time

# Set the following two variables to two
# different large files on your system.
# Suggested size: 500 MB to 1 GB each.
testfile1 = "d:\\test1\\junk1.file"
testfile2 = "d:\\test1\\junk2.file"

def readfile(filename):
    size = os.path.getsize(filename)
    bufsize = 4096
    print filename, size, "Bytes"

    # Time a full read of the file at each buffer size
    # from 4096 up to 131072, doubling each pass.
    while bufsize < 132000:
        start = time.clock()

        f = open(filename, "rb")
        buf = f.read(bufsize)
        while buf:
            buf = f.read(bufsize)
        f.flush()    # note: put here as a test;
                     # it doesn't make a difference
        f.close()

        end = time.clock()
        print bufsize, round(end - start, 3)
        bufsize = bufsize * 2

    print

# Comment out the second and third readfile calls and run
# the program twice to see a similar result for testfile1.
readfile(testfile1)
readfile(testfile1)
readfile(testfile2)
-----------------------------------------------------------------


Sample output for the first readfile(testfile1) call:
d:\test1\junk1.file 759167228 Bytes
4096 20.366
8192 0.923
16384 0.783
32768 0.737
65536 0.74
131072 0.82

After the first read test at 4096, subsequent read tests appear to be
served from a cache, even though the file is closed before each new
read test starts.
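
The effect is easy to reproduce in isolation. Here is a minimal
sketch of what I mean: time two back-to-back full reads of the same
file. My understanding is that the first is a "cold" read from disk
and the second a "warm" read served by whatever is doing the caching.

-------------------------------------------------------------------------
import time

def timed_read(filename, bufsize=65536):
    # Read the whole file and return the elapsed time in seconds.
    # On Windows, time.clock() is a high-resolution wall-clock timer.
    start = time.clock()
    f = open(filename, "rb")
    while f.read(bufsize):
        pass
    f.close()
    return time.clock() - start

print "cold read:", round(timed_read("d:\\test1\\junk1.file"), 3)
print "warm read:", round(timed_read("d:\\test1\\junk1.file"), 3)
-------------------------------------------------------------------------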

Sample output for the second readfile(testfile1) call:
d:\test1\junk1.file 759167228 Bytes
4096 1.258
8192 0.944
16384 0.795
32768 0.743
65536 0.725
131072 0.826

OK, I didn't expect much difference here given the first run, but
note that the 4096 test now takes 1.258 seconds instead of 20.366.

Sample output for testfile2:
d:\test1\junk2.file 1142511616 Bytes
4096 31.514
8192 1.417
16384 1.202
32768 1.11
65536 1.089
131072 1.245

Same situation as in the first sample for testfile1: the 4096 test
is not cached, but subsequent reads are.

Now some things to note:

The file does seem to be cached, yet on my system only ~2 MB of
additional memory is used while the program runs, and that 2 MB is
released when the script exits.
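
If anyone wants to script that memory check instead of watching Task
Manager, here is a sketch using ctypes and the Win32
GetProcessMemoryInfo call; treating the process working set as the
right counter to watch is an assumption on my part.

-------------------------------------------------------------------------
import ctypes
from ctypes import wintypes

# Mirrors the Win32 PROCESS_MEMORY_COUNTERS structure.
class PROCESS_MEMORY_COUNTERS(ctypes.Structure):
    _fields_ = [("cb", wintypes.DWORD),
                ("PageFaultCount", wintypes.DWORD),
                ("PeakWorkingSetSize", ctypes.c_size_t),
                ("WorkingSetSize", ctypes.c_size_t),
                ("QuotaPeakPagedPoolUsage", ctypes.c_size_t),
                ("QuotaPagedPoolUsage", ctypes.c_size_t),
                ("QuotaPeakNonPagedPoolUsage", ctypes.c_size_t),
                ("QuotaNonPagedPoolUsage", ctypes.c_size_t),
                ("PagefileUsage", ctypes.c_size_t),
                ("PeakPagefileUsage", ctypes.c_size_t)]

def working_set_bytes():
    # Memory charged to the current process as Windows accounts it.
    pmc = PROCESS_MEMORY_COUNTERS()
    pmc.cb = ctypes.sizeof(pmc)
    ctypes.windll.psapi.GetProcessMemoryInfo(
        ctypes.windll.kernel32.GetCurrentProcess(),
        ctypes.byref(pmc), pmc.cb)
    return pmc.WorkingSetSize

print working_set_bytes(), "bytes in working set"
-------------------------------------------------------------------------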

If you comment out the second and third readfile lines (as noted in
the code):

a. Run the program twice; even though the first run has exited, the
cache is not cleared.

b. If you open another command prompt and run the script, the file
is still cached.

c. If you close both command prompts, open a new one, and run the
script, the file is still cached.

It isn't "cleared" until another large file is read.

My questions are:

1. I don't quite understand how, after one full read of a file,
another full read of the same file can be "cached" so effectively
while consuming so little memory. What exactly is being cached that
improves the second read of the file so much?

2. Is there any way to take advantage of this "caching" by priming
it without reading through the entire file first?

3. If the answer to #2 is no, is there a way to purge this "cache"
so that my routine gets an accurate cold-read result, without having
to read another large file first? The sketch below shows the kind of
cache-bypassing read I have been wondering about.
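
To make question 3 concrete, here is an untested sketch of such a
read, using Win32's FILE_FLAG_NO_BUFFERING flag via ctypes. As I
understand the MSDN docs, this flag makes reads bypass the system
cache rather than purge it, and it requires sector-aligned buffers
and sector-multiple read sizes, so the buffer is allocated with
VirtualAlloc (page-aligned). There is no error handling; treat it
strictly as a sketch. Whether closing such a handle also evicts
pages that are already cached is exactly what I am unsure about.

-------------------------------------------------------------------------
import ctypes, time

GENERIC_READ           = 0x80000000
OPEN_EXISTING          = 3
FILE_FLAG_NO_BUFFERING = 0x20000000
MEM_COMMIT_RESERVE     = 0x3000       # MEM_COMMIT | MEM_RESERVE
PAGE_READWRITE         = 0x04
MEM_RELEASE            = 0x8000

kernel32 = ctypes.windll.kernel32
kernel32.CreateFileA.restype = ctypes.c_void_p
kernel32.VirtualAlloc.restype = ctypes.c_void_p

def unbuffered_read_time(filename, bufsize=65536):
    # VirtualAlloc returns page-aligned memory, which satisfies
    # the sector-alignment requirement of FILE_FLAG_NO_BUFFERING.
    buf = ctypes.c_void_p(kernel32.VirtualAlloc(
        None, bufsize, MEM_COMMIT_RESERVE, PAGE_READWRITE))
    handle = ctypes.c_void_p(kernel32.CreateFileA(
        filename, ctypes.c_uint(GENERIC_READ), 0, None,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, None))
    # A real version should check for INVALID_HANDLE_VALUE and
    # a NULL buffer before going any further.
    nread = ctypes.c_ulong(0)
    start = time.clock()
    while kernel32.ReadFile(handle, buf, bufsize,
                            ctypes.byref(nread), None) and nread.value:
        pass
    elapsed = time.clock() - start
    kernel32.CloseHandle(handle)
    kernel32.VirtualFree(buf, 0, MEM_RELEASE)
    return elapsed

print "unbuffered:", round(unbuffered_read_time("d:\\test1\\junk1.file"), 3)
-------------------------------------------------------------------------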



