[Tutor] open/closing files and system limits

Michael P. Reilly arcege@shore.net
Wed, 13 Sep 2000 17:39:54 -0400 (EDT)


> 
> Hello all:
> 
> I've got a small python script that rapidly opens a file, reads the lines
> and closes the file.  This procedure is in a for loop.
> 
> for file in catalogoffilestoprocess.readlines():
>     currentfile = open(file[:-1]) # Only way I could figure to strip
>                                   # newline from filename
>     for line in currentfile.readlines():
>         if blah:
>             blah
>         elif blah:
>             blah
>     currentfile.close()
> 
> 
> Well that works alright except for this.  My list of files contains about
> 1100 files to process.  It takes an extraordinary amount of time to
> run.  Watching it work, I can see that it rushes through a couple hundred,
> stops for several (i.e. 1-4) minutes, then continues.
> 
> I believe it has something to do with the default file limits set in my
> kernel (Linux).  I was thinking that the system wasn't keeping track of
> the fact that the files were closed?  Somehow hitting that system ceiling?
> 
> Also running "time ./script catalogoffiles"
> returns (look at the time elapsed!  also just now noticed the pagefault
> and swap info, but am not familiar with time's output or what this
> indicates).
> 
> 463.40user 8.89system 8:05.39elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (11017major+100694minor)pagefaults 1556swaps
> 

I think this is in the FAQ somewhere as a performance problem.  There
are a few solutions, but my "favorite" is:
  while 1:
    lines = catalog.readlines(8192)  # get a "block" of lines
    if not lines:                    # empty list means end of file
      break
    for line in lines:
      file = open(line[:-1])
      ...

Each call to readlines(8192) reads roughly one disk block (about 8192
bytes), breaks it into lines (leaving the leftover for the next read),
and returns those lines as a list.  You then iterate through that list,
and the outer loop ends once readlines() has no lines left to return.
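
For anyone who wants to try this end to end, here is a minimal sketch of
the same batched-read loop in current Python 3 (not the Python of this
thread).  The names process_catalog and handle_line and the default size
hint are assumptions made up for illustration, and rstrip("\n") stands in
for the [:-1] slice.  It also swaps the inner readlines() for plain
iteration over the file, which is a different technique from the one
above: it should keep only one line of each data file in memory at a time.

```python
def process_catalog(catalog_path, handle_line, sizehint=8192):
    """Open every file named in the catalog and pass each line to handle_line.

    Hypothetical helper for illustration; returns the number of files processed.
    """
    processed = 0
    with open(catalog_path) as catalog:
        while True:
            # Ask for roughly `sizehint` bytes' worth of whole lines.
            names = catalog.readlines(sizehint)
            if not names:                  # an empty list means end of file
                break
            for name in names:
                name = name.rstrip("\n")   # drop the trailing newline
                if not name:               # skip blank catalog lines
                    continue
                with open(name) as current:    # closed when the block exits
                    for line in current:       # read one line at a time
                        handle_line(line)
                processed += 1
    return processed
```

A call might look like process_catalog("catalogoffiles", my_handler),
where my_handler is whatever function holds the if/elif logic.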

You can think of the breakdown as:
  Block 0   line 0
            line 1
            line 2
            line 3
            line 4
            line 5
  Block 1   line 6
            line 7
            long line 8
            part of line 9
  Block 2   rest of line 9 (returned in third call to catalog.readlines)
            line 10
            line 11
            line 12
            line 13
  Block 3   line 14
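
To see the grouping for yourself, the sketch below runs the same loop
over an in-memory file.  io.StringIO is my stand-in for the catalog, and
the helper name line_batches and the tiny size hint of 20 are made up for
the demo.  As documented for Python 3, each call hands back whole lines
only, so a line that straddles a block boundary comes back in one piece.
Exactly where each batch ends depends on the implementation, so the code
does not assume it.

```python
import io

def line_batches(fileobj, sizehint):
    """Yield successive lists of lines from fileobj.readlines(sizehint)."""
    while True:
        batch = fileobj.readlines(sizehint)
        if not batch:      # empty list: nothing left to read
            return
        yield batch

# Fifteen short lines, mirroring "line 0" .. "line 14" in the diagram.
text = "".join("line %d\n" % i for i in range(15))
batches = list(line_batches(io.StringIO(text), 20))

# Show which lines each call returned.
for number, batch in enumerate(batches):
    print("call", number, "->", [line.rstrip("\n") for line in batch])
```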

I hope this helps.

  -Arcege

-- 
------------------------------------------------------------------------
| Michael P. Reilly, Release Manager  | Email: arcege@shore.net        |
| Salem, Mass. USA  01970             |                                |
------------------------------------------------------------------------