High memory usage - program mistake or Python feature?

Jack Diederich jack at performancedrivers.com
Fri May 23 08:50:16 EDT 2003


On Fri, May 23, 2003 at 01:27:46PM +0100, Ben S wrote:
> I wrote a little CGI script that reads in a file like so:
> 
> def LoadLogFile(filename):
>     """Loads a log file as a collection of lines"""
>     try:
>         logFile = file(filename, 'rU')
>         lines = map(string.strip, logFile.readlines())
>     except IOError:
>         return False
>     return lines
> 
> Then it processes it with this function a few times:
> 
> def GetLinesContainingCommand(lines, commandName):
>     """Find all the lines containing that command in the logs"""
>     pattern = re.compile(" Log \w+: " + commandName + " ")
>     return [eachLine for eachLine in lines if pattern.search(eachLine)]
> 
> The 'problem' was that, when operating on a 50MB file, the memory usage
> (according to ps on Linux) rocketed to just over 150MB. Since there's no

Well, you are definitely keeping at least one copy of the whole file around,
plus the per-line overhead of a separate string object for every line. While
the map(string.strip, ...) runs you also briefly hold a second, stripped copy
of each line, which is where a lot of that 150MB comes from.
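Even if you do want every line in memory, you can avoid holding both copies
at once by stripping as you read (just a sketch, the function name is made up):

def load_stripped(filename):
  lines = []
  for line in open(filename, 'rU'):
    lines.append(line.strip())   # keep only the stripped copy of each line
  return lines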

Better still, you could check the lines as you read them and only store a
copy of the ones you want to keep:

import re

def do_everything(filename):
  try:
    fob = open(filename, 'rU')   # open the file named by the argument
  except IOError:
    return False

  cmds = ('ssh', 'adduser', 'top', 'whatever') # commands we care about

  # { 'command' : compiled_re }
  cmd_res = dict(zip(cmds, map(re.compile, cmds)))

  # { 'command' : [list of lines that match] }
  cmds_matched = dict(zip(cmds, [[] for x in cmds])) # as obfu as python gets

  for line in fob:   # iterate the file object, one line in memory at a time
    for cmd in cmds:
      if cmd_res[cmd].search(line):
        cmds_matched[cmd].append(line)
        break # from your example, no two commands can match the same line

  return cmds_matched
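
You'd call it something like this (the path and the "couldn't read" message
are just placeholders):

matched = do_everything('/var/log/commands.log')
if matched is False:
  print "couldn't read the log file"
else:
  for cmd, hits in matched.items():
    print cmd, len(hits), "matching lines"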

-jackdied




