High memory usage - program mistake or Python feature?

Gerald Klix Gerald.Klix at klix.ch
Fri May 23 09:20:59 EDT 2003


It is difficult to diagnose if we can't see the whole program,
but I suspect the re module makes some (superfluous) copies.
I would like to see the figures from LoadLogFile alone.

Some performance hints:
You can gain a little by using xreadlines, which does not read the
whole file at once. But if you then map the resulting sequence to
another sequence, you gain almost nothing, because the full list of
lines gets built anyway.
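
For example, a minimal sketch (DoSomethingWith is a hypothetical
per-line handler; xreadlines() is a file method from Python 2.1 on,
and in 2.2 and later a plain "for line in logFile" does the same):

logFile = file(filename, 'rU')
for line in logFile.xreadlines():
    DoSomethingWith(line.strip())   # one line in memory at a time
logFile.close()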

IMHO you have two options:

1. Use xreadlines and generators everywhere throughout the program.
That way only one line is held in main memory at a time; see the
first sketch after this list.
See http://www.python.org/doc/current/whatsnew/node5.html for examples.

2. Use a more low-level approach by memory-mapping the whole
file with the mmap module. In this case you have to tweak
your regular expressions a little bit, because the mapped file is
one big buffer rather than a sequence of lines.
You can use the buffer function to return a collection of pointers
into your memory-mapped file from GetLinesContainingCommand; see the
second sketch after this list.
This is the most efficient solution in terms of swap space and I/O
bandwidth utilisation.
See http://www.python.org/doc/current/lib/built-in-funcs.html for the
buffer function's documentation and
http://www.python.org/doc/current/lib/module-mmap.html for mmap's.
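
Here is a minimal sketch of option 1, keeping your two function
names (the file name and command in the last line are placeholders).
Under Python 2.2 you need the __future__ import; in 2.3 generators
are built in. Note two things: the IOError handling has to move to
the caller, because a generator does not touch the file until you
start iterating, and the result can only be iterated once, so to run
several searches you must call LoadLogFile again for each one:

from __future__ import generators
import re

def LoadLogFile(filename):
    """Yields the log file's lines one at a time, stripped."""
    logFile = file(filename, 'rU')
    for line in logFile.xreadlines():
        yield line.strip()
    logFile.close()

def GetLinesContainingCommand(lines, commandName):
    """Find all the lines containing that command in the logs"""
    pattern = re.compile(" Log \w+: " + commandName + " ")
    # Only the handful of matching lines is kept in memory.
    return [line for line in lines if pattern.search(line)]

matches = GetLinesContainingCommand(LoadLogFile('my.log'), 'foo')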
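
And a minimal sketch of option 2. The pattern is widened with [^\n]*
so that each match covers its whole line, since the mapped file is
one big buffer instead of a list of lines; re.finditer needs Python
2.2. The returned buffer objects point into the mapping without
copying anything, so keep the mmap object alive while you use them
and call str() on a hit when you need a real string:

import mmap, os, re

def GetLinesContainingCommand(filename, commandName):
    """Find all the lines containing that command, without copies"""
    f = open(filename, 'rb')
    data = mmap.mmap(f.fileno(), os.path.getsize(filename),
                     access=mmap.ACCESS_READ)
    # mmap objects support the buffer interface, so re can search
    # them in place; ACCESS_READ works on both Unix and Windows.
    pattern = re.compile(r"[^\n]* Log \w+: " + commandName + r" [^\n]*")
    return [buffer(data, m.start(), m.end() - m.start())
            for m in pattern.finditer(data)]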

HTH,
Gerald

Ben S wrote:
> I wrote a little CGI script that reads in a file like so:
> 
> def LoadLogFile(filename):
>     """Loads a log file as a collection of lines"""
>     try:
>         logFile = file(filename, 'rU')
>         lines = map(string.strip, logFile.readlines())
>     except IOError:
>         return False
>     return lines
> 
> Then it processes it with this function a few times:
> 
> def GetLinesContainingCommand(lines, commandName):
>     """Find all the lines containing that command in the logs"""
>     pattern = re.compile(" Log \w+: " + commandName + " ")
>     return [eachLine for eachLine in lines if pattern.search(eachLine)]
> 
> The 'problem' was that, when operating on a 50MB file, the memory usage
> (according to ps on Linux) rocketed to just over 150MB. Since there's no
> other significant storage in the script, I can only assume that the
> lines (corresponding to strings of between 40 and 90 ASCII characters)
> are being stored in such a way that their size is inflated to 3x their
> usual size. I've not specified any Unicode usage anywhere, nor does the
> text file in question use any characters above 127, as far as I know.
> The GetLinesContainingCommand function returns a tiny subset (no more
> than 20 or 30 lines out of tens of thousands) so I doubt it's that
> causing the problem.
> 
> So I guess my question is whether I've coded this inefficiently in terms
> of memory usage, or whether this type of overhead has to be expected?
> I'm pretty new to Python so the former sounds likely. Luckily I will
> rarely be operating on 50MB files, but I'm interested in knowing for any
> future scripts I write.
> 
> --
> Ben Sizer
> http://pages.eidosnet.co.uk/kylotan
> 
> 




