[Tutor] question about run time

Tue May 2 23:49:49 CEST 2006

Hi John,

You can try something like the profiler, which will say where most of the 
program's time is being spent.  We can find documentation on the Python 
profiler here:

     http://www.python.org/doc/lib/profile.html
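
As a rough sketch of what a profiling run can look like: the snippet below uses the standard `profile` and `pstats` modules. The functions slow_sum() and main() are made-up stand-ins for your own code, and the snippet assumes a current Python 3 interpreter.

```python
import profile
import pstats

def slow_sum(n):
    # Hypothetical stand-in for an expensive function in the real program.
    total = 0
    for i in range(n):
        total += i * i
    return total

def main():
    # Call the expensive function a few times so it dominates the report.
    return [slow_sum(20000) for _ in range(5)]

profiler = profile.Profile()
result = profiler.runcall(main)   # run main() under the profiler

stats = pstats.Stats(profiler)
# Print the five entries with the largest cumulative time.
stats.sort_stats("cumulative").print_stats(5)
```

The report lists, per function, how many times it was called and how much time was spent in it, which is usually enough to spot the hot loop.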

From a rough, low-level standpoint, there are tools like 'top' on Linux 
that let you see if a program is idling.  Another low-level program --- 
strace --- allows one to watch for system calls, and can be very useful 
for understanding the low-level performance of a program.

     http://www.liacs.nl/~wichert/strace/

I'm using Solaris on one of my systems, and it comes with a marvelous tool 
called 'dtrace':

     http://www.sun.com/bigadmin/content/dtrace/

So there are good tools for measuring performance from both a high-level 
and a low-level perspective, and we often need to jump between these 
levels to understand program performance.

Let's do some code review.

> It checks some text files for a user name and collects memory and inode 
> usage then adds them together and checks against a set limit...if the 
> limit is reached it calls a mail script to send an email warning.

From a cursory look at your program, I see one place which seems to be the 
tight inner loop of your program, in extractUserData().

#####################################################
     def extractUserData(self):
         print self.filePath
         fullList = open(self.filePath,"r").readlines()
         for line in fullList:
             #print "line", line
             singleList = line.split()
             try:
                 if singleList[1] == self.userName:
                    print line
                    return singleList[2],singleList[3]
             except:
                 pass
         return 0,0
#####################################################

This function is called in another loop in your main program, and it 
itself does lots of looping, so let's spend some time looking at this: I 
believe this will be worthwhile.

One small improvement you might want to make here is to avoid reading in 
the whole file at once.  That is, rather than:

     lines = open(filename).readlines()
     for line in lines:
         ...

it's often better to do:

     myfile = open(filename)
     for line in myfile:
         ...

This is a relatively minor detail, and a low-level one.
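
Applied to extractUserData(), that idea might look something like the sketch below. The name find_user() and the sample data are my own inventions, I'm assuming the second whitespace-separated field is the user name (as in your code), and I've written it for a current Python 3 interpreter. It also checks the field count explicitly instead of using a bare except, which is a separate change you may or may not want.

```python
import os
import tempfile

def find_user(path, user_name):
    """Return fields 2 and 3 of the first line whose second field is user_name."""
    with open(path) as f:
        for line in f:                 # the file object yields one line at a time
            fields = line.split()
            # Skip blank or short lines explicitly rather than catching every exception.
            if len(fields) >= 4 and fields[1] == user_name:
                return fields[2], fields[3]
    return 0, 0                        # same "not found" value as the original

# Tiny made-up data file, just to exercise the function.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("x alice 120 7\n\nx bob 300 12\n")

found = find_user(sample_path, "bob")
missing = find_user(sample_path, "carol")
print(found)
print(missing)
os.remove(sample_path)
```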

But a bigger payoff can occur if we take a higher-level look at what's 
happening.  From a high level, the premise of the program is that there's 
a set of text files.  For any particular user, some auxiliary information 
(inode and memory usage) is being stored in these files.

This is really crying out to be a database.  *grin* I don't know how much 
freedom you have to change things around, but if you can use a database to 
centralize all this information, that will be a very good thing.
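
To make the database idea a bit more concrete, here is a minimal sketch using the standard `sqlite3` module with an in-memory database. The table name user_usage, its columns, and the sample rows are all assumptions of mine, not anything taken from your program.

```python
import sqlite3

# In-memory database for illustration; a real setup would use a file or a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_usage (user_name TEXT, memory INTEGER, inodes INTEGER)"
)
conn.executemany(
    "INSERT INTO user_usage VALUES (?, ?, ?)",
    [("alice", 120, 7), ("bob", 300, 12), ("alice", 80, 3)],  # made-up rows
)

def total_usage(conn, user_name):
    """Sum memory and inode usage for one user; (0, 0) if the user is unknown."""
    row = conn.execute(
        "SELECT COALESCE(SUM(memory), 0), COALESCE(SUM(inodes), 0) "
        "FROM user_usage WHERE user_name = ?",
        (user_name,),
    ).fetchone()
    return row[0], row[1]

alice_total = total_usage(conn, "alice")
nobody_total = total_usage(conn, "nobody")
print(alice_total)
print(nobody_total)
```

The point is that the adding-up happens in one query, and the limit check becomes a comparison against the returned totals.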

If we really must keep things this way, I'd strongly recommend that we 
reconsider doing all the file opening/reading/scanning in the inner loop. 
I suspect that doing all that file opening and linear scanning in the 
inner loop is what dominates the program's running time.  The work is 
almost certainly I/O bound, and we want to get I/O out of tight loops 
like this.

Instead, we can do some preprocessing work up front.  If we read all the 
records at the very beginning and store those records in an in-memory 
dictionary, then extractUserData() can be a very simple lookup rather than 
a filesystem-wide hunt.
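
Here is one way that preprocessing step could be sketched. The function load_usage() and the sample files are hypothetical. I'm assuming that the second field is the user name, that the next two fields are whole numbers, and that usage should be summed across every matching line. It is written for a current Python 3 interpreter.

```python
import os
import tempfile

def load_usage(paths):
    """Read every file once and return {user_name: [total_field2, total_field3]}."""
    totals = {}
    for path in paths:
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 4:
                    continue           # skip blank or malformed lines
                entry = totals.setdefault(fields[1], [0, 0])
                entry[0] += int(fields[2])   # assumes integer values
                entry[1] += int(fields[3])
    return totals

# Two tiny made-up data files.
paths = []
for text in ("x alice 120 7\nx bob 300 12\n", "y alice 80 3\n"):
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    paths.append(path)

usage = load_usage(paths)              # all the file I/O happens here, once
for path in paths:
    os.remove(path)

# Afterwards, each per-user check is a plain dictionary lookup.
alice = usage.get("alice", [0, 0])
carol = usage.get("carol", [0, 0])
print(alice)
print(carol)
```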

It's the conceptual difference between:

########################################################
## Sketch; each line read from the file ends in a newline
########################################################
while True:
     word = raw_input("enter a word: ")
     for other in open('/usr/share/dict/words'):
         # Strip the line ending before comparing.
         if other.strip() == word:
             print "It's in"
             break
########################################################

vs:

#########################################################
all_words = {}
for word in open('/usr/share/dict/words'):
     # Strip the line ending so that later lookups can match.
     all_words[word.strip()] = True
while True:
     word = raw_input("enter a word: ")
     if word in all_words:
         print "It's in"
#########################################################

The former opens and scans the file for each input we get.  The latter 
does a bit of work up front, but it makes up for it if we go through the 
inner loop more than once.

If we were to do a scan for words just once, the former might be 
preferable since it might not need to read the whole file to give an 
answer.  But if we're going to do the scan for several people, the latter 
is probably the way to go.
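
To put rough numbers on that trade-off, the sketch below counts how many lines each approach reads for a few queries. The word list is generated on the fly (1000 made-up words standing in for /usr/share/dict/words), the function names are mine, and it is written for a current Python 3 interpreter.

```python
import os
import tempfile

# Made-up word list: word0 .. word999, one per line.
fd, words_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(1000):
        f.write("word%d\n" % i)

def scan_lookup(path, word):
    """Rescan the file for one word; return (found, lines_read)."""
    lines_read = 0
    with open(path) as f:
        for line in f:
            lines_read += 1
            if line.strip() == word:
                return True, lines_read    # can stop early on a hit
    return False, lines_read

def load_words(path):
    """Read the whole file once into a set for fast membership tests."""
    with open(path) as f:
        return set(line.strip() for line in f)

queries = ["word9", "word999", "missing"]

# Strategy 1: rescan the file for every query.
scan_cost = sum(scan_lookup(words_path, q)[1] for q in queries)

# Strategy 2: read every line once, then do in-memory lookups.
all_words = load_words(words_path)
preload_cost = len(all_words)   # the words are unique, so this equals lines read
answers = [q in all_words for q in queries]

single_cost = scan_lookup(words_path, "word9")[1]
os.remove(words_path)

print(scan_cost, preload_cost, single_cost)
```

With these inputs a single early hit costs far less than loading everything, but three queries already cost about twice as many line reads as loading the file once.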

Good luck to you!