[Tutor] question about run time
Danny Yoo
dyoo at hkn.eecs.berkeley.edu
Tue May 2 23:49:49 CEST 2006
Hi John,
You can try something like the profiler, which will say where most of the
program's time is being spent. We can find documentation on the Python
profiler here:
http://www.python.org/doc/lib/profile.html
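As a rough sketch of what a profiling run can look like: the example below uses cProfile, the C implementation that exposes the same interface as the profile module, and is written for Python 3. The functions slow_sum() and main() are made-up stand-ins for the real program, not anything from your script.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Hypothetical stand-in for the real work being measured.
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    # Hypothetical entry point: call the slow function many times.
    return [slow_sum(20000) for _ in range(50)]


def profile_report(func, limit=5):
    """Run func under the profiler and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(main))
```

The report lists each function with its call count and the time spent in it, which is usually enough to show which loop deserves attention first.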
From a rough, low-level standpoint, there are tools like 'top' on Linux
that let you see if a program is idling. Another low-level program ---
strace --- allows one to watch for system calls, and can be very useful
for understanding the low-level performance of a program.
http://www.liacs.nl/~wichert/strace/
I'm using Solaris on one of my systems, and it comes with a marvelous tool
called 'dtrace':
http://www.sun.com/bigadmin/content/dtrace/
So there are good tools for measuring performance from both a high-level
and a low-level perspective, and we often need to jump between these
levels to understand program performance.
Let's do some code review.
> It checks some text files for a user name and collects memory and inode
> usage then adds them together and checks against a set limit...if the
> limit is reached it calls a mail script to send an email warning.
From a cursory look at your program, I see one place which seems to be the
tight inner loop of your program, in extractUserData().
#####################################################
def extractUserData(self):
    print self.filePath
    fullList = open(self.filePath,"r").readlines()
    for line in fullList:
        #print "line", line
        singleList = line.split()
        try:
            if singleList[1] == self.userName:
                print line
                return singleList[2],singleList[3]
        except:
            pass
    return 0,0
#####################################################
This function is called in another loop in your main program, and it
itself does lots of looping, so let's spend some time looking at this: I
believe this will be worthwhile.
One small improvement you might want to make here is to avoid reading in
the whole file at once. That is, rather than:
    lines = open(filename).readlines()
    for line in lines:
        ...
it's often better to do:
    myfile = open(filename)
    for line in myfile:
        ...
This is a relatively minor detail, and a low-level one.
But a bigger payoff can occur if we take a higher-level look at what's
happening. From a high level, the premise of the program is that there's
a set of text files. For any particular user, some auxiliary information
(inode and memory usage) is being stored in these files.
This is really crying out to be a database. *grin* I don't know how much
freedom you have to change things around, but if you can use a database to
centralize all this information, that will be a very good thing.
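To make the database idea a little more concrete, here is a minimal sketch using the sqlite3 module from the standard library, written for Python 3. The table name user_usage and the columns user_name, memory, and inodes are my own guesses at a layout, not something taken from your files.

```python
import sqlite3


def build_usage_db(rows):
    """Create an in-memory database from (user_name, memory, inodes) rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per usage record.
    conn.execute(
        "CREATE TABLE user_usage (user_name TEXT, memory INTEGER, inodes INTEGER)"
    )
    conn.executemany("INSERT INTO user_usage VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def total_usage(conn, user_name):
    """Return (total memory, total inodes) for one user, or (0, 0) if absent."""
    query = (
        "SELECT COALESCE(SUM(memory), 0), COALESCE(SUM(inodes), 0) "
        "FROM user_usage WHERE user_name = ?"
    )
    return conn.execute(query, (user_name,)).fetchone()


if __name__ == "__main__":
    conn = build_usage_db([("alice", 10, 2), ("alice", 5, 1), ("bob", 7, 3)])
    print(total_usage(conn, "alice"))
```

With the records in one place, adding up a user's usage becomes a single query, and the database can do the searching for us.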
If we really must keep things this way, I'd strongly recommend that we
reconsider doing all the file opening/reading/scanning in the inner loop.
I suspect that doing all that file opening and linear scanning in the
inner loop is what strongly influences the program's performance. This is
certainly I/O bound, and we want to get I/O out of tight loops like this.
Instead, we can do some preprocessing work up front. If we read all the
records at the very beginning and store those records in an in-memory
dictionary, then extractUserData() can be a very simple lookup rather than
a filesystem-wide hunt.
It's the conceptual difference between:
########################################################
## Pseudocode; in reality, we'd strip line endings too
########################################################
while True:
    word = raw_input("enter a word")
    for other in open('/usr/share/dict/words'):
        if other == word:
            print "It's in"
            break
########################################################
vs:
#########################################################
all_words = {}
for word in open('/usr/share/dict/words'):
    all_words[word] = True
while True:
    word = raw_input("enter a word")
    if word in all_words:
        print "It's in"
#########################################################
The former opens and scans the file for each input we get. The latter
does a bit of work up front, but it makes up for it if we go through the
inner loop more than once.
If we were to do a scan for words just once, the former might be
preferable since it might not need to read the whole file to give an
answer. But if we're going to do the scan for several people, the latter
is probably the way to go.
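Applied to the usage files, the "read once, look up many times" approach might look something like the sketch below, written for Python 3. It assumes the layout that extractUserData() implies: whitespace-separated fields, the user name in the second field, and the two values of interest in the third and fourth. The names load_usage_file() and lookup() are made up for illustration.

```python
import os
import tempfile


def load_usage_file(path):
    """Read one usage file a single time; map user name -> (field 3, field 4)."""
    records = {}
    with open(path) as usage_file:
        for line in usage_file:
            fields = line.split()
            if len(fields) >= 4:
                # Keep the first line seen for each user, which mirrors the
                # early return in the original extractUserData().
                records.setdefault(fields[1], (fields[2], fields[3]))
    return records


def lookup(records, user_name):
    """Dictionary lookup that replaces the per-user scan of the file."""
    return records.get(user_name, (0, 0))


if __name__ == "__main__":
    # Build a small throwaway file so that the sketch runs on its own.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("x alice 10 20\ny bob 30 40\n")
        path = tmp.name
    records = load_usage_file(path)
    os.remove(path)
    print(lookup(records, "alice"))
    print(lookup(records, "nobody"))
```

Each file is then read once at startup, and every later call for a user is a dictionary lookup rather than another pass over the disk. As in the original function, the stored values are still strings, so they would need converting before being added together.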
Good luck to you!