optimizations

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Apr 23 00:00:21 EDT 2013


On Mon, 22 Apr 2013 21:19:23 -0400, Rodrick Brown wrote:

> I would like some feedback on possible solutions to make this script run
> faster.
> The system is pegged at 100% CPU and it takes a long time to complete.

Have you profiled the app to see where it is spending all its time?
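
If you have not profiled before, the standard library's cProfile module 
is the usual place to start. Here is a minimal sketch for Python 3. Note 
that process_lines and profile_report are names I made up for 
illustration; substitute the real per-line work from your script.

```python
import cProfile
import io
import pstats


def process_lines(lines):
    # Hypothetical stand-in for the script's real per-line work.
    return [line.replace('cdn.xx', 'www.xx') for line in lines]


def profile_report(func, *args):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and show at most the ten costliest entries.
    stats.sort_stats('cumulative').print_stats(10)
    return result, buffer.getvalue()


if __name__ == '__main__':
    sample = ['GET http://cdn.xx/a.png\n'] * 100000
    _, report = profile_report(process_lines, sample)
    print(report)
```

For a whole script, running "python -m cProfile -s cumulative script.py" 
gives a similar report without touching the code.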

What does "a long time" mean? For instance:

"It takes two hours to process a 15KB file" -- you have a problem.

"It takes 20 minutes to process a 15GB file" -- and why are you 
complaining?


Or somewhere in the middle... 


But before profiling, I suggest you clean up the program. For example:

        if args.inputfile and os.path.exists(args.inputfile):

Don't do that. There really isn't any point in checking whether the input 
file exists, since:

1) Just because it exists doesn't mean you can read it;

2) Just because you can read it doesn't mean it is a valid gzip file;

3) Just because it is a valid gzip file that you can read *now*, doesn't 
mean that it still will be in 10 milliseconds when you actually try to 
open the file.


A lot can happen in 10ms, or 1ms. The file might be deleted, or 
overwritten, or permissions changed. Change that to:

        try:
            with gzip.open(args.inputfile) as datafile:
                for line in datafile:

and catch the exception if the file doesn't exist, or cannot be read. 
You already do that, which just demonstrates that the call to 
os.path.exists is a waste of effort. 
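
To make the shape of that concrete, here is a rough sketch for Python 3. 
The function count_lines is invented for the example, and which 
exceptions you want to catch is an assumption on my part; your script may 
need to handle others.

```python
import gzip
import sys


def count_lines(path):
    """Return the number of lines in a gzip file, or None if unreadable."""
    try:
        # 'rt' opens the compressed file in text mode (Python 3).
        with gzip.open(path, 'rt') as datafile:
            return sum(1 for _ in datafile)
    except (OSError, EOFError) as err:
        # OSError covers a missing file, a permission problem, and data
        # that is not gzip at all; EOFError covers a truncated file.
        print('cannot read %s: %s' % (path, err), file=sys.stderr)
        return None
```

No separate existence check is needed: a missing file, an unreadable file 
and a file that is not gzip all end up in the same except clause.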


Then look for wasted effort like this:

        line = line.replace('cdn.xxx', 'www.xxx')
        line = line.replace('cdn.xx', 'www.xx')


Surely the first line is redundant, since it would be correctly caught 
and replaced by the second?
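
A quick way to convince yourself, assuming the two patterns really are 
as shown (the function names and sample strings here are mine, not from 
your script):

```python
def rewrite_both(line):
    # The two calls as they appear in the original script.
    line = line.replace('cdn.xxx', 'www.xxx')
    line = line.replace('cdn.xx', 'www.xx')
    return line


def rewrite_once(line):
    # Only the shorter pattern; 'cdn.xx' is a prefix of 'cdn.xxx'.
    return line.replace('cdn.xx', 'www.xx')


samples = [
    'http://cdn.xxx/a',
    'http://cdn.xx/b',
    'cdn.xxx and cdn.xx together',
    'nothing to replace here',
]
# Both versions give the same result on every sample line.
assert all(rewrite_both(s) == rewrite_once(s) for s in samples)
```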

Also, you're searching the file system *for every line* in the input 
file. Pull this outside of the loop and have it run once:

                    if not os.path.exists(outdir):
                        os.makedirs(outdir)

Likewise for the output file, which you currently open and close for 
every line. It only needs to be opened and closed once.
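
Putting those two points together, the loop might end up looking 
something like this in Python 3. This is only a sketch: the function 
convert, its parameters and the default output name are invented here, 
since I don't know how your script builds its paths.

```python
import gzip
import os


def convert(inputfile, outdir, outname='output.log'):
    # outname is a made-up default; use whatever name the script builds.
    # Create the output directory once, before the loop starts.
    os.makedirs(outdir, exist_ok=True)
    outpath = os.path.join(outdir, outname)
    # Open both files once; the with block closes them at the end.
    with gzip.open(inputfile, 'rt') as datafile, open(outpath, 'w') as outfile:
        for line in datafile:
            outfile.write(line.replace('cdn.xx', 'www.xx'))
    return outpath
```

Passing exist_ok=True (Python 3.2 and later) also removes the need for 
the os.path.exists test on the directory.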

If it comes down to micro-optimizations to shave a few microseconds off, 
consider using string % formatting rather than the format method. But 
really, if you find yourself shaving microseconds off something that runs 
for ten minutes, you have to ask why you're bothering.
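
If you do want to measure it, timeit is the tool. A small sketch, with 
made-up values for host and path; the numbers will vary by machine and 
Python version, so measure rather than trusting anyone's claims:

```python
import timeit

# host and path are placeholder values chosen for the example.
setup = "host = 'www.xx'; path = '/index.html'"

percent = timeit.timeit("'%s%s' % (host, path)", setup=setup, number=100000)
method = timeit.timeit("'{0}{1}'.format(host, path)", setup=setup, number=100000)

print('%% formatting: %.4fs  str.format: %.4fs' % (percent, method))
```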



-- 
Steven


