[Tutor] performance considerations

dman dsh8290@rit.edu
Mon, 3 Dec 2001 20:19:24 -0500


On Mon, Dec 03, 2001 at 01:37:50PM -0500, Andrei Kulakov wrote:
| On Mon, Dec 03, 2001 at 01:31:13PM -0500, dman wrote:
| > On Mon, Dec 03, 2001 at 01:10:41PM -0500, Andrei Kulakov wrote:
| > | On Mon, Dec 03, 2001 at 12:51:39PM -0500, dman wrote:
| > | > 
| > | > I'm working on a script/program to do some processing (munging) on
| > | > text files in a directory tree.  In addition, for convenience, all
| > | > non-interesting files are copied without munging.  My script, as it
| > | > stands, is quite CPU intensive so I would like to optimize it some.
| > | > 
| > | > A portion of the script generates strings by starting with 'a' and
| > | > "adding" to it.  Ie "a", "b", ..., "z", "aa", "ab", ..., "zz", "aaa".
| > | > Would it be better to use a list of one-char-strings than to modify a
| > | > single string?  Here's the code I have now (BTW, that funny-looking
| > 
| > I forgot to mention: if a list of strings is used, each yield
| > will yield "".join( the_list ), so the comparison is really
| > repeated string modification (creation) versus a single join.
| 
| Yeah, that's what I thought.. I think I remember someone saying that one
| join would be much faster.
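
For concreteness, here's roughly the shape of the two candidates
(this is a sketch, not my actual code; the list version only joins
at yield time) :

    CHARS = 'abcdefghijklmnopqrstuvwxyz'

    def names_via_string():
        # rebuild the whole string on every increment
        name = 'a'
        while True:
            yield name
            i = len(name) - 1
            while i >= 0 and name[i] == 'z':
                i -= 1
            if i < 0:
                name = 'a' * (len(name) + 1)    # 'zz' -> 'aaa'
            else:
                bumped = CHARS[CHARS.index(name[i]) + 1]
                name = name[:i] + bumped + 'a' * (len(name) - i - 1)

    def names_via_list():
        # keep a list of one-char strings; join only when yielding
        name = ['a']
        while True:
            yield ''.join(name)
            # bump the last char, carrying leftward like an odometer
            i = len(name) - 1
            while i >= 0 and name[i] == 'z':
                name[i] = 'a'
                i -= 1
            if i < 0:
                name.insert(0, 'a')             # 'zz' -> 'aaa'
            else:
                name[i] = CHARS[CHARS.index(name[i]) + 1]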

I did check out the 'profile' module before I left work, and it is
really easy to profile the code!  I'll test this tomorrow.
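
For the record, the whole dance is just this (munge_tree is a
stand-in name for my real entry point) :

    import profile

    def munge_tree():
        # stand-in for the real entry point; does some throwaway work
        for i in range(1000):
            str(i)

    # prints call counts and per-call/cumulative times per function
    profile.run('munge_tree()')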

| > | > One last question for now :
| > | > I traverse the interesting files line-by-line and check them for a
| > | > regex match, then modify the line if it matches properly.  Would it be
| > | > better (faster) to read in the whole file and treat it as a single
| > | > string?  Memory is not a problem.
| > | 
| > | Yeah, probably.. profile!
| > 
| > I want to speculate before I rewrite it :-).  Maybe Tim will tell me
| > something (since he is so familiar with the inner workings)?
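
For reference, the two shapes being compared are roughly these (the
pattern and the fix() munger are invented placeholders) :

    import re

    pattern = re.compile(r'interesting stuff')   # placeholder pattern

    def fix(match):
        # hypothetical munger: rewrite the matched text
        return match.group(0).upper()

    def munge_by_line(path):
        # one search-and-substitute per line
        out = []
        for line in open(path):
            out.append(pattern.sub(fix, line))
        return ''.join(out)

    def munge_whole(path):
        # slurp the file and substitute in one pass; memory is not
        # a problem
        return pattern.sub(fix, open(path).read())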

About half the time was spent in the function that iterates over the
lines, checking each against the regex and modifying it
appropriately.  The other half was spent finding adjacent duplicate
lines and eliminating them.
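
The duplicate-elimination half is nothing fancy; it has roughly this
shape :

    def drop_adjacent_duplicates(lines):
        # keep only the first line of any run of identical lines
        out = []
        prev = None
        for line in lines:
            if line != prev:
                out.append(line)
            prev = line
        return out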

The biggest problem with my regex was that it was inherently
line-based.
It looked something like :

    (.*)(interesting stuff)(.*)

then I would take "interesting stuff" and break it down to the part
I'm really interested in, change it and put it all back together.

Then I had a "duh" moment :  all I need to do is find "interesting
stuff", play with it, then use string.replace to put the new stuff in
the string.
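
In code, the simplification looks something like this (rewrite() is
an invented stand-in for the real munging) :

    import re

    pattern = re.compile(r'interesting stuff')   # placeholder

    def rewrite(found):
        # hypothetical: compute the new text from the old
        return found.upper()

    # old shape: capture the context and paste the line back together
    def munge_old(line):
        m = re.match(r'(.*)(interesting stuff)(.*)', line)
        if m:
            return m.group(1) + rewrite(m.group(2)) + m.group(3)
        return line

    # new shape: find the interesting part, then let replace() splice
    # the new text into the string
    def munge_new(text):
        m = pattern.search(text)
        if m:
            text = text.replace(m.group(0), rewrite(m.group(0)))
        return text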

Now I'm wondering whether the overhead of the string creation done
by replace() beats rebuilding the line from regex groups, and
whether it would be faster to just iterate over all the strings I
want to replace and call replace() for each, whether or not it
occurs in the current file.
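
That is, given some table of replacements, skip the searching
entirely (the table here is made up) :

    # hypothetical replacement table
    REPLACEMENTS = [
        ('old stuff',   'new stuff'),
        ('other stuff', 'shiny stuff'),
    ]

    def munge_blind(text):
        # replace() is a no-op when the substring isn't present, but
        # each call still scans the whole text once
        for old, new in REPLACEMENTS:
            text = text.replace(old, new)
        return text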

Are there any mutable strings in Python?  I imagine that could
improve performance.

In any event, the thing only takes 28s to run.  It seems much longer,
due to various settings in my environment.

-D

-- 

Failure is not an option.  It is bundled with the software.