[Tutor] script too slow

Sun Feb 23 12:57:02 2003

My script is running so slowly that it might prove useless, and I
wondered if there was a way to speed it up.

The script will convert Microsoft RTF to XML. I had originally written
the script in Perl, and am now converting it to Python.

I have completed two parts of the script. The first part uses regular
expressions to break each line into tokens. 

perl => 20 seconds
python => 45 seconds
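
For reference, the tokenizing step has roughly the shape below. This is
only a minimal sketch in current Python 3 syntax, not the actual code: the
pattern and the name tokenize_line are made up for illustration, and real
RTF has more cases than this handles. The one point it shows is compiling
the pattern once, outside the loop over lines.

```python
import re

# Hypothetical pattern, for illustration only: a control word with an
# optional numeric parameter, a control symbol, a brace, or a run of text.
TOKEN_RE = re.compile(r"\\[a-zA-Z]+-?\d*|\\[^a-zA-Z]|[{}]|[^\\{}]+")


def tokenize_line(line):
    """Return the list of tokens found in one line of RTF."""
    return TOKEN_RE.findall(line)


if __name__ == "__main__":
    print(tokenize_line(r"{\b1 bold\par}"))
```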

Not surprisingly, Python ran slower than Perl, which is designed around
regular expressions. However, the next part proved very disappointing to
me. This part reads each token, and determines if it is in a dictionary,
and takes action if it is.

perl => 4 seconds 
python => 40 seconds 

In sum, the first two steps take around 25 seconds with Perl, and 1
minute and 25 seconds with Python. There are at least 10 more steps, and
if Python proves as slow, then the Python version might take 6 to 8
minutes, whereas the Perl version took around 2.

Here is the specific problem I am having with the dictionaries.

class ProcessTokens:

    # put the dictionaries here instead?

    # first make an instance
    def __init__(self, file=None,copy=None,temp_dir=None):
        self.__file=file
        self.initiate_dictionaries()

    # there are around a dozen dictionaries, and altogether 
    # these dictionaries have 250 entries
    def initiate_dictionaries(self):
        """Assign dictionaries to self. These dictionaries give information
        about the tokens."""

        self.xml_sub = { 
        '&'                     :    '&amp;', 
        '>'                     :    '&gt;' ,
        '<'                     :    '&lt;'
        }
        # etc.

    # now use the dictionaries to process the tokens
    def process_cw(self, token, str_token, space):
        """Change the value of the control word by determining which
        dictionary it belongs to."""

        if token_changed == '*':
            pass
        elif self.needed_bool.has_key(token_changed):
            token_changed = self.needed_bool[token_changed]
        elif self.styles_1.has_key(token_changed):
            token_changed = self.styles_1[token_changed]
        elif self.styles_2.has_key(token_changed):
            token_changed = self.styles_2[token_changed]
            num = self.divide_num(num,2)

        # etc. There are around a dozen such statements

It is this last function, the "def process_cw", that eats up all the
clock time. If I skip over this function, then I chop around 30 seconds
off the script.
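
One alternative to the chain of has_key tests would be to merge the
separate dictionaries into a single table, so that each token costs one
lookup instead of up to a dozen. The following is a rough sketch in current
Python 3 syntax, under stated assumptions: the dictionary entries are
invented, halve_num is only a stand-in for self.divide_num(num, 2), and an
unknown token is assumed to pass through unchanged (the real code may do
something else).

```python
# Hypothetical entries; the real dictionaries hold about 250 in total.
NEEDED_BOOL = {"b": "bold"}
STYLES_1 = {"s": "style"}
STYLES_2 = {"fs": "font-size"}


def keep_num(num):
    """Leave the number untouched."""
    return num


def halve_num(num):
    """Stand-in for self.divide_num(num, 2); the real method may differ."""
    return num / 2


def build_table():
    """Merge the dictionaries into one: token -> (new_token, action)."""
    table = {}
    # Insert in reverse priority, so that a dictionary tested earlier in
    # the elif chain wins if the same key appears twice.
    sources = ((STYLES_2, halve_num), (STYLES_1, keep_num), (NEEDED_BOOL, keep_num))
    for mapping, action in sources:
        for token, new_token in mapping.items():
            table[token] = (new_token, action)
    return table


TABLE = build_table()  # built once, at import time


def process_cw(token, num):
    """Translate one control word with a single dictionary lookup."""
    if token == "*":
        return token, num
    # Assumption: a token found in no dictionary is returned unchanged.
    new_token, action = TABLE.get(token, (token, keep_num))
    return new_token, action(num)
```

Whether this actually helps would have to be measured; with only a dozen
lookups per token the chain itself may not be the real cost.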

I am wondering if I should make the dictionaries class attributes
rather than attributes of the actual instance? That is, put
them at the top of the class, and then access them with

ProcessTokens.needed_bool.has_key(token_changed)

?
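
To make the two options concrete, here is a small sketch of both layouts
with a timing loop, written in current Python 3 syntax. The class names and
the dictionary entries are invented for the example. It makes no claim
about which layout is faster; it only gives a way to compare them. In both
cases the dictionary is built once, not on every call.

```python
import timeit


class InstanceDicts:
    def __init__(self):
        # Built once per instance, when the object is created.
        self.needed_bool = {"b": "bold", "i": "italics"}

    def lookup(self, token):
        return self.needed_bool.get(token, token)


class ClassDicts:
    # Built once, when the class statement is executed.
    needed_bool = {"b": "bold", "i": "italics"}

    def lookup(self, token):
        return ClassDicts.needed_bool.get(token, token)


if __name__ == "__main__":
    for cls in (InstanceDicts, ClassDicts):
        obj = cls()
        seconds = timeit.timeit(lambda: obj.lookup("b"), number=1_000_000)
        print(cls.__name__, round(seconds, 3))
```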

The dictionary part of the script seems so slow that I am guessing I am
doing something wrong, perhaps that Python has to read in the dictionary
each time it starts the function.
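
Rather than guessing, the profiler in the standard library could show
where the time really goes. The sketch below uses cProfile and pstats from
current Python 3; process_cw and run here are trivial stand-ins, not the
real methods, and profile_report is a name made up for the example.

```python
import cProfile
import io
import pstats


def process_cw(token):
    """Stand-in for the real method; it only does a trivial lookup."""
    return {"*": "*"}.get(token, token)


def run(tokens):
    """Stand-in for the loop that feeds every token to process_cw."""
    return [process_cw(t) for t in tokens]


def profile_report(tokens):
    """Profile run(tokens) and return the report as a string."""
    profiler = cProfile.Profile()
    profiler.enable()
    run(tokens)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats()
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(["a", "b", "*"] * 1000))
```

The report lists each function with its call count and cumulative time, so
a function that is called once per token and dominates the run should
stand out.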

Thanks

Paul

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************