Could you suggest optimisations?

Tue Jan 13 10:51:54 EST 2009

Hi,

In the attached script, most of the time is spent in the following functions (verified with the psyco log):

    def match_generator(self,regex):
        """
        Generate the next line of self.input_file that
        matches regex.
        """
        generator_ = self.line_generator()
        while True:
            self.file_pointer = self.input_file.tell()
            if self.file_pointer != 0:
                self.file_pointer -= 1
            if (self.file_pointer + 2) >= self.last_line_offset:
                break
            line_ = generator_.next()
            print "%.2f%%   \r" % (((self.last_line_offset - self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),
            if not line_:
                break
            else:
                match_ = regex.match(line_)
                groups_ = re.findall(regex,line_)
                if match_:
                    yield line_.strip("\n"), groups_

    def get_matching_records_by_regex_extremes(self,regex_array):
        """
        Function will:
        Find the record matching the first item of regex_array.
        Will save all records until the last item of regex_array.
        Will save the last line.
        Will remember the position of the beginning of the next line in
        self.input_file.
        """
        start_regex = regex_array[0]
        end_regex = regex_array[len(regex_array) - 1]

        all_recs = []
        generator_ = self.match_generator

        try:
            match_start,groups_ = generator_(start_regex).next()
        except StopIteration:
            return(None)

        if match_start != None:
            all_recs.append([match_start,groups_])

            line_ = self.line_generator().next()
            while line_:
                match_ = end_regex.match(line_)
                groups_ = re.findall(end_regex,line_)
                if match_ != None:
                    all_recs.append([line_,groups_])
                    return(all_recs)
                else:
                    all_recs.append([line_,[]])
                    line_ = self.line_generator().next()

    def line_generator(self):
        """
        Generate the next line of self.input_file, and update
        self.file_pointer to the beginning of that line.
        """
        while self.input_file.tell() <= self.last_line_offset:
            self.file_pointer = self.input_file.tell()
            line_ = self.input_file.readline()
            if not line_:
                break
            yield line_.strip("\n")

I have been trying to think of optimisations that would cut down the processing time, but have had no inspiration.
(I need the print "%.2f%%   \r" ... line to give the user feedback.)
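
A minimal sketch of how that feedback could be kept while writing to the terminal less often, in case the per-line print turns out to matter (I have not measured it separately). ProgressReporter and its step parameter are hypothetical names for illustration, not part of the attached script; the percentage formula is the same "remaining" figure as in match_generator above.

```python
import sys


class ProgressReporter(object):
    """Hypothetical helper: write the remaining percentage only when it
    has moved by at least `step` percentage points since the last write."""

    def __init__(self, total, step=1.0, out=sys.stdout):
        self.total = total        # e.g. self.last_line_offset
        self.step = step          # minimum change (in percent) between writes
        self.out = out
        self.last_shown = None    # percentage written most recently
        self.writes = 0           # number of writes actually performed

    def update(self, position):
        # Same "percent remaining" value as the original print statement.
        remaining = (self.total - position) * 100.0 / self.total
        if self.last_shown is None or abs(self.last_shown - remaining) >= self.step:
            self.out.write("%.2f%%   \r" % remaining)
            self.last_shown = remaining
            self.writes += 1
```

In match_generator this would presumably replace the print statement with a call such as reporter.update(self.input_file.tell()), with the reporter created once per file.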

Could you suggest any optimisations?
Thanks,
Ron.

P.S.: Examples of processing times:

 * 2m42.782s on two files with a combined size of 792544 bytes (no matches found).
 * 28m39.497s on two files with a combined size of 4139320 bytes (783 matches found).

These times are quite unacceptable, as a normal input to the program would be ten files with a combined size of ~17MB.
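
To put those figures in perspective, here is a rough sketch of the arithmetic. It assumes that throughput stays constant as the input grows and reads ~17MB as 17 * 1024 * 1024 bytes. Both are assumptions, and the second run already shows the rate dropping when matches are found. parse_time is a helper name made up for this sketch.

```python
def parse_time(text):
    """Convert a 'XmY.ZZZs' timing string to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Figures quoted above: (elapsed time, combined size in bytes).
runs = [("2m42.782s", 792544), ("28m39.497s", 4139320)]

# Throughput of each run in bytes per second.
rates = [size / parse_time(elapsed) for elapsed, size in runs]

# Assumption: "~17MB" is taken as 17 * 1024 * 1024 bytes.
target = 17 * 1024 * 1024

# Naive linear extrapolation to the normal input size, in minutes.
estimates = [target / rate / 60.0 for rate in rates]
```

On these assumptions the estimate for a normal input comes out at roughly one to two hours, depending on which of the two measured rates applies.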
-------------- next part --------------
A non-text attachment was scrubbed...
Name: _failover_multiple_files_client.py
Type: application/octet-stream
Size: 7771 bytes
Desc: _failover_multiple_files_client.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20090113/692ab02b/attachment.obj>