Could you suggest optimisations?

Tue Jan 13 18:57:22 EST 2009

Barak, Ron wrote:
> Hi,
> 
> In the attached script, the longest time is spent in the following 
> functions (verified by psyco log):

I cannot help but wonder whether you really need all the rigmarole
with file pointers, offsets, and tell() calls, instead of simply

for line in open(...):
   do your processing.
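
To make that concrete, here is a minimal sketch of the plain-iteration
approach. It is only an illustration, written in Python 3 syntax: the name
matching_lines is made up, and I am assuming the regex is already compiled
and that per-line matching is all that is needed.

```python
import re

def matching_lines(lines, regex):
    """Yield (line, groups) for every line that regex matches at its start.

    `lines` can be an open file object or any iterable of strings.
    `regex` is assumed to be a compiled pattern (hypothetical usage).
    """
    for line in lines:
        line = line.rstrip("\n")
        # Test first, and only collect the groups when the line matched.
        if regex.match(line):
            yield line, regex.findall(line)
```

Passing an open file as `lines` lets the file object do its own buffering,
with no seek() or tell() bookkeeping in the loop.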

> 
>     def match_generator(self,regex):
>         """
>         Generate the next line of self.input_file that
>         matches regex.
>         """
>         generator_ = self.line_generator()
>         while True:
>             self.file_pointer = self.input_file.tell()
>             if self.file_pointer != 0:
>                 self.file_pointer -= 1
>             if (self.file_pointer + 2) >= self.last_line_offset:
>                 break
>             line_ = generator_.next()
>             print "%.2f%%   \r" % (((self.last_line_offset -
>                 self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),
>             if not line_:
>                 break
>             else:
>                 match_ = regex.match(line_)
>                 groups_ = re.findall(regex,line_)
>                 if match_:
>                     yield line_.strip("\n"), groups_
>  
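
On the progress display: the loop above calls tell() and prints once for
every line, which may account for part of the time. One possible way to keep
the feedback is to count bytes as they are read and write only when the
displayed value changes. This is a sketch in Python 3 syntax, not a drop-in
replacement: with_progress, total_bytes, and the report callback are names I
made up, and it counts up towards 100% rather than down as the original does.

```python
import sys

def with_progress(lines, total_bytes, report=None):
    """Yield each line unchanged, reporting whole-number percentages.

    `report` is a hypothetical callback taking the percentage; by default
    the percentage is written to stderr followed by a carriage return.
    """
    if report is None:
        report = lambda pct: sys.stderr.write("%3d%%\r" % pct)
    seen = 0
    last_pct = -1
    for line in lines:
        seen += len(line)
        pct = (100 * seen) // total_bytes if total_bytes else 100
        # Report only when the displayed value actually changes.
        if pct != last_pct:
            report(pct)
            last_pct = pct
        yield line
```

If the file is opened in binary mode, len(line) counts bytes, so the size
reported by os.path.getsize() can serve as total_bytes. The number of
writes is then bounded by 101 however many lines the file has.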
>     def get_matching_records_by_regex_extremes(self,regex_array):
>         """
>         Function will:
>         Find the record matching the first item of regex_array.
>         Will save all records until the last item of regex_array.
>         Will save the last line.
>         Will remember the position of the beginning of the next line in
>         self.input_file.
>         """
>         start_regex = regex_array[0]
>         end_regex = regex_array[len(regex_array) - 1]
>  
>         all_recs = []
>         generator_ = self.match_generator
>  
>         try:
>             match_start,groups_ = generator_(start_regex).next()
>         except StopIteration:
>             return(None)
>  
>         if match_start != None:
>             all_recs.append([match_start,groups_])
>  
>             line_ = self.line_generator().next()
>             while line_:
>                 match_ = end_regex.match(line_)
>                 groups_ = re.findall(end_regex,line_)
>                 if match_ != None:
>                     all_recs.append([line_,groups_])
>                     return(all_recs)
>                 else:
>                     all_recs.append([line_,[]])
>                     line_ = self.line_generator().next()
>  
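
The function above creates a fresh line_generator() for each line it reads.
The same "collect from the first start match to the first end match" logic
can be written over a single iterator. Again this is only a sketch in
Python 3 syntax: record_between is a made-up name, and returning the partial
record when the input ends before end_regex matches is my own choice, not
something the original does.

```python
import re

def record_between(lines, start_regex, end_regex):
    """Return [[line, groups], ...] from the first line matching
    start_regex through the first later line matching end_regex.

    Returns None if no line matches start_regex.
    """
    lines = iter(lines)  # one shared iterator for both loops
    for line in lines:
        line = line.rstrip("\n")
        if start_regex.match(line):
            record = [[line, start_regex.findall(line)]]
            break
    else:
        return None  # start_regex never matched

    for line in lines:
        line = line.rstrip("\n")
        if end_regex.match(line):
            record.append([line, end_regex.findall(line)])
            return record
        record.append([line, []])
    return record  # input ended before end_regex matched (assumption)
```

Because both loops share one iterator, the second loop resumes on the line
after the start match, and a later call can continue from where this one
stopped without remembering any file offset.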
>     def line_generator(self):
>         """
>         Generate the next line of self.input_file, and update
>         self.file_pointer to the beginning of that line.
>         """
>         while self.input_file.tell() <= self.last_line_offset:
>             self.file_pointer = self.input_file.tell()
>             line_ = self.input_file.readline()
>             if not line_:
>                 break
>             yield line_.strip("\n")
> 
> I was trying to think of optimisations, so I could cut down on 
> processing time, but got no inspiration.
> (I need the "print "%.2f%%   \r" ..." line for user's feedback).
> 
> Could you suggest any optimisations?
> Thanks,
> Ron.
>  
>  
> P.S.: Examples of processing times are:
> 
>         * 2m42.782s  on two files with combined size of    792544 bytes
>           (no matches found).
>         * 28m39.497s on two files with combined size of 4139320 bytes
>           (783 matches found). 
> 
>     These times are quite unacceptable, as a normal input to the program
>     would be ten files with combined size of ~17MB.
> 
> 