Could you suggest optimisations ?
Terry Reedy
tjreedy at udel.edu
Tue Jan 13 18:57:22 EST 2009
Barak, Ron wrote:
> Hi,
>
> In the attached script, the longest time is spent in the following
> functions (verified by psyco log):
I cannot help but wonder whether you really need all the rigmarole
with file pointers, offsets, and tell() calls instead of simply

for line in open(...):
    do your processing
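
To make that concrete, here is a minimal sketch of the same start-to-end record collection written as a plain loop over lines. The name records_between and the sample patterns are made up for illustration and are not part of the attached script; it is written for current Python and assumes the regexes are already compiled.

```python
import re


def records_between(lines, start_regex, end_regex):
    """Yield one record per start..end span found in `lines`.

    Each record is a list of [line, groups] pairs: the starting line with
    its groups, every line in between with an empty list, and the ending
    line with its groups.
    """
    record = None  # None means we are still looking for a start line
    for raw in lines:
        line = raw.rstrip("\n")
        if record is None:
            # Only start lines are interesting until a record is open.
            if start_regex.match(line):
                record = [[line, start_regex.findall(line)]]
        elif end_regex.match(line):
            # The end line closes the record; hand it to the caller.
            record.append([line, end_regex.findall(line)])
            yield record
            record = None
        else:
            record.append([line, []])


def demo():
    # Hypothetical sample data; any iterable of lines works,
    # including an open file object.
    sample = ["noise\n", "BEGIN 1\n", "payload\n", "END 1\n", "noise\n"]
    start = re.compile(r"BEGIN (\d+)")
    end = re.compile(r"END (\d+)")
    for rec in records_between(sample, start, end):
        print(rec)
```

Because a file object is itself an iterable of lines, records_between(open(...), start, end) should behave the same way, with no tell() call needed inside the loop.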
>
> def match_generator(self,regex):
>     """
>     Generate the next line of self.input_file that
>     matches regex.
>     """
>     generator_ = self.line_generator()
>     while True:
>         self.file_pointer = self.input_file.tell()
>         if self.file_pointer != 0:
>             self.file_pointer -= 1
>         if (self.file_pointer + 2) >= self.last_line_offset:
>             break
>         line_ = generator_.next()
>         print "%.2f%% \r" % (((self.last_line_offset -
>             self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),
>         if not line_:
>             break
>         else:
>             match_ = regex.match(line_)
>             groups_ = re.findall(regex,line_)
>             if match_:
>                 yield line_.strip("\n"), groups_
>
> def get_matching_records_by_regex_extremes(self,regex_array):
>     """
>     Function will:
>     Find the record matching the first item of regex_array.
>     Will save all records until the last item of regex_array.
>     Will save the last line.
>     Will remember the position of the beginning of the next line in
>     self.input_file.
>     """
>     start_regex = regex_array[0]
>     end_regex = regex_array[len(regex_array) - 1]
>
>     all_recs = []
>     generator_ = self.match_generator
>
>     try:
>         match_start,groups_ = generator_(start_regex).next()
>     except StopIteration:
>         return(None)
>
>     if match_start != None:
>         all_recs.append([match_start,groups_])
>
>     line_ = self.line_generator().next()
>     while line_:
>         match_ = end_regex.match(line_)
>         groups_ = re.findall(end_regex,line_)
>         if match_ != None:
>             all_recs.append([line_,groups_])
>             return(all_recs)
>         else:
>             all_recs.append([line_,[]])
>         line_ = self.line_generator().next()
>
> def line_generator(self):
>     """
>     Generate the next line of self.input_file, and update
>     self.file_pointer to the beginning of that line.
>     """
>     while self.input_file.tell() <= self.last_line_offset:
>         self.file_pointer = self.input_file.tell()
>         line_ = self.input_file.readline()
>         if not line_:
>             break
>         yield line_.strip("\n")
>
> I was trying to think of optimisations, so I could cut down on
> processing time, but got no inspiration.
> (I need the "print "%.2f%% \r" ..." line for user's feedback).
>
> Could you suggest any optimisations ?
> Thanks,
> Ron.
>
>
> P.S.: Examples of processing times are:
>
> * 2m42.782s on two files with combined size of 792544 bytes
> (no matches found).
> * 28m39.497s on two files with combined size of 4139320 bytes
> (783 matches found).
>
> These times are quite unacceptable, as a normal input to the program
> would be ten files with combined size of ~17MB.
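
For scale, those timings amount to only a few kilobytes per second. A rough back-of-the-envelope calculation follows, assuming "~17MB" means 17 * 10**6 bytes and that the observed rates would carry over to a larger input (neither is established by the figures above):

```python
def throughput(size_bytes, minutes, seconds):
    """Return bytes processed per second for one timed run."""
    return size_bytes / (minutes * 60.0 + seconds)


# Figures quoted in the P.S. above.
small_run = throughput(792544, 2, 42.782)
large_run = throughput(4139320, 28, 39.497)

# Assumption: "~17MB" is taken as 17 * 10**6 bytes.
typical_input = 17 * 10**6

for label, rate in (("small run", small_run), ("large run", large_run)):
    # Projected wall-clock time if the whole input ran at this rate.
    projected_minutes = typical_input / rate / 60.0
    print("%s: %.0f bytes/s, about %.0f minutes for 17 MB"
          % (label, rate, projected_minutes))
```

The larger run is also the slower one per byte, so the projection from the small run is probably optimistic.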
>
>