Could you suggest optimisations?
Barak, Ron
Ron.Barak at lsi.com
Tue Jan 13 10:51:54 EST 2009
Hi,
In the attached script, most of the time is spent in the following functions (verified by the psyco log):
def match_generator(self, regex):
    """
    Generate the next line of self.input_file that
    matches regex.
    """
    generator_ = self.line_generator()
    while True:
        self.file_pointer = self.input_file.tell()
        if self.file_pointer != 0:
            self.file_pointer -= 1
        if (self.file_pointer + 2) >= self.last_line_offset:
            break
        line_ = generator_.next()
        print "%.2f%% \r" % (((self.last_line_offset - self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),
        if not line_:
            break
        else:
            match_ = regex.match(line_)
            groups_ = re.findall(regex, line_)
            if match_:
                yield line_.strip("\n"), groups_
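One saving suggests itself here: the loop runs re.findall(regex, line_) on every line, even when regex.match fails, and findall then rescans a line that match has already examined. Below is a minimal sketch of the tightened loop; match_lines is a hypothetical stand-alone helper, and it assumes the groups you need are those of the single leading match — if you genuinely need every occurrence in the line, call findall only after match succeeds:

```python
import re

def match_lines(lines, regex):
    """Yield (stripped line, groups) for matching lines, scanning each line once."""
    for line in lines:
        match_ = regex.match(line)
        if match_:                      # group extraction happens only for actual matches
            yield line.rstrip("\n"), match_.groups()

pattern = re.compile(r"(\w+)=(\d+)")
hits = list(match_lines(["a=1\n", "no match here\n", "b=2\n"], pattern))
```

This drops one full regex pass per non-matching line, which is most lines when few matches are found.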
def get_matching_records_by_regex_extremes(self, regex_array):
    """
    Function will:
    Find the record matching the first item of regex_array.
    Will save all records until the last item of regex_array.
    Will save the last line.
    Will remember the position of the beginning of the next line in
    self.input_file.
    """
    start_regex = regex_array[0]
    end_regex = regex_array[-1]
    all_recs = []
    generator_ = self.match_generator
    try:
        match_start, groups_ = generator_(start_regex).next()
    except StopIteration:
        return None
    if match_start != None:
        all_recs.append([match_start, groups_])
    line_ = self.line_generator().next()
    while line_:
        match_ = end_regex.match(line_)
        groups_ = re.findall(end_regex, line_)
        if match_ != None:
            all_recs.append([line_, groups_])
            return all_recs
        else:
            all_recs.append([line_, []])
            line_ = self.line_generator().next()
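Two things stand out in this function. First, self.line_generator() is called afresh on every loop iteration, constructing a new generator object per line (it only works because the real state lives in the file object). Second, as in match_generator, re.findall runs even when end_regex does not match. A sketch of the loop with a single iterator created up front — collect_until is a hypothetical helper working over any iterator of lines:

```python
import re

def collect_until(lines, end_regex):
    """Accumulate [line, groups] records until a line matches end_regex.

    One iterator is advanced in place (no per-line generator construction).
    Returns the records on a match, or None if the end pattern never appears.
    """
    all_recs = []
    for line_ in lines:
        match_ = end_regex.match(line_)
        if match_:                      # groups extracted only on the terminating match
            all_recs.append([line_, list(match_.groups())])
            return all_recs
        all_recs.append([line_, []])
    return None

end_re = re.compile(r"END (\w+)")
recs = collect_until(iter(["first", "second", "END marker"]), end_re)
```

In the class itself the equivalent change is to bind generator_ = self.line_generator() once before the while loop and call generator_.next() thereafter.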
def line_generator(self):
    """
    Generate the next line of self.input_file, and update
    self.file_pointer to the beginning of that line.
    """
    while self.input_file.tell() <= self.last_line_offset:
        self.file_pointer = self.input_file.tell()
        line_ = self.input_file.readline()
        if not line_:
            break
        yield line_.strip("\n")
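line_generator itself pays for a tell() call plus a readline() call on every line. If len(line) equals the on-disk size of the line (binary mode, or plain ASCII text with "\n" endings — an assumption to check against your input), the per-line offsets can be derived arithmetically from a single tell() up front, and the file can be iterated directly so Python's read-ahead buffering does the work. A sketch as a stand-alone function with a hypothetical signature:

```python
import io

def offset_line_generator(f, last_line_offset):
    """Yield (offset, stripped line); offsets derived from line lengths.

    One tell() call in total instead of one per line.  Assumes len(line)
    equals the number of bytes the line occupies in the file.
    """
    pos = f.tell()
    for line in f:                      # buffered iteration; no readline()/tell() per line
        if pos > last_line_offset:
            break
        yield pos, line.rstrip("\n")
        pos += len(line)

sample = io.StringIO("alpha\nbeta\ngamma\n")
pairs = list(offset_line_generator(sample, 100))
```

Note that in Python 2, mixing plain file iteration with tell() is unreliable because of read-ahead; here tell() is only called before iteration starts, which is exactly the point of the rewrite.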
I have been trying to think of optimisations to cut down the processing time, but have had no inspiration.
(I do need the print "%.2f%% \r" ... line for user feedback.)
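The feedback can be kept while making it far cheaper: issued once per input line, the string formatting and terminal write dominate; printing only when the integer percentage actually changes caps it at roughly 100 writes per file. A sketch, where make_progress_reporter is a hypothetical helper:

```python
import sys

def make_progress_reporter(total):
    """Return a callable that writes '<pct>% ' only when the percentage changes."""
    state = {"pct": -1}
    def report(done):
        pct = done * 100 // total       # integer percentage, 0..100
        if pct != state["pct"]:
            state["pct"] = pct
            sys.stdout.write("%d%% \r" % pct)
            sys.stdout.flush()
        return pct
    return report

report = make_progress_reporter(200)
```

The same throttling can be applied in match_generator by comparing the current percentage to the last one printed.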
Could you suggest any optimisations?
Thanks,
Ron.
P.S.: Examples of processing times are:
* 2m42.782s on two files with combined size of 792544 bytes (no matches found).
* 28m39.497s on two files with combined size of 4139320 bytes (783 matches found).
These times are quite unacceptable, as a normal input to the program would be ten files with combined size of ~17MB.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: _failover_multiple_files_client.py
Type: application/octet-stream
Size: 7771 bytes
Desc: _failover_multiple_files_client.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20090113/692ab02b/attachment.obj>