<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE></TITLE>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.3492" name=GENERATOR></HEAD>
<BODY><!-- Converted from text/plain format -->
<P><FONT size=2><FONT face=Arial><FONT color=#0000ff>Hi
Terry,<BR></FONT><BR>-----Original Message-----<BR>From: Terry Reedy [</FONT><A
href="mailto:tjreedy@udel.edu"><FONT
face=Arial>mailto:tjreedy@udel.edu</FONT></A><FONT face=Arial>]<BR>Sent:
Wednesday, January 14, 2009 01:57<BR>To: python-list@python.org<BR>Subject: Re:
Could you suggest optimisations ?<BR><BR>Barak, Ron wrote:<BR>>
Hi,<BR>><BR>> In the attached script, the longest time is spent in the
following<BR>> functions (verified by psyco log):<BR><BR>I cannot help but
wonder why and if you really need all the rigmarole with file pointers,
offsets, and tells instead of<BR><BR>for line in open(...):<BR>&nbsp;&nbsp;&nbsp;&nbsp;do
your processing.</FONT></FONT></P>
<P><FONT size=2><FONT face=Arial><FONT color=#0000ff>I'm building a database of
the events found in the logs (the records between the first and last regexes in
regex_array).<BR>The user should then be able to navigate among these events
(among other functionality).<BR>This is why I need the tells and offsets: they
tell me where in the logs each event starts and ends.</FONT></FONT></FONT></P>
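<P><FONT face=Arial size=2>A minimal sketch, assuming Python 3 (the quoted
script is Python 2) and hypothetical helper names that are not part of the
original script (index_events, read_event): the offsets needed for navigation
can be kept while still iterating over the file line by line, by adding up line
lengths instead of calling tell() for every line. The file is opened in binary
mode so that the running count is an exact byte offset, which in turn assumes
the regexes are compiled from bytes patterns.</FONT></P>
<PRE>
```python
import re


def index_events(path, start_regex, end_regex):
    """Return a list of (start_offset, end_offset) byte ranges, one per event.

    An event runs from a line matching start_regex through the next later
    line matching end_regex (inclusive). Both regexes must be bytes patterns,
    because the file is read in binary mode to keep the offsets exact.
    """
    events = []
    offset = 0           # byte offset of the line currently being examined
    event_start = None   # offset where the open event began, if any
    with open(path, "rb") as log:
        for line in log:                      # single pass, no tell() calls
            if event_start is None:
                if start_regex.match(line):
                    event_start = offset
            elif end_regex.match(line):
                events.append((event_start, offset + len(line)))
                event_start = None
            offset += len(line)
    return events


def read_event(path, start, end):
    """Fetch one event's raw bytes again by seeking to its stored offsets."""
    with open(path, "rb") as log:
        log.seek(start)
        return log.read(end - start)
```
</PRE>
<P><FONT face=Arial size=2>Each stored pair can later be passed to read_event
to jump straight to an event. Whether this is actually faster than the
tell()-based version on the real logs would need to be measured.</FONT></P>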
<P><FONT face=Arial color=#0000ff size=2>Bye,<BR>Ron.</FONT><FONT
face=Arial><BR><BR><FONT size=2>><BR>> def
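match_generator(self,regex):<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;"""<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;Generate the next line of self.input_file that<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;matches regex.<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;"""<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;generator_ = self.line_generator()<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;while True:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.file_pointer = self.input_file.tell()<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if self.file_pointer != 0:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.file_pointer -= 1<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if (self.file_pointer + 2) >= self.last_line_offset:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;break<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;line_ = generator_.next()<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print "%.2f%% \r" % (((self.last_line_offset -<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if not line_:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;break<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;match_ = regex.match(line_)<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;groups_ = re.findall(regex,line_)<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if match_:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;yield line_.strip("\n"), groups_<BR>
> <BR>
>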
def
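get_matching_records_by_regex_extremes(self,regex_array):<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;"""<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;Function will:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;Find the record matching the first item of regex_array.<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;Will save all records until the last item of regex_array.<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;Will save the last line.<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;Will remember the position of the beginning of the next line in<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;self.input_file.<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;"""<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;start_regex = regex_array[0]<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;end_regex = regex_array[len(regex_array) - 1]<BR>
> <BR>
> &nbsp;&nbsp;&nbsp;&nbsp;all_recs = []<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;generator_ = self.match_generator<BR>
> <BR>
> &nbsp;&nbsp;&nbsp;&nbsp;try:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;match_start,groups_ = generator_(start_regex).next()<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;except StopIteration:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return(None)<BR>
> <BR>
> &nbsp;&nbsp;&nbsp;&nbsp;if match_start != None:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;all_recs.append([match_start,groups_])<BR>
> <BR>
> &nbsp;&nbsp;&nbsp;&nbsp;line_ = self.line_generator().next()<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;while line_:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;match_ = end_regex.match(line_)<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;groups_ = re.findall(end_regex,line_)<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if match_ != None:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;all_recs.append([line_,groups_])<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return(all_recs)<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;all_recs.append([line_,[]])<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;line_ = self.line_generator().next()<BR>
> <BR>
> def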
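line_generator(self):<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;"""<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;Generate the next line of self.input_file, and update<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;self.file_pointer to the beginning of that line.<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;"""<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;while self.input_file.tell() &lt;= self.last_line_offset:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;self.file_pointer = self.input_file.tell()<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;line_ = self.input_file.readline()<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if not line_:<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;break<BR>
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;yield line_.strip("\n")<BR>
><BR>
> I was trying to think of optimisations,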
so I could cut down on<BR>> processing time, but got no inspiration.<BR>>
(I need the "print "%.2f%% \r" ..." line for user's
feedback).<BR>><BR>> Could you suggest any optimisations ?<BR>>
Thanks,<BR>> Ron.<BR>> <BR>> <BR>> P.S.: Examples of
processing times
are:<BR>><BR>> *
2m42.782s on two files with combined size of 792544
bytes<BR>> (no
matches found).<BR>> *
28m39.497s on two files with combined size of 4139320
bytes<BR>> (783
matches found).<BR>><BR>> These times are quite
unacceptable, as a normal input to the program<BR>>
would be ten files with combined size of ~17MB.</FONT></FONT></P></BODY></HTML>