[Tutor] Speeding up file processing?
A.M. Kuchling
amk at amk.ca
Tue Nov 11 21:29:46 EST 2003
On Tue, Nov 11, 2003 at 06:58:45PM -0500, Clay Shirky wrote:
> for line in f:
>     if re.match('^shirky.com', line): # find hits from my site
>         fields = line.split()
>         try: referer = fields[11] # grab the referer
>         except: continue # continue if there is a mangled line
>         referer = re.sub('"', '', referer)
>         if re.search("shirky", referer): continue # ignore internal links
>         if re.search("-", referer): continue # ...and email clicks
>         referer = re.sub("www.", "", referer)
>         print referer
While this code is using the re module, it's not doing anything that can't
be done with string operations; all of the things being searched for are
fixed strings such as 'shirky.com', not patterns such as \w+[.](com|net).
An untested rewrite:
for line in f:
    if line.startswith('shirky.com'):
        fields = line.split()
        try: referer = fields[11] # grab the referer
        except IndexError: continue # continue if there is a mangled line
        referer = referer.replace('"', '')
        if referer.find("shirky") != -1: continue # ignore internal links
        if '-' in referer: continue # ...and email clicks
        referer = referer.replace('www.', "")
        print referer
In Python 2.3, referer.find("shirky") != -1 can be replaced with the more
readable "if 'shirky' in referer: ...".
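
To check whether dropping the re module actually helps, the two versions can
be timed against each other. What follows is only a rough sketch, not a
benchmark of the real log: the sample line, the function names referer_re and
referer_str, and the iteration count are all made up for illustration, and the
code is written for a current Python (print as a function, timeit accepting a
callable) rather than 2.3. It assumes the log puts the virtual host first, so
that the referer is field 11 as in the original script.

```python
import re
import timeit

# Hypothetical log line: virtual host first, referer as the 12th field.
LINE = ('shirky.com 10.0.0.1 - - [11/Nov/2003:18:58:45 -0500] '
        '"GET / HTTP/1.0" 200 1234 "http://www.example.com/page" "Mozilla/4.0"')


def referer_re(line):
    """Regex-based version, following the quoted code."""
    if not re.match('^shirky.com', line):
        return None
    fields = line.split()
    try:
        referer = fields[11]
    except IndexError:
        return None                      # mangled line
    referer = re.sub('"', '', referer)
    if re.search("shirky", referer) or re.search("-", referer):
        return None                      # internal link or email click
    return re.sub("www.", "", referer)


def referer_str(line):
    """String-method version, following the rewrite."""
    if not line.startswith('shirky.com'):
        return None
    fields = line.split()
    try:
        referer = fields[11]
    except IndexError:
        return None                      # mangled line
    referer = referer.replace('"', '')
    if 'shirky' in referer or '-' in referer:
        return None                      # internal link or email click
    return referer.replace('www.', '')


if __name__ == '__main__':
    for func in (referer_re, referer_str):
        # Time repeated calls on the same sample line.
        seconds = timeit.timeit(lambda: func(LINE), number=20000)
        print('%-12s %.3f s' % (func.__name__, seconds))
```

The absolute numbers will depend on the machine and the Python version, so
only the ratio between the two lines of output is worth looking at.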
Perl bypasses the C stdio functions for the sake of speed and works directly
with the internals of the FILE structure, while Python sticks to strict ANSI
C. This costs Python some speed, but I don't know how much; you could try
running the two loops with an empty body ('pass' in Python) to compare the
I/O overhead.
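
For the Python half of that comparison, something along these lines might do.
It is a sketch under assumptions: the function name time_empty_loop is
invented here, the log path is taken from the command line, and the code
targets a current Python (time.perf_counter is not available in 2.3). The
file is opened in binary mode so that only the line reading is measured, not
text decoding.

```python
import sys
import time


def time_empty_loop(path):
    """Return the seconds spent reading every line of path and doing nothing."""
    start = time.perf_counter()
    f = open(path, 'rb')        # binary mode: no decoding, just raw line reads
    try:
        for line in f:
            pass                # empty body, so only the I/O is timed
    finally:
        f.close()
    return time.perf_counter() - start


if __name__ == '__main__' and len(sys.argv) > 1:
    # Usage: python emptyloop.py /path/to/logfile   (script name is hypothetical)
    print('%.3f s' % time_empty_loop(sys.argv[1]))
```

Running it twice is probably wise, since the first pass may include the cost
of pulling the file from disk into the operating system's cache. The matching
Perl loop would need to be timed the same way for the comparison to mean
anything.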
--amk