[Tutor] Speeding up file processing?

Tue Nov 11 21:29:46 EST 2003

On Tue, Nov 11, 2003 at 06:58:45PM -0500, Clay Shirky wrote:
> for line in f:
>     if re.match('^shirky.com', line): # find hits from my site
>         fields = line.split()
>         try: referer = fields[11] # grab the referer
>         except: continue          # continue if there is a mangled line
>         referer = re.sub('"', '', referer)
>         if re.search("shirky", referer): continue # ignore internal links
>         if re.search("-", referer):      continue # ...and email clicks
>         referer = re.sub("www.", "", referer)
>         print referer

While this code is using the re module, it's not doing anything that can't
be done with string operations; all of the things being searched for are
fixed strings such as 'shirky.com', not patterns such as \w+[.](com|net). 
An untested rewrite:

for line in f:
    if line.startswith('shirky.com'):
        fields = line.split()
        try: referer = fields[11] # grab the referer
        except IndexError: continue  # skip mangled lines with too few fields
        referer = referer.replace('"', '')
        if referer.find("shirky") != -1: continue # ignore internal links
        if '-' in referer:      continue # ...and email clicks
        referer = referer.replace('www.', "")
        print referer

In Python 2.3, the referer.find("shirky") test can be replaced with the more
readable "if 'shirky' in referer: ...".
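
To make the string-only version easier to try out, here is a rough sketch
that wraps the same steps in a function.  The name extract_referer and the
SAMPLE log line are made up for illustration; the real log layout may
differ, and the only assumption carried over from the original loop is that
the referer is the twelfth whitespace-separated field.

```python
def extract_referer(line):
    """Return the cleaned external referer for a shirky.com hit, or None.

    Hypothetical helper: it mirrors the loop body, but returns None where
    the loop would 'continue'.
    """
    if not line.startswith('shirky.com'):
        return None                      # not a hit on the site
    fields = line.split()
    try:
        referer = fields[11]             # grab the referer
    except IndexError:
        return None                      # mangled line, too few fields
    referer = referer.replace('"', '')
    if 'shirky' in referer:
        return None                      # ignore internal links
    if '-' in referer:
        return None                      # ...and email clicks
    return referer.replace('www.', '')


# Invented sample line; only the field positions matter here.
SAMPLE = ('shirky.com 1.2.3.4 - - [11/Nov/2003:18:58:45 -0500] '
          '"GET /x HTTP/1.0" 200 1234 "http://www.example.com/page" '
          '"Mozilla/4.0"')
```

With that sample, extract_referer(SAMPLE) gives 'http://example.com/page'.
As in the original loop, the '-' test also throws away any referer URL that
happens to contain a hyphen.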

Perl bypasses the C stdio functions for the sake of speed and works with
the internals of the FILE structure directly, while Python sticks to strict
ANSI C.  This costs Python some speed, but I don't know how much; you could
run the two loops with just a 'pass' in the body to compare the I/O
overhead.
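
For the Python half of that comparison, something along these lines might
do.  It is only a sketch: time_empty_loop and write_sample_log are
hypothetical helper names, and the throwaway file stands in for the real
access log, which you would pass in instead.

```python
import os
import tempfile
import time


def time_empty_loop(path):
    """Return the seconds spent iterating over the file with an empty body."""
    start = time.time()
    f = open(path)
    try:
        for line in f:
            pass  # no per-line work, so the time is mostly reading lines
    finally:
        f.close()
    return time.time() - start


def write_sample_log(line_count):
    """Write a throwaway file of identical lines and return its path."""
    fd, path = tempfile.mkstemp(suffix='.log')
    f = os.fdopen(fd, 'w')
    try:
        for _ in range(line_count):
            f.write('shirky.com 1.2.3.4 - - placeholder log line\n')
    finally:
        f.close()
    return path
```

Subtracting this figure from the time taken by the full loop gives a rough
idea of how much is spent in the loop body rather than in reading the file.
Run it more than once, since the first pass may include disk reads that
later passes get from the OS cache.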

--amk