search speed

Tim Chase tim at thechases.com
Fri Jan 30 23:12:03 EST 2009


> I have written a Python program that searches for a specific customer in
> files (around 1000 files)
> the trigger is LF01 + CUSTOMERNO

While most of the solutions folks have offered involve scanning 
all the files each time you search, if the content of those files 
doesn't change much, you can build an index once and then query 
the resulting index multiple times.  Because I was bored, I threw 
together the code below (after the "-------" divider) which does 
what you detail as best I understand, allowing you to do

   python tkc.py 31415

to find the files containing CUSTOMERNO=31415.  The first time, 
it's slow because it needs to create the index file.  However, 
subsequent runs should be pretty speedy.  You can also specify 
multiple customers on the command-line:

   python tkc.py 31415 1414 77777

and it will search for each of them.  Based on your description, I 
presume that customers are found by the regexp "LF01(\d+)" and that 
the files can be sensibly broken into lines.  The code allows for 
multiple results on the same line.  Adjust accordingly if that's 
not the pattern you want or the conditions you expect.
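
To make the matching concrete, here is a small sketch of what that regexp picks up from a single line.  The sample line is made up (I don't know what your records really look like), and the per-file de-duplication is modelled with a plain set:

```python
import re

# Same pattern the script uses: "LF01" followed by the customer digits.
customer_re = re.compile(r"LF01(\d+)")

# Hypothetical sample line; real records may look different.
line = "xxLF0131415yy LF011414 zzLF0131415"

# findall() returns every captured group on the line, duplicates included.
matches = customer_re.findall(line)

# Collapse repeats so each customer is recorded only once.
unique = sorted(set(matches))

print(matches)
print(unique)
```

Note that "\d+" is greedy: if other digits directly follow the customer number in your files, the match will swallow them too, and a fixed-width pattern may be the better choice.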

If your source files change, you can reinitialize the database with

   python tkc.py -i
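
The build-once, query-many idea behind that reindexing can be sketched roughly as follows.  This is only an illustration under a few assumptions: the helper names add_to_index() and lookup() and the sample filenames are invented here, and it is written for a current Python where dbm hands values back as bytes (hence the decode() calls):

```python
import dbm
import os
import tempfile

def add_to_index(db, customer, filename):
    """Append filename to the newline-separated list stored for customer."""
    try:
        existing = db[customer].decode()   # dbm returns bytes
        db[customer] = existing + "\n" + filename
    except KeyError:
        db[customer] = filename

def lookup(db, customer):
    """Return the list of files recorded for customer (empty if none)."""
    try:
        return db[customer].decode().splitlines()
    except KeyError:
        return []

# Hypothetical index location and data, standing in for a real file scan.
index_path = os.path.join(tempfile.mkdtemp(), "index")

db = dbm.open(index_path, "n")   # "n": always start from an empty database
add_to_index(db, "31415", "./a.txt")
add_to_index(db, "31415", "./b.txt")
add_to_index(db, "1414", "./b.txt")
db.close()

db = dbm.open(index_path, "r")   # "r": read-only, for the fast lookups
found = lookup(db, "31415")
missing = lookup(db, "77777")
db.close()

print(found, missing)
```

The slow part (filling the database) happens only when the "n" open is used; every later query is a single keyed lookup.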

You can also change the glob pattern used for indexing -- by 
default, I assumed the files match "*.txt".  You can either 
override the default with

   python tkc.py -i -p "*.dat"

or you can change the source to default differently (or even skip 
the glob-check completely...look for the fnmatch() call).  There 
are a few more options.  Just use

   python tkc.py --help

as usual.  It's also a simple demo of the optparse module if 
you've never used it.
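
If you want to poke at optparse in isolation, a trimmed-down sketch along these lines may help.  It reuses three of the script's options and passes an explicit argument list to parse_args() rather than reading the real command line; the argument values are just examples:

```python
from optparse import OptionParser

parser = OptionParser(usage="%prog [options] [cust#1 [cust#2 ... ]]")
parser.add_option("-i", "--index", action="store_true",
                  dest="reindex", default=False)
parser.add_option("-p", "--pattern", action="store",
                  dest="pattern", default="*.txt")
parser.add_option("-v", "--verbose", action="count",
                  dest="verbose", default=0)

# An explicit list instead of sys.argv makes it easy to experiment.
opt, args = parser.parse_args(["-i", "-p", "*.dat", "-vv", "31415", "1414"])

# Options land on opt; anything left over (the customer numbers) in args.
print(opt.reindex, opt.pattern, opt.verbose, args)
```

The "count" action is what lets repeated -v flags raise the verbosity level.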

Enjoy!

-tkc

PS:  as an aside, how do I import just the fnmatch function?  I 
tried both of the following and neither worked:

   from glob.fnmatch import fnmatch
   from glob import fnmatch.fnmatch

I finally resorted to the contortion coded below in favor of

   import glob
   fnmatch = glob.fnmatch.fnmatch
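
For what it's worth, fnmatch is also a top-level module in the standard library (glob is a plain module that merely imports it, not a package), so a direct import may be all that's needed.  A tiny sketch with made-up filenames, not what the script below does:

```python
# Import the function straight from the standard-library fnmatch module.
from fnmatch import fnmatch

# Hypothetical filenames checked against the script's default pattern.
matches_txt = fnmatch("customers.txt", "*.txt")
matches_dat = fnmatch("customers.dat", "*.txt")

print(matches_txt, matches_dat)
```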

-----------------------------------------------------------------


#!/usr/bin/env python
import dbm
import os
import re
from glob import fnmatch
fnmatch = fnmatch.fnmatch
from optparse import OptionParser

customer_re = re.compile(r"LF01(\d+)")

def build_parser():
   parser = OptionParser(
     usage="%prog [options] [cust#1 [cust#2 ... ]]"
     )
   parser.add_option("-i", "--index", "--reindex",
     action="store_true",
     dest="reindex",
     default=False,
     help="Reindex files found in the current directory "
       "in the event any files have changed",
     )
   parser.add_option("-p", "--pattern",
     action="store",
     dest="pattern",
     default="*.txt",
     metavar="GLOB_PATTERN",
     help="Index files matching GLOB_PATTERN",
     )
   parser.add_option("-d", "--db", "--database",
     action="store",
     dest="indexfile",
     default=".index",
     metavar="FILE",
     help="Use the index stored at FILE",
     )
   parser.add_option("-v", "--verbose",
     action="count",
     dest="verbose",
     default=0,
     help="Increase verbosity"
     )
   return parser

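def reindex(options, db):
   """Walk the current directory and map each customer number to a
   newline-separated list of the files that mention it."""
   if options.verbose: print "Indexing..."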
   for path, dirs, files in os.walk('.'):
     for fname in files:
       if fname in (options.indexfile, options.indexfile + ".db"):
         # ignore our database file (dbm may append ".db")
         continue
       if not fnmatch(fname, options.pattern):
         # ensure that it matches our pattern
         continue
       fullname = os.path.join(path, fname)
       if options.verbose: print fullname
       f = open(fullname)
       found_so_far = set()
       for line in f:
         for customer_number in customer_re.findall(line):
           if customer_number in found_so_far: continue
           found_so_far.add(customer_number)
           try:
             val = '\n'.join([
               db[customer_number],
               fullname,
               ])
             if options.verbose > 1:
               print "Appending %s" % customer_number
           except KeyError:
             if options.verbose > 1:
               print "Creating %s" % customer_number
             val = fullname
           db[customer_number] = val
       f.close()

if __name__ == "__main__":
   parser = build_parser()
   opt, args = parser.parse_args()
   reindexed = False
   # NOTE: assumes the dbm backend appends ".db" to the index filename
   if opt.reindex or not os.path.exists("%s.db" % opt.indexfile):
     db = dbm.open(opt.indexfile, 'n')
     reindex(opt, db)
     reindexed = True
   else:
     db = dbm.open(opt.indexfile, 'r')
   if not (args or reindexed):
     parser.print_help()
   for arg in args:
     print "%s:" % arg,
     try:
       val = db[arg]
       print
       for item in val.splitlines():
         print " %s" % item
     except KeyError:
       print "Not found"
   db.close()




