Generator Expressions and CSV
Zaki
zaki.rahaman at gmail.com
Fri Jul 17 18:39:33 EDT 2009
On Jul 17, 5:31 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> Zaki wrote:
> > On Jul 17, 2:49 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> >> Zaki wrote:
> >>> Hey all,
> >>> I'm really new to Python and this may seem like a really dumb
> >>> question, but basically I wrote a script to do the following, and
> >>> the processing time/memory usage is worse than I'd like. Any
> >>> suggestions?
> >>> Outline:
> >>> 1. Read tab-delimited files from a directory; the files are of 3
> >>> types: install, update, and q. All 3 types contain ID values, which
> >>> are the only part of interest.
> >>> 2. Using set() and set.add(), generate a set of unique IDs from the
> >>> install and update files.
> >>> 3. Using the set created in (2), check the q files for matching
> >>> IDs. Keep all matches, and add any non-matches (which occur only
> >>> once in the q file) to a queue of lines to be removed from the q
> >>> files.
> >>> 4. Remove the queued lines from each q file. (I haven't quite
> >>> written the code for this, but I was going to implement it using
> >>> csv.writer, rewriting all the lines in the file except for the ones
> >>> in the removal queue.)
> >>> Now, I've tried running this and it takes much longer than I'd like. I
> >>> was wondering if there might be a better way to do things (I thought
> >>> generator expressions might be a good way to attack this problem, as
> >>> you could generate the set, and then check to see if there's a match,
> >>> and write each line that way).
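> >>> Something like this is what I had in mind (untested; assumes csv is
> >>> imported, 'infile' and 'qfile' are open file objects, and 'writer'
> >>> is a csv.writer for the output):
> >>>
> >>> # Build the set of valid IDs, then lazily filter the q rows.
> >>> IDs = set(row[2] for row in csv.reader(infile, delimiter='\t'))
> >>> keep = (row for row in csv.reader(qfile, delimiter='\t')
> >>>         if row[2] in IDs)
> >>> for row in keep:
> >>>     writer.writerow(row)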
> >> Why are you checking and removing lines in 2 steps? Why not copy the
> >> matching lines to a new q file and then replace the old file with the
> >> new one (or, maybe, delete the new q file if no lines were removed)?
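> >>
> >> Something along these lines (untested):
> >>
> >> import csv, os
> >>
> >> def rewrite_q_file(path, IDs):
> >>     # Copy only the matching rows to a temporary file, then swap it
> >>     # in for the original (on Windows, remove the original first,
> >>     # since os.rename won't overwrite an existing file there).
> >>     infile = open(path, 'rb')
> >>     outfile = open(path + '.tmp', 'wb')
> >>     writer = csv.writer(outfile, delimiter='\t')
> >>     for row in csv.reader(infile, delimiter='\t'):
> >>         if row[2] in IDs:
> >>             writer.writerow(row)
> >>     infile.close()
> >>     outfile.close()
> >>     os.rename(path + '.tmp', path)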
>
> > That's what I've done now.
>
> > Here is the final code that I have running. It's very much 'hack'-type
> > code, not at all efficient or optimized, and any help in optimizing it
> > would be greatly appreciated.
>
> > import csv
> > import sys
> > import os
> > import time
>
> > begin = time.time()
>
> > #Check minutes elapsed
> > def timeElapsed():
> >     current = time.time()
> >     elapsed = current - begin
> >     return round(elapsed/60)
>
> > #USAGE: python logcleaner.py <input_dir> <output_dir>
>
> > inputdir = sys.argv[1]
> > outputdir = sys.argv[2]
>
> > logfilenames = os.listdir(inputdir)
>
> > IDs = set() #IDs from update and install logs
> > foundOnceInQuery = set()
> > #foundTwiceInQuery = set()
> > #IDremovalQ = set()  #Note: unnecessary, duplicate of foundOnceInQuery;
> > #a queue of IDs to remove from the query logs (IDs found only once in
> > #the query logs)
>
> > #Generate Filename Queues For Install/Update Logs, Query Logs
> > iNuQ = []
> > queryQ = []
>
> > for filename in logfilenames:
> >     if filename.startswith("par1.install") or filename.startswith("par1.update"):
>
>     if filename.startswith(("par1.install", "par1.update")):
>
> >         iNuQ.append(filename)
> >     elif filename.startswith("par1.query"):
> >         queryQ.append(filename)
>
> > totalfiles = len(iNuQ) + len(queryQ)
> > print "Total # of Files to be Processed:" , totalfiles
> > print "Install/Update Logs to be processed:" , len(iNuQ)
> > print "Query logs to be processed:" , len(queryQ)
>
> > #Process install/update queue to generate list of valid IDs
> > currentfile = 1
> > for file in iNuQ:
> >     print "Processing", currentfile, "install/update log out of", len(iNuQ)
> >     print timeElapsed()
> >     reader = csv.reader(open(inputdir+file), delimiter = '\t')
> >     for row in reader:
> >         IDs.add(row[2])
> >     currentfile += 1
>
> Best not to call it 'file'; that's a built-in name.
>
> Also you could use 'enumerate', and joining filepaths is safer with
> os.path.join().
>
> for currentfile, filename in enumerate(iNuQ, start=1):
>     print "Processing", currentfile, "install/update log out of", len(iNuQ)
>     print timeElapsed()
>     current_path = os.path.join(inputdir, filename)
>     reader = csv.reader(open(current_path), delimiter = '\t')
>     for row in reader:
>         IDs.add(row[2])
>
> > print "Finished processing install/update logs"
> > print "Unique IDs found:" , len(IDs)
> > print "Total Time Elapsed:", timeElapsed()
>
> > currentfile = 1
> > for file in queryQ:
>
> Similar remarks to above ...
>
> > print "Processing", currentfile, "query log out of", len(queryQ)
> > print timeElapsed()
> > reader = csv.reader(open(inputdir+file), delimiter = '\t')
> > outputfile = csv.writer(open(outputdir+file), 'w')
>
> ... and also here.
>
> >     for row in reader:
> >         if row[2] in IDs:
> >             ouputfile.writerow(row)
>
> Should be 'outputfile'.
>
> >         else:
> >             if row[2] in foundOnceInQuery:
> >                 foundOnceInQuery.remove(row[2])
>
> You're removing the ID here ...
>
> >                 outputfile.writerow(row)
> >                 #IDremovalQ.remove(row[2])
> >                 #foundTwiceInQuery.add(row[2])
>
> >             else:
> >                 foundOnceInQuery.add(row[2])
>
> ... and adding it again here!
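>
> Using the foundTwiceInQuery set you have commented out would avoid
> that flip-flopping (untested; note this still drops the *first*
> occurrence of a repeated non-matching ID, so you'd need a second pass
> if you want to keep it):
>
>     if row[2] in IDs or row[2] in foundTwiceInQuery:
>         outputfile.writerow(row)
>     elif row[2] in foundOnceInQuery:
>         # Second occurrence: promote the ID and keep this row.
>         foundOnceInQuery.remove(row[2])
>         foundTwiceInQuery.add(row[2])
>         outputfile.writerow(row)
>     else:
>         # First occurrence: just remember it for now.
>         foundOnceInQuery.add(row[2])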
>
> >                 #IDremovalQ.add(row[2])
>
> >     currentfile += 1
>
> For safety you should close the files after use.
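>
> For example (with 'input_path' and 'output_path' standing in for the
> joined paths):
>
>     infile = open(input_path, 'rb')
>     outfile = open(output_path, 'wb')
>     try:
>         outputfile = csv.writer(outfile, delimiter='\t')
>         for row in csv.reader(infile, delimiter='\t'):
>             outputfile.writerow(row)  # filtering logic as above
>     finally:
>         # Close both files even if something above raises.
>         infile.close()
>         outfile.close()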
>
> > print "Finished processing query logs and writing new files"
> > print "# of Query log entries removed:" , len(foundOnceInQuery)
> > print "Total Time Elapsed:", timeElapsed()
>
> Apart from that, it looks OK.
>
> How big are the q files? If they're not too big and most of the time
> you're not removing rows, you could put the output rows into a list and
> then create the output file only if rows have been removed, otherwise
> just copy the input file, which might be faster.
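> Roughly (untested, and ignoring the seen-once bookkeeping for
> brevity):
>
>     import shutil
>
>     kept = []
>     removed = 0
>     infile = open(input_path, 'rb')
>     for row in csv.reader(infile, delimiter='\t'):
>         if row[2] in IDs:
>             kept.append(row)
>         else:
>             removed += 1
>     infile.close()
>     if removed:
>         # Only bother writing a new file if something was dropped.
>         outfile = open(output_path, 'wb')
>         csv.writer(outfile, delimiter='\t').writerows(kept)
>         outfile.close()
>     else:
>         # Nothing removed: a straight copy is cheaper.
>         shutil.copyfile(input_path, output_path)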
MRAB, could you please repost what I sent to you here, as I meant to
post it in the main discussion?