Generator Expressions and CSV
Zaki
zaki.rahaman at gmail.com
Fri Jul 17 18:39:33 EDT 2009
On Jul 17, 5:31 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> Zaki wrote:
> > On Jul 17, 2:49 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> >> Zaki wrote:
> >>> Hey all,
> >>> I'm really new to Python and this may seem like a really dumb
> >>> question, but basically I wrote a script to do the following, and
> >>> the processing time/memory usage is worse than I'd like. Any
> >>> suggestions?
> >>> Outline:
> >>> 1. Read tab-delimited files from a directory; the files are of 3
> >>> types: install, update, and q. All 3 types contain ID values, which
> >>> are the only part of interest.
> >>> 2. Using set() and set.add(), generate a set of unique IDs from the
> >>> install and update files.
> >>> 3. Using the set created in (2), check the q files for matching
> >>> IDs. Keep all matches, and add any non-matches (which occur only
> >>> once in the q file) to a queue of lines to be removed from the q
> >>> files.
> >>> 4. Remove the queued lines from each q file. (I haven't quite
> >>> written the code for this, but I was going to implement it using
> >>> csv.writer, rewriting all the lines in the file except for the ones
> >>> in the removal queue.)
> >>> Now, I've tried running this and it takes much longer than I'd like. I
> >>> was wondering if there might be a better way to do things (I thought
> >>> generator expressions might be a good way to attack this problem, as
> >>> you could generate the set, and then check to see if there's a match,
> >>> and write each line that way).
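> >>> Something like this is what I had in mind (untested; assumes csv is
> >>> imported, 'infile' and 'qfile' are open file objects, and 'writer'
> >>> is a csv.writer for the output):
> >>>
> >>> # Build the set of valid IDs, then lazily filter the q rows.
> >>> IDs = set(row[2] for row in csv.reader(infile, delimiter='\t'))
> >>> keep = (row for row in csv.reader(qfile, delimiter='\t')
> >>>         if row[2] in IDs)
> >>> for row in keep:
> >>>     writer.writerow(row)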
> >> Why are you checking and removing lines in 2 steps? Why not copy the
> >> matching lines to a new q file and then replace the old file with the
> >> new one (or, maybe, delete the new q file if no lines were removed)?
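> >>
> >> Something along these lines (untested):
> >>
> >> import csv, os
> >>
> >> def rewrite_q_file(path, IDs):
> >>     # Copy only the matching rows to a temporary file, then swap it
> >>     # in for the original (on Windows, remove the original first,
> >>     # since os.rename won't overwrite an existing file there).
> >>     infile = open(path, 'rb')
> >>     outfile = open(path + '.tmp', 'wb')
> >>     writer = csv.writer(outfile, delimiter='\t')
> >>     for row in csv.reader(infile, delimiter='\t'):
> >>         if row[2] in IDs:
> >>             writer.writerow(row)
> >>     infile.close()
> >>     outfile.close()
> >>     os.rename(path + '.tmp', path)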
>
> > That's what I've done now.
>
> > Here is the final code that I have running. It's very much 'hack'-type
> > code, not at all efficient or optimized, and any help in optimizing it
> > would be greatly appreciated.
>
> > import csv
> > import sys
> > import os
> > import time
>
> > begin = time.time()
>
> > #Check minutes elapsed
> > def timeElapsed():
> >     current = time.time()
> >     elapsed = current - begin
> >     return round(elapsed/60)
>
> > #USAGE: python logcleaner.py <input_dir> <output_dir>
>
> > inputdir = sys.argv[1]
> > outputdir = sys.argv[2]
>
> > logfilenames = os.listdir(inputdir)
>
> > IDs = set() #IDs from update and install logs
> > foundOnceInQuery = set()
> > #foundTwiceInQuery = set()
> > #IDremovalQ = set()  #Note: unnecessary, duplicate of foundOnceInQuery;
> > #a queue of IDs to remove from the query logs (IDs found only once in
> > #the query logs)
>
> > #Generate Filename Queues For Install/Update Logs, Query Logs
> > iNuQ = []
> > queryQ = []
>
> > for filename in logfilenames:
> >     if filename.startswith("par1.install") or filename.startswith("par1.update"):
>
>     if filename.startswith(("par1.install", "par1.update")):
>
> >         iNuQ.append(filename)
> >     elif filename.startswith("par1.query"):
> >         queryQ.append(filename)
>
> > totalfiles = len(iNuQ) + len(queryQ)
> > print "Total # of Files to be Processed:" , totalfiles
> > print "Install/Update Logs to be processed:" , len(iNuQ)
> > print "Query logs to be processed:" , len(queryQ)
>
> > #Process install/update queue to generate list of valid IDs
> > currentfile = 1
> > for file in iNuQ:
> >     print "Processing", currentfile, "install/update log out of", len(iNuQ)
> >     print timeElapsed()
> >     reader = csv.reader(open(inputdir+file), delimiter = '\t')
> >     for row in reader:
> >         IDs.add(row[2])
> >     currentfile += 1
>
> Best not to call it 'file'; that's a built-in name.
>
> Also you could use 'enumerate', and joining filepaths is safer with
> os.path.join().
>
> for currentfile, filename in enumerate(iNuQ, start=1):
>     print "Processing", currentfile, "install/update log out of", len(iNuQ)
>     print timeElapsed()
>     current_path = os.path.join(inputdir, filename)
>     reader = csv.reader(open(current_path), delimiter = '\t')
>     for row in reader:
>         IDs.add(row[2])
>
> > print "Finished processing install/update logs"
> > print "Unique IDs found:" , len(IDs)
> > print "Total Time Elapsed:", timeElapsed()
>
> > currentfile = 1
> > for file in queryQ:
>
> Similar remarks to above ...
>
> > print "Processing", currentfile, "query log out of", len(queryQ)
> > print timeElapsed()
> > reader = csv.reader(open(inputdir+file), delimiter = '\t')
> > outputfile = csv.writer(open(outputdir+file), 'w')
>
> ... and also here.
>
> >     for row in reader:
> >         if row[2] in IDs:
> >             ouputfile.writerow(row)
>
> Should be 'outputfile'.
>
> >         else:
> >             if row[2] in foundOnceInQuery:
> >                 foundOnceInQuery.remove(row[2])
>
> You're removing the ID here ...
>
> >                 outputfile.writerow(row)
> >                 #IDremovalQ.remove(row[2])
> >                 #foundTwiceInQuery.add(row[2])
>
> >             else:
> >                 foundOnceInQuery.add(row[2])
>
> ... and adding it again here!
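>
> Using the foundTwiceInQuery set you have commented out would avoid
> that flip-flopping (untested; note this still drops the *first*
> occurrence of a repeated non-matching ID, so you'd need a second pass
> if you want to keep it):
>
>     if row[2] in IDs or row[2] in foundTwiceInQuery:
>         outputfile.writerow(row)
>     elif row[2] in foundOnceInQuery:
>         # Second occurrence: promote the ID and keep this row.
>         foundOnceInQuery.remove(row[2])
>         foundTwiceInQuery.add(row[2])
>         outputfile.writerow(row)
>     else:
>         # First occurrence: just remember it for now.
>         foundOnceInQuery.add(row[2])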
>
> >                 #IDremovalQ.add(row[2])
>
> >     currentfile += 1
>
> For safety you should close the files after use.
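>
> For example (with 'input_path' and 'output_path' standing in for the
> joined paths):
>
>     infile = open(input_path, 'rb')
>     outfile = open(output_path, 'wb')
>     try:
>         outputfile = csv.writer(outfile, delimiter='\t')
>         for row in csv.reader(infile, delimiter='\t'):
>             outputfile.writerow(row)  # filtering logic as above
>     finally:
>         # Close both files even if something above raises.
>         infile.close()
>         outfile.close()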
>
> > print "Finished processing query logs and writing new files"
> > print "# of Query log entries removed:" , len(foundOnceInQuery)
> > print "Total Time Elapsed:", timeElapsed()
>
> Apart from that, it looks OK.
>
> How big are the q files? If they're not too big and most of the time
> you're not removing rows, you could put the output rows into a list and
> then create the output file only if rows have been removed, otherwise
> just copy the input file, which might be faster.
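> Roughly (untested, and ignoring the seen-once bookkeeping for
> brevity):
>
>     import shutil
>
>     kept = []
>     removed = 0
>     infile = open(input_path, 'rb')
>     for row in csv.reader(infile, delimiter='\t'):
>         if row[2] in IDs:
>             kept.append(row)
>         else:
>             removed += 1
>     infile.close()
>     if removed:
>         # Only bother writing a new file if something was dropped.
>         outfile = open(output_path, 'wb')
>         csv.writer(outfile, delimiter='\t').writerows(kept)
>         outfile.close()
>     else:
>         # Nothing removed: a straight copy is cheaper.
>         shutil.copyfile(input_path, output_path)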
MRAB, could you please repost what I sent to you here, as I meant to
post it in the main discussion?