Generator Expressions and CSV
MRAB
python at mrabarnett.plus.com
Fri Jul 17 17:31:30 EDT 2009
Zaki wrote:
> On Jul 17, 2:49 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
>> Zaki wrote:
>>> Hey all,
>>> I'm really new to Python and this may seem like a really dumb
>>> question, but basically, I wrote a script to do the following; however,
>>> the processing time/memory usage is not what I'd like it to be. Any
>>> suggestions?
>>> Outline:
>>> 1. Read tab delim files from a directory, files are of 3 types:
>>> install, update, and q. All 3 types contain ID values that are the
>>> only part of interest.
>>> 2. Using set() and set.add(), generate a list of unique IDs from
>>> install and update files.
>>> 3. Using the set created in (2), check the q files to see if there are
>>> matches for IDs. Keep all matches, and add any non-matches (which only
>>> occur once in the q file) to a queue of lines to be removed from the q
>>> files.
>>> 4. Remove the lines in the q for each file. (I haven't quite written
>>> the code for this, but I was going to implement this using csv.writer
>>> and rewriting all the lines in the file except for the ones in the
>>> removal queue).
>>> Now, I've tried running this and it takes much longer than I'd like. I
>>> was wondering if there might be a better way to do things (I thought
>>> generator expressions might be a good way to attack this problem, as
>>> you could generate the set, and then check to see if there's a match,
>>> and write each line that way).
>> Why are you checking and removing lines in 2 steps? Why not copy the
>> matching lines to a new q file and then replace the old file with the
>> new one (or, maybe, delete the new q file if no lines were removed)?
>
> That's what I've done now.
>
> Here is the final code that I have running. It's very much 'hack'-type
> code, not at all efficient or optimized, and any help in optimizing it
> would be greatly appreciated.
>
> import csv
> import sys
> import os
> import time
>
> begin = time.time()
>
> #Check minutes elapsed
> def timeElapsed():
>     current = time.time()
>     elapsed = current - begin
>     return round(elapsed/60)
>
>
> #USAGE: python logcleaner.py <input_dir> <output_dir>
>
> inputdir = sys.argv[1]
> outputdir = sys.argv[2]
>
> logfilenames = os.listdir(inputdir)
>
>
>
> IDs = set() #IDs from update and install logs
> foundOnceInQuery = set()
> #foundTwiceInQuery = set()
> #IDremovalQ = set() Note: Unnecessary, duplicate of foundOnceInQuery; Queue of IDs to remove from query logs (IDs found only once in query logs)
>
> #Generate Filename Queues For Install/Update Logs, Query Logs
> iNuQ = []
> queryQ = []
>
> for filename in logfilenames:
>     if filename.startswith("par1.install") or filename.startswith("par1.update"):
if filename.startswith(("par1.install", "par1.update")):
>         iNuQ.append(filename)
>     elif filename.startswith("par1.query"):
>         queryQ.append(filename)
>
> totalfiles = len(iNuQ) + len(queryQ)
> print "Total # of Files to be Processed:" , totalfiles
> print "Install/Update Logs to be processed:" , len(iNuQ)
> print "Query logs to be processed:" , len(queryQ)
>
> #Process install/update queue to generate list of valid IDs
> currentfile = 1
> for file in iNuQ:
> print "Processing", currentfile, "install/update log out of", len
> (iNuQ)
> print timeElapsed()
> reader = csv.reader(open(inputdir+file),delimiter = '\t')
> for row in reader:
> IDs.add(row[2])
> currentfile+=1
Best not to call it 'file'; that's a built-in name.
Also you could use 'enumerate', and joining filepaths is safer with
os.path.join().
for currentfile, filename in enumerate(iNuQ, start=1):
    print "Processing", currentfile, "install/update log out of", len(iNuQ)
    print timeElapsed()
    current_path = os.path.join(inputdir, filename)
    reader = csv.reader(open(current_path), delimiter='\t')
    for row in reader:
        IDs.add(row[2])
>
> print "Finished processing install/update logs"
> print "Unique IDs found:" , len(IDs)
> print "Total Time Elapsed:", timeElapsed()
>
> currentfile = 1
> for file in queryQ:
Similar remarks to above ...
> print "Processing", currentfile, "query log out of", len(queryQ)
> print timeElapsed()
> reader = csv.reader(open(inputdir+file), delimiter = '\t')
> outputfile = csv.writer(open(outputdir+file), 'w')
... and also here. Note too that the 'w' is misplaced: it's being passed to
csv.writer() as the dialect instead of to open() as the mode, so this line
will fail. It should be csv.writer(open(outputdir+file, 'w')).
>     for row in reader:
>         if row[2] in IDs:
>             ouputfile.writerow(row)
Should be 'outputfile'.
>         else:
>             if row[2] in foundOnceInQuery:
>                 foundOnceInQuery.remove(row[2])
You're removing the ID here ...
>                 outputfile.writerow(row)
>                 #IDremovalQ.remove(row[2])
>                 #foundTwiceInQuery.add(row[2])
>
>             else:
>                 foundOnceInQuery.add(row[2])
... and adding it again here!
>                 #IDremovalQ.add(row[2])
>
>
>     currentfile += 1
>
For safety you should close the files after use.
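
Putting those fixes together, the query loop could look something like this
(an untested sketch, keeping your one-pass found-once logic unchanged; I've
assumed you want the output tab-delimited too):

for currentfile, filename in enumerate(queryQ, start=1):
    print "Processing", currentfile, "query log out of", len(queryQ)
    print timeElapsed()
    infile = open(os.path.join(inputdir, filename))
    outfile = open(os.path.join(outputdir, filename), 'w')
    reader = csv.reader(infile, delimiter='\t')
    outputfile = csv.writer(outfile, delimiter='\t')
    for row in reader:
        if row[2] in IDs:
            outputfile.writerow(row)
        elif row[2] in foundOnceInQuery:
            # Seen once before: keep this occurrence.
            foundOnceInQuery.remove(row[2])
            outputfile.writerow(row)
        else:
            # First sighting of this ID: remember it. Note that this row
            # itself isn't written, so it's lost even if the ID recurs.
            foundOnceInQuery.add(row[2])
    infile.close()
    outfile.close()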
> print "Finished processing query logs and writing new files"
> print "# of Query log entries removed:" , len(foundOnceInQuery)
> print "Total Time Elapsed:", timeElapsed()
>
Apart from that, it looks OK.
How big are the q files? If they're not too big, and most of the time
you're not removing rows, you could collect the output rows in a list and
create the output file only if rows have actually been removed; otherwise
just copy the input file, which might be faster.
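
For example (a rough, untested sketch of that idea, ignoring the found-once
logic for clarity; in_path and out_path stand for the joined input and
output paths):

import shutil

infile = open(in_path)
rows = list(csv.reader(infile, delimiter='\t'))
infile.close()
kept = [row for row in rows if row[2] in IDs]
if len(kept) < len(rows):
    # Some rows were dropped, so write the filtered file.
    outfile = open(out_path, 'w')
    csv.writer(outfile, delimiter='\t').writerows(kept)
    outfile.close()
else:
    # Nothing was removed; a straight copy avoids re-writing the CSV.
    shutil.copyfile(in_path, out_path)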