Checking for unique fields: performance.
Shawn at Milochik.com
Fri Apr 18 17:23:08 CEST 2008
I'm looping through a tab-delimited file to gather statistics on fill rates,
lengths, and uniqueness.
For the uniqueness check, I made a dictionary whose keys correspond to the
field names. The values were originally lists, in which I stored the values
found in each field. As soon as I detected a duplicate, I deleted that
field's entry from the dictionary. Any fields that remain at the end are
considered unique. Also, if a value was empty, the field's entry was deleted
and that field was considered not unique.
A friend of mine suggested changing that dictionary of lists into a
dictionary of dictionaries, for performance reasons. As it turns out, the
speed increase was ridiculous -- a file which took 42 minutes to run dropped
down to six seconds.
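
A speedup of that size is plausible because "value in some_list" scans the
list element by element, while "value in some_dict" is a hash lookup whose
cost stays roughly constant as the container grows. The following is only a
rough sketch of that difference, written for Python 3; the container size and
the names (values, as_list, as_dict, probe) are made up for illustration and
are not taken from the original script.

```python
import timeit

# Hypothetical data: 20,000 distinct strings standing in for field values.
values = [str(i) for i in range(20000)]
as_list = list(values)
as_dict = dict.fromkeys(values, 1)  # same keys, dummy value 1

# The last element is the worst case for a list scan.
probe = values[-1]

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: probe in as_list, number=200)
dict_time = timeit.timeit(lambda: probe in as_dict, number=200)

print("list: %.6f s, dict: %.6f s" % (list_time, dict_time))
```

Since the list version repeats that scan for every value in every tracked
field, the total work grows roughly with the square of the number of rows,
which would fit the drop from minutes to seconds.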
Here is an excerpt of the code that checks for uniqueness. It's fully
functional, so I'm just looking for suggestions for improving it or any
other comments. Note that fieldNames is a list containing all column
#check for unique values
#if we are still tracking that field (we haven't yet
#found a duplicate value).
#if the current value is a duplicate
#sys.stderr.write("Field %s is not unique. Found a duplicate value after checking %d values.\n" % (fieldNames[index], lineNum))
#drop the whole hash element
#add the new value to the list
fieldUnique[fieldNames[index]][value] = 1
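
For comparison, here is a minimal sketch of the same idea using a set per
field instead of a dictionary with dummy values. It is not the original
script: the function name find_unique_fields, the sample data, and the use of
the csv module are assumptions made for illustration, and it targets
Python 3. It also assumes the field names themselves are distinct.

```python
import csv
import io

def find_unique_fields(rows, field_names):
    """Return the fields whose values are all non-empty and distinct."""
    # One set of seen values per field that is still a candidate.
    seen = {name: set() for name in field_names}
    for row in rows:
        for name, value in zip(field_names, row):
            tracked = seen.get(name)
            if tracked is None:
                continue  # field was already ruled out
            if value == "" or value in tracked:
                del seen[name]  # empty or duplicate: stop tracking this field
            else:
                tracked.add(value)
        if not seen:
            break  # no candidates left, no need to read further
    return [name for name in field_names if name in seen]

# Hypothetical tab-delimited input with a header row.
sample = "id\tname\tcity\n1\tAnn\tOslo\n2\tBob\tOslo\n3\t\tRome\n"
reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)
unique = find_unique_fields(reader, header)
print(unique)  # ['id']
```

A set keeps the constant-time membership test without storing a placeholder
value for each key, and the early break stops the loop once every field has
been ruled out.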