Find duplicates in a list/array and count them ...
MRAB
google at mrabarnett.plus.com
Fri Mar 27 15:44:10 EDT 2009
Paul.Scipione at aps.com wrote:
> Hello,
>
> I'm a newbie to Python. I wrote a Python script which connects to my
> Geodatabase (ESRI ArcGIS File Geodatabase), retrieves the records, then
> proceeds to evaluate which ones are duplicated. I do this using lists.
> Someone suggested I use arrays instead. Below is the content of my
> script. Anyone have any ideas on how an array can improve performance?
> Right now the script takes 2.5 minutes to run on a recordset of 79k+
> records:
>
> from __future__ import division
> import sys, string, os, arcgisscripting, time
> from time import localtime, strftime
>
> def writeMessage(myMsg):
>     print myMsg
--- delete start ---
>     global log
>     log = open(logFile, 'a')
--- delete end ---
Open the log file once, in the main part of the script, instead of
reopening it on every call.
>     log.write(myMsg + "\n")
>
> logFile = "c:\\temp\\" + str(strftime("%Y%m%d %H%M%S", localtime())) + ".log"
>
log = open(logFile, 'a')
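So with those two lines removed, the top of the script would look
something like this (just a sketch of your own code rearranged, with
the import it needs):

from time import localtime, strftime

def writeMessage(myMsg):
    # 'log' is the file object opened once, below
    print myMsg
    log.write(myMsg + "\n")

logFile = "c:\\temp\\" + str(strftime("%Y%m%d %H%M%S", localtime())) + ".log"
log = open(logFile, 'a')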
> writeMessage(' ')
> writeMessage(str(strftime("%H:%M:%S", localtime())) + ' begin unique values test')
>
> # Create the Geoprocessor object
> gp = arcgisscripting.create(9.3)
> oid_list = []
> dup_list = []
> tmp_list = []
> myWrkspc = "c:\\temp\\TVM Geodatabase GDIschema v6.0.2 PilotData.gdb"
> myFtrCls = "_\\Landbase\\T_GroundContour_"
>
> writeMessage(' ')
> writeMessage('gdb: ' + myWrkspc)
> writeMessage('ftr: ' + myFtrCls)
> writeMessage(' ')
> writeMessage(str(strftime("%H:%M:%S", localtime())) + ' retrieving recordset...')
>
> rows = gp.SearchCursor(myWrkspc + myFtrCls,"","","GDI_OID")
> row = rows.Next()
> writeMessage(' ')
> writeMessage(str(strftime("%H:%M:%S", localtime())) + ' processing recordset...')
--- delete start ---
> while row:
>     if row.GDI_OID in oid_list:
>         tmp_list.append(row.GDI_OID)
>     oid_list.append(row.GDI_OID)
>     row = rows.Next()
--- delete end ---
You're searching a list here, and 'in' on a list is a linear scan: the
longer the list becomes, the longer each membership test takes, which
adds up over 79k+ records. Build a dict of how many times you've seen
each item instead.
oid_count = {}
while row:
    oid_count[row.GDI_OID] = oid_count.get(row.GDI_OID, 0) + 1
    oid_list.append(row.GDI_OID)
    row = rows.Next()
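If you're on Python 2.5 or later, collections.defaultdict would do the
same counting without the .get() call; just a sketch of the same loop,
using the same names:

from collections import defaultdict

oid_count = defaultdict(int)    # missing keys start at 0
while row:
    oid_count[row.GDI_OID] += 1
    oid_list.append(row.GDI_OID)
    row = rows.Next()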
>
> writeMessage(' ')
> writeMessage(str(strftime("%H:%M:%S", localtime())) + ' generating statistics...')
>
--- delete start ---
> dup_count = len(tmp_list)
> tmp_list = list(set(tmp_list))
> tmp_list.sort()
--- delete end ---
You now want to know which ones occurred more than once.
tmp_list = [oid for oid, count in oid_count.iteritems() if count > 1]
tmp_list.sort()
dup_count = len(tmp_list)
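For example, with some made-up OIDs just to show what the comprehension
picks out:

oid_count = {1001: 1, 1002: 3, 1003: 1, 1004: 2}
# keep only the OIDs seen more than once...
tmp_list = [oid for oid, count in oid_count.iteritems() if count > 1]
tmp_list.sort()
# ...so tmp_list is [1002, 1004] and dup_count will be 2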
--- delete start ---
> for oid in tmp_list:
>     a = str(oid) + ' '
>     while len(a) < 20:
>         a = a + ' '
>     dup_list.append(a + '(' + str(oid_list.count(oid)) + ')')
--- delete end ---
And here oid_list.count(oid) scans the entire list _for every item in
tmp_list_; if there are 'n' duplicated OIDs then the whole 79k-record
list is scanned 'n' times. The number of times each item occurred is
already stored in oid_count, so just look it up.
for oid in tmp_list:
    a = str(oid) + ' '
    while len(a) < 20:
        a = a + ' '
    dup_list.append(a + '(' + str(oid_count[oid]) + ')')
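The padding loop could also be written with str.ljust; a sketch of the
same loop (ljust pads with spaces to a width of 20, which gives the
same output for any OID shorter than 20 characters):

for oid in tmp_list:
    # pad the OID out to 20 characters, then append the count
    dup_list.append(str(oid).ljust(20) + '(' + str(oid_count[oid]) + ')')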
>
> for dup in dup_list:
> writeMessage(dup)
>
> writeMessage(' ')
> writeMessage('records : ' + str(len(oid_list)))
> writeMessage('duplicates : ' + str(dup_count))
> writeMessage('% errors : ' + str(round(dup_count / len(oid_list), 4)))
> writeMessage(' ')
> writeMessage(str(strftime("%H:%M:%S", localtime())) + ' unique values test complete')
>
> log.close()
--- delete start ---
> del dup, dup_count, dup_list, gp, log, logFile, myFtrCls, myWrkspc
> del oid, oid_list, row, rows, tmp_list
> exit()
--- delete end ---
Not necessary: the names are cleaned up automatically and the script
exits at the end anyway.