Find duplicates in a list/array and count them ...

Paul.Scipione at aps.com
Fri Mar 27 15:08:27 EDT 2009


Hello,

I'm a newbie to Python.  I wrote a Python script that connects to my geodatabase (an ESRI ArcGIS File Geodatabase), retrieves the records, and then determines which ones are duplicated.  I do this using lists.  Someone suggested I use arrays instead.  The script is below.  Does anyone have ideas on how an array could improve performance?  Right now the script takes 2.5 minutes to run on a recordset of 79k+ records:

from __future__ import division
import arcgisscripting
from time import localtime, strftime

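def writeMessage(myMsg):
    # Echo the message to the console and append it to the log file.
    print myMsg
    global log
    log = open(logFile, 'a')
    log.write(myMsg + "\n")
    # Close the handle on every call so repeated calls do not leak open files.
    log.close()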

logFile = "c:\\temp\\" + str(strftime("%Y%m%d %H%M%S", localtime())) + ".log"

writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' begin unique values test')

# Create the Geoprocessor object
gp = arcgisscripting.create(9.3)
oid_list = []
dup_list = []
tmp_list = []
myWrkspc = "c:\\temp\\TVM Geodatabase GDIschema v6.0.2 PilotData.gdb"
myFtrCls = "\\Landbase\\T_GroundContour"

writeMessage(' ')
writeMessage('gdb: ' + myWrkspc)
writeMessage('ftr: ' + myFtrCls)
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' retrieving recordset...')

rows = gp.SearchCursor(myWrkspc + myFtrCls,"","","GDI_OID")
row = rows.Next()
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' processing recordset...')
while row:
    if row.GDI_OID in oid_list:
        tmp_list.append(row.GDI_OID)
    oid_list.append(row.GDI_OID)
    row = rows.Next()

writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' generating statistics...')

dup_count = len(tmp_list)
tmp_list = list(set(tmp_list))
tmp_list.sort()

for oid in tmp_list:
    # Pad the ID to a 20-character column, then append its occurrence count.
    a = (str(oid) + '     ').ljust(20)
    dup_list.append(a + '(' + str(oid_list.count(oid)) + ')')

for dup in dup_list:
    writeMessage(dup)

writeMessage(' ')
writeMessage('records    : ' + str(len(oid_list)))
writeMessage('duplicates : ' + str(dup_count))
writeMessage('% errors   : ' + str(round(100 * dup_count / len(oid_list), 2)))
writeMessage(' ')
writeMessage(str(strftime("%H:%M:%S", localtime())) + ' unique values test complete')

log.close()
del dup, dup_count, dup_list, gp, log, logFile, myFtrCls, myWrkspc
del oid, oid_list, row, rows, tmp_list
exit()
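
For reference, the slow parts of the script above are most likely the `row.GDI_OID in oid_list` test and the `oid_list.count(oid)` call. Each one scans the whole list, so the work grows roughly with the square of the record count. An array from the `array` module is scanned the same way, so by itself it would probably not help much. Below is a minimal sketch of a single-pass alternative that keeps the counts in a dict. It assumes the GDI_OID values are hashable (integers or strings); the names count_duplicates, format_report and sample_oids are made up for illustration and are not part of the script, and the sample data stands in for the cursor loop.

```python
def count_duplicates(values):
    """Count occurrences of each value in one pass.

    Returns (counts, duplicates, dup_count), where dup_count is the number
    of records beyond the first occurrence of each value, matching
    len(tmp_list) in the script above.
    """
    counts = {}
    total = 0
    for value in values:
        # Dict lookup is constant time on average, unlike `in` on a list.
        counts[value] = counts.get(value, 0) + 1
        total += 1
    duplicates = dict((v, n) for v, n in counts.items() if n > 1)
    dup_count = total - len(counts)
    return counts, duplicates, dup_count


def format_report(duplicates, width=20):
    """Build one 'ID   (count)' line per duplicated value, sorted by ID."""
    lines = []
    for value in sorted(duplicates):
        lines.append(str(value).ljust(width) + '(' + str(duplicates[value]) + ')')
    return lines


# Hypothetical sample data standing in for the GDI_OID values from the cursor.
sample_oids = [101, 102, 101, 103, 102, 101]
counts, duplicates, dup_count = count_duplicates(sample_oids)
report = format_report(duplicates)
for line in report:
    print(line)
```

In the real script the `while row:` loop would do the `counts[...] = counts.get(...) + 1` update directly on each `row.GDI_OID`. One small difference from the original formatting: `ljust(width)` does not guarantee five trailing spaces for IDs longer than 15 characters.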


Thanks!

Paul J. Scipione
GIS Database Administrator
work: 602-371-7091
cell: 480-980-4721





More information about the Python-list mailing list