[Pythonmac-SIG] Pattern Matching Speeds?

Richard Gordon maccgi@bellsouth.net
Wed, 15 Sep 1999 00:49:44 -0400


I finally got around to porting to Python a Perl script that I 
wrote to convert two-digit years into four-digit years in text files 
for database import. In MacPerl, this processes about 1200 test 
records per second, while in MacPython the rate is more like 800 per 
second, which is kind of disappointing. Multiple comparison tests 
were done on the same data and the same machine, with both 
interpreters set to 10240K. The data is about 36,000 tab-delimited 
records, and each has two dates in it; most need to be fixed but 
some don't.

I won't bore you with the Perl code, but it's pretty simple and about 
what you would expect. I am pasting in the Python code below and 
would appreciate it if anyone can spot something that might be 
bogging this thing down. Thanks.

##############
import re, sys, string

infile = open("Conkie:Desktop Folder:2to4:fmptest.tab", "r")
outfile = open("Conkie:Desktop Folder:2to4:fixed.tab", "w")

sys.stdout = outfile
data = infile.read()
paragraphs = string.split(data, '\n')
matchstr = re.compile(r'(\b\d\d*/)(\d\d*/)(\d\d)\b')

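def cent(matchobj):
	# Called by matchstr.sub() once per date; returns the replacement text.
	centuryA = '19'
	centuryB = '20'
	# Groups 1 and 2 include the trailing '/', so a length of 2
	# means a single digit that needs a leading zero.
	if len(matchobj.group(1)) == 2:
		month = '0'+matchobj.group(1)
	else:
		month = matchobj.group(1)

	if len(matchobj.group(2)) == 2:
		day = '0'+matchobj.group(2)
	else:
		day = matchobj.group(2)

	# Years '90' through '99' become 19xx; everything else becomes 20xx.
	if matchobj.group(3) > '89':
		newDate = month+day+centuryA+matchobj.group(3)
	else:
		newDate = month+day+centuryB+matchobj.group(3)
	return newDate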

for paragraph in paragraphs:
	if not paragraph:
		break
	else:
		fixed_paragraph = matchstr.sub(cent, paragraph)
		print fixed_paragraph
##############
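
For comparison, here is a minimal sketch of the same substitution run 
once over the whole buffer instead of once per line. It has not been 
timed, so whether it helps is an open question. The names DATE_RE, 
fix_dates and sample are made up for illustration, and the sample 
records are invented rather than taken from the real data. The 
pattern cannot match across a newline, so the result should be the 
same as the per-line loop, except that the loop above stops at the 
first empty line and this version does not.

```python
import re

# Same pattern as in the script above; groups 1 and 2 keep their slash.
DATE_RE = re.compile(r'(\b\d\d*/)(\d\d*/)(\d\d)\b')

def cent(matchobj):
    month, day, year = matchobj.groups()
    # A length of 2 means one digit plus the slash, so pad with a zero.
    if len(month) == 2:
        month = '0' + month
    if len(day) == 2:
        day = '0' + day
    # Two-digit strings compare correctly as text: '90'..'99' become 19xx.
    if year > '89':
        century = '19'
    else:
        century = '20'
    return month + day + century + year

def fix_dates(text):
    # One sub() call for the whole buffer rather than one per record.
    return DATE_RE.sub(cent, text)

# Invented sample records (hypothetical, tab delimited, two dates each).
sample = "Smith\t1/5/98\t12/31/02\nJones\t10/7/1999\t3/14/00\n"
fixed = fix_dates(sample)
```

In the script above this would mean replacing the split and the loop 
with a single outfile.write(fix_dates(data)). Dates that already have 
a four-digit year are left alone, because the trailing \b cannot match 
in the middle of the year.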

Richard Gordon
--------------------
Gordon Consulting & Design
Database Design/Scripting Languages
mailto:richard@richardgordon.net
http://www.richardgordon.net
770.971.6887 (voice)
770.216.1829 (fax)