Optimization of an RE substitution

Sean Holdsworth sh at hob.cx
Wed Dec 19 06:09:49 EST 2001


First some background to this. I've been working on software which makes
use of the smtplib module and noticed that for certain inputs I was
seeing very high resource usage; both memory and processor. I was curious
about this because the software has been heavily used without problem for
several months and the problem only showed up in the last few days.
Sending a particular 7MB mail message was taking about 10 minutes of
processor time and using about 600MB of memory. Normally I would expect
figures of a few seconds and maybe 20MB for this size input.

On investigation the thing that was unusual about the 7MB file was
the number of empty lines in the message body; for whatever reason
the bulk of the text was aproximately 6 million consecutive end of
line characters. I eventually tracked down what I was seeing to what I
think must be a pathological case for the underlying regular expression
library; a perl script which presumably uses the same library shows the
same behavior.

Here's some code that illustrates the problem.

import re, time
CRLF = '\r\n'
msg = '\r' * 100000
start = time.time()
msg = re.sub(r'(?:\r\n|\n|\r(?!\n))', CRLF, msg)
elapsed = time.time() - start
print '%0.2f' % elapsed
print len(msg)

The line with the re.sub call is taken from the smtplib module and is used
to ensure that a mail message has proper CRLF terminated lines as per the
SMTP protocol requirements. Anyway, this got me thinking if there was a
more efficient way of carrying out this task. I had a stab at it with the
following.

import string, time
CRLF = '\r\n'
msg = '\r' * 100000
start = time.time()
table = string.maketrans('\r\n', '\0\0')
msg = string.replace(msg, CRLF, '\0')
msg = string.translate(msg, table)
msg = string.replace(msg, '\0', CRLF)
elapsed = time.time() - start
print '%0.2f' % elapsed
print len(msg)

Then I remembered that I'm supposed to be handling 8 bit clean data...
I realise that even if I come up with a more efficient method of
doing this I'll still have to use a modified version of smtplib but
I'd be interested to hear people's suggestions and comments.

Sean



More information about the Python-list mailing list