doing hundreds of re.subs efficiently on large strings

Thu Mar 27 16:45:16 EST 2003

Bengt Richter wrote:
> On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
<snip/>
> If not, you could split on the befores and then walk through the list
> and substitute corresponding afters and join the result, e.g.,
> 
> An example source:
> 
>  >>> s = """\
>  ... before1, before2 plain stuff
>  ... and before3 and before4, and
>  ... some more plain stuff.
>  ... """
>  >>> print s
>  before1, before2 plain stuff
>  and before3 and before4, and
>  some more plain stuff.
> 
> Regex to split out befores:
>  >>> import re
>  >>> rxo = re.compile(r'(before1|before2|before3|before4)')
> 
> The parens retain the matches as the odd indexed items:
>  >>> rxo.split(s)
>  ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
> 
> A dict to look up substitutions:
>  >>> subdict = dict([('before'+x, 'after'+x) for x in '1234'])
>  >>> subdict
>  {'before4': 'after4', 'before1': 'after1', 'before2': 'after2', 'before3': 'after3'}
> 
> As above, but bind to s, so we can do substitutions on the odd elements:
>  >>> s = rxo.split(s)
>  >>> s
>  ['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']
> 
> Do the substitution:
>  >>> for i in xrange(1,len(s),2): s[i] = subdict[s[i]]
>  ...
>  >>> s
>  ['', 'after1', ', ', 'after2', ' plain stuff\nand ', 'after3', ' and ', 'after4', ', and\nsome more plain stuff.\n']
> 
> Join into single string:
>  >>> ''.join(s)
>  'after1, after2 plain stuff\nand after3 and after4, and\nsome more plain stuff.\n'
> 
> Print it to see in original format:
>  >>> print ''.join(s)
>  after1, after2 plain stuff
>  and after3 and after4, and
>  some more plain stuff.
> 
> You could easily wrap this in a function, of course.
> 
> Regards,
> Bengt Richter

I finally got around to actually timing each of these approaches.
Method 1 was a series of string.replace calls. Method 2 compiled
everything into one big regex, split the string, and used a dictionary
to look up the substitute for each odd-numbered item in the list (the
matched terms). The results were very surprising: string.replace
averaged 0.25 seconds, while the method outlined above averaged 0.43
seconds!
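
For anyone who wants to repeat the comparison, here is a minimal sketch
of the two methods side by side. It is not my actual code: the 150
zero-padded before/after pairs, the sample text, and the function names
replace_chain and split_and_join are all made up for illustration, and
it is written for a current Python (range and print() rather than
xrange and the print statement). Timings will depend heavily on the
data, so treat any numbers it prints as indicative only.

```python
import re
import timeit

# Hypothetical data: 150 literal before/after pairs. The keys are
# zero-padded so that no key is a prefix of another.
pairs = {'before%03d' % i: 'after%03d' % i for i in range(1, 151)}
text = ' '.join('before%03d plain stuff' % i for i in range(1, 151)) * 20


def replace_chain(s, table):
    """Method 1: one str.replace call per pair, each scanning the whole string."""
    for before, after in table.items():
        s = s.replace(before, after)
    return s


# One big alternation. re.escape guards against metacharacters in the
# keys; longest-first ordering keeps a short key from shadowing a longer one.
rx = re.compile(
    '(' + '|'.join(re.escape(k) for k in sorted(pairs, key=len, reverse=True)) + ')'
)


def split_and_join(s, table, pattern):
    """Method 2: split on the alternation, substitute the odd items, join."""
    parts = pattern.split(s)
    for i in range(1, len(parts), 2):
        parts[i] = table[parts[i]]
    return ''.join(parts)


# Both methods should agree on this data before we bother timing them.
assert replace_chain(text, pairs) == split_and_join(text, pairs, rx)

for name, fn in [
    ('replace chain', lambda: replace_chain(text, pairs)),
    ('split and join', lambda: split_and_join(text, pairs, rx)),
]:
    print(name, timeit.timeit(fn, number=5))
```

Note that the two methods are only equivalent when no "after" string
contains one of the "before" strings; the replace chain would rewrite
such text a second time, while the single-pass split would not.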

The bottleneck in my code turned out to be the regular expressions, 
which take almost a second and a half in total. Since they all contain 
groups, I think I'm stuck with what I have at present, though I'm 
wondering whether it is worthwhile to precompile all the expressions and 
use cPickle to load them from a file.
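
A quick way to check the pickling idea is sketched below. As far as I
can tell, pickling a compiled pattern stores only the pattern string
and flags, and the pattern is compiled again when it is loaded, so this
probably does not save any compile time; it would need measuring to be
sure. The sketch uses the pickle module (cPickle is the Python 2 name)
and reuses the small example regex from above rather than my real
expressions.

```python
import pickle
import re

# The example pattern from earlier in the thread, standing in for the
# real expressions.
rx = re.compile(r'(before1|before2|before3|before4)')

# Round-trip the compiled pattern through pickle.
blob = pickle.dumps(rx)
restored = pickle.loads(blob)

# The restored object carries the same pattern and flags and behaves
# the same way when used.
assert restored.pattern == rx.pattern
assert restored.flags == rx.flags
assert restored.split('a before1 b') == rx.split('a before1 b')
```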

Anyway, thanks again for the help. Your solution is more elegant than a 
hundred and fifty replaces, and better memory-wise too, even if it is a 
bit slower.

-nihilo