doing hundreds of re.subs efficiently on large strings

Mon Apr 7 21:49:18 EDT 2003

Bengt Richter wrote:
> On Mon, 07 Apr 2003 00:28:13 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
> 
> 
>>Bengt Richter wrote:
>>
>>>Thanks for testing, that is interesting. It would be interesting to see
>>>the test code you ran, so we could see the reasons for .25 vs .43 (and
>>>I could revise my reality-model as necessary ;-)
>>>
>>>Regards,
>>>Bengt Richter
>>
>>Hi Bengt,
>>
>>Here is some code that I created to test the two versions.
>>
>>Original:
>>
> 
> [...]
> 
>>print time.clock() - start
>>
>>And the modified code is:
>>
>>import time, re
>>start = time.clock()
>>data = '' # same data
>>dict = { 'US and Britain' : 'Oceania', 'US and British' : 'Oceanian', 
> 
> [...~ 135 items ...]
> 
> 
> 
>>pat = re.compile('(Canada|the U\.S\.A\.|air force|International Herald 
> 
> [... ditto ...]
> 
>>strings = pat.split(data)
>>for i in xrange(1,len(strings),2):
>>    strings[i] = dict[strings[i]]
>>data = ''.join(strings)
>>
>>I just checked these again, and the results were .052 on average for the 
>>first version, and .080 on average for the second. The second would 
>>obviously be a lot faster if I pre-compiled the pattern and cPickled it, 
> 
> I had assumed pre-compiled regex, but I learned something anyway:
> The split is a _lot_ slower than I would have thought. I had guessed that
> it would have compiled to a fast state machine like flex might generate
> for or-ed static string matches, but something else must be happening.
> Maybe it was not worth special casing static ors.
> 
> The loop to substitute into the list plus joining the result is quite fast,
> but the split is 2+ orders of magnitude slower than the latter, and the
> regex compilation is somewhat faster than the latter, but not much, apparently.
> Something like 175:150:1 for split:regexcomp:subst-in-list-using-dict-and-join
> in a test I ran on a 63.5k html file with 100 matching substrings in the data.
> 
> 
>>but I wanted to include compilation time in the test.  The fastest 
> 
> Why? Just curious -- would the patterns be changing as often as the data?

Hmmm, I was thinking in terms of equivalent operations, or starting from 
a blank slate in each case, but that is silly. What is important is the 
time, and if one solution allows you to do some of the work beforehand, 
so much the better.

> 
>>version that I have tested is using Alex's replacement suggestion in 
> 
> What version of python are you running? Mine wouldn't compile the regex
> in the form Alex suggested (with every '|' term separately parenthesized,
> like '(x)|(y)' vs '(x|y)' ). Maybe I typoed when I typed the change, but
> I got
> 
>       File "D:\python22\lib\sre_compile.py", line 442, in compile
>         assert p.pattern.groups <= 100,\
>     AssertionError: sorry, but this version only supports 100 named groups
> 
> I was hoping to compare regex speed using the two forms. Can't pursue it now though...

I'm running 2.3a2, and got the same error. I compared without the 
parentheses, figuring that it would give me a rough estimate of how 
quick it was. I am going to break the pattern up into sets of 100 named 
groups when I have a little time. I'll post final times when I get 
around to it if you're curious, including your version with the patterns 
precompiled.

>>combination with pre-compilation and cPickle, but the naive data = 
>>data.replace is surprisingly fast.
>>
> 
> Yes.
> BTW, neither of the regex-based methods are equivalent in semantics
> to the naive sequence of data.replace statements, because each data.replace
> has the possibility (in general, even if maybe not in the actual example)
> of matching something created by a previous data.replace. 

Yes, I realize that. What I actually want is the semantics of the 
regexes, since I don't want to substitute for any of the words that have 
already been substituted. The data.replace method was just an easy way 
of getting the same result, since none of the data.replace output words 
appear as inputs to later data.replaces.

> Thanks for posting the code.
> 
> Regards,
> Bengt Richter

No problem. Thanks for all your help.

regards,

-nihilo