doing hundreds of re.subs efficiently on large strings
nihilo
user at domain.invalid
Mon Apr 7 21:49:18 EDT 2003
Bengt Richter wrote:
> On Mon, 07 Apr 2003 00:28:13 GMT, nihilo <exnihilo at NOmyrealCAPSbox.com> wrote:
>
>
>>Bengt Richter wrote:
>>
>>>Thanks for testing, that is interesting. It would be interesting to see
>>>the test code you ran, so we could see the reasons for .25 vs .43 (and
>>>I could revise my reality-model as necessary ;-)
>>>
>>>Regards,
>>>Bengt Richter
>>
>>Hi Bengt,
>>
>>Here is some code that I created to test the two versions.
>>
>>Original:
>>
>
> [...]
>
>>print time.clock() - start
>>
>>And the modified code is:
>>
>>import time, re
>>start = time.clock()
>>data = '' # same data
>>dict = { 'US and Britain' : 'Oceania', 'US and British' : 'Oceanian',
>
> [...~ 135 items ...]
>
>
>
>>pat = re.compile('(Canada|the U\.S\.A\.|air force|International Herald
>
> [... ditto ...]
>
>>strings = pat.split(data)
>>for i in xrange(1,len(strings),2):
>> strings[i] = dict[strings[i]]
>>data = ''.join(strings)
>>
>>I just checked these again, and the results were .052 on average for the
>>first version, and .080 on average for the second. The second would
>>obviously be a lot faster if I pre-compiled the pattern and cPickled it,
>
> I had assumed pre-compiled regex, but I learned something anyway:
> The split is a _lot_ slower than I would have thought. I had guessed that
> it would have compiled to a fast state machine like flex might generate
> for or-ed static string matches, but something else must be happening.
> Maybe it was not worth special casing static ors.
>
> The loop to substitute into the list plus joining the result is quite fast,
> but the split is 2+ orders of magnitude slower than the latter, and the
> regex compilation is somewhat faster than the latter, but not much, apparently.
> Something like 175:150:1 for split:regexcomp:subst-in-list-using-dict-and-join
> in a test I ran on a 63.5k html file with 100 matching substrings in the data.
>
>
>>but I wanted to include compilation time in the test. The fastest
>
> Why? Just curious -- would the patterns be changing as often as the data?
Hmmm, I was thinking in terms of equivalent operations, or starting from
a blank slate in each case, but that is silly. What is important is the
time, and if one solution allows you to do some of the work beforehand,
so much the better.
>
>>version that I have tested is using Alex's replacement suggestion in
>
> What version of python are you running? Mine wouldn't compile the regex
> in the form Alex suggested (with every '|' term separately parenthesized,
> like '(x)|(y)' vs '(x|y)' ). Maybe I typoed when I typed the change, but
> I got
>
> File "D:\python22\lib\sre_compile.py", line 442, in compile
> assert p.pattern.groups <= 100,\
> AssertionError: sorry, but this version only supports 100 named groups
>
> I was hoping to compare regex speed using the two forms. Can't pursue it now though...
I'm running 2.3a2, and got the same error. I compared without the
parentheses, figuring that it would give me a rough estimate of how
quick it was. I am going to break the pattern up into sets of 100 named
groups when I have a little time. I'll post final times when I get
around to it if you're curious, including your version with the patterns
precompiled.
>>combination with pre-compilation and cPickle, but the naive data =
>>data.replace is surprisingly fast.
>>
>
> Yes.
> BTW, neither of the regex-based methods are equivalent in semantics
> to the naive sequence of data.replace statements, because each data.replace
> has the possibility (in general, even if maybe not in the actual example)
> of matching something created by a previous data.replace.
Yes, I realize that. What I actually want is the semantics of the
regexes, since I don't want to substitute for any of the words that have
already been substituted. The data.replace method was just an easy way
of getting the same result, since none of the data.replace output words
appear as inputs to later data.replaces.
> Thanks for posting the code.
>
> Regards,
> Bengt Richter
No problem. Thanks for all your help.
regards,
-nihilo
More information about the Python-list
mailing list