question about speed of sequential string replacement vs regex or

Wed Sep 28 17:23:50 EDT 2011

On 9/28/2011 5:28 AM, Xah Lee wrote:
> curious question.
>
> suppose you have 300 different strings and they need all be replaced
> to say "aaa".
>
> is it faster to replace each one sequentially (i.e. replace first
> string to aaa, then do the 2nd, 3rd,...)
> , or is it faster to use a regex with “or” them all and do replace one
> shot? (i.e. "1ststr|2ndstr|3rdstr|..." ->  aaa)

Here the problem is to replace multiple arbitrary substrings with one
arbitrary substring, and that replacement could itself create new
matches. I would start with the re 'or' solution.
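
A minimal sketch of the difference, not a benchmark: the function names
and the sample strings are made up for illustration. The one-shot regex
scans the input once, so text produced by a replacement is never
rescanned, whereas sequential replacement can match inside its own
output.

```python
import re

def replace_all_with(text, targets, replacement):
    """One pass: 'or' all targets together and substitute once."""
    # Longest first, so a target that is a prefix of another cannot shadow it.
    ordered = sorted(targets, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # A function replacement avoids backslash processing of `replacement`.
    return pattern.sub(lambda m: replacement, text)

def replace_sequentially(text, targets, replacement):
    """Many passes: each pass also sees the output of earlier passes."""
    for t in targets:
        text = text.replace(t, replacement)
    return text

targets = ["xy", "aaab"]
one_pass = replace_all_with("xyb", targets, "aaa")        # "xy" -> "aaa", "b" kept
many_pass = replace_sequentially("xyb", targets, "aaa")   # "aaab" then matches too
```

Which of the two is faster for 300 targets is something to measure
(for example with timeit) on real data rather than assume.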

> btw, the origin of this question is about writing a emacs lisp
> function that replace ~250 html named entities to unicode char.

As you noted, this is a different problem in that there is a different
replacement for each entity. Also, the substrings being searched for
are not random but have a distinct and easily recognized structure. The
replacement cannot create a new match. So the multiple scan approach
*could* work.
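
A sketch of the multiple scan approach, assuming a small hand-written
table (ENTITIES and the function name are hypothetical; a real table
would hold all ~250 names). One caveat I am aware of: "&amp;" yields
"&", which can start a new entity, so this sketch handles it last.

```python
# Hypothetical subset of the entity table, for illustration only.
ENTITIES = {
    "&lt;": "<",
    "&gt;": ">",
    "&eacute;": "\u00e9",
    "&copy;": "\u00a9",
    "&amp;": "&",
}

def replace_entities_multiscan(text, table=ENTITIES):
    """One full scan of the text per entity in the table."""
    for name, char in table.items():
        if name != "&amp;":
            text = text.replace(name, char)
    # "&amp;" goes last so that "&amp;lt;" ends up as "&lt;", not "<".
    return text.replace("&amp;", "&")

result = replace_entities_multiscan("caf&eacute; &lt;b&gt; &amp;lt; &copy;")
```

Each call to str.replace rescans the whole string, so the cost grows
with the number of entities in the table.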

Unspecified is whether the input is unicode or ascii bytes. If the
latter, I might copy to a bytearray (mutable), scan forward, replace
each entity reference with the utf-8 encoding of the corresponding
unicode char (found with a dict lookup, and which I assume is *always*
shorter than the entity it replaces), and shift the other chars to
close any gaps created.
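
The bytearray idea could be sketched roughly as follows. The table,
MAX_ENTITY_LEN, and the function name are assumptions made for the
example. It keeps a read index and a write index and relies on the
replacement never being longer than the entity; anything that fails
that check is copied through unchanged.

```python
MAX_ENTITY_LEN = 32  # assumed upper bound on "&name;" length

# Hypothetical subset of the table: entity bytes -> utf-8 bytes.
BYTE_ENTITIES = {
    b"&lt;": b"<",
    b"&gt;": b">",
    b"&amp;": b"&",
    b"&eacute;": "\u00e9".encode("utf-8"),
}

def replace_entities_inplace(data, table=BYTE_ENTITIES):
    buf = bytearray(data)          # mutable copy of the input
    read = write = 0
    n = len(buf)
    while read < n:
        if buf[read] == 0x26:      # ord("&")
            end = buf.find(b";", read + 1, min(n, read + MAX_ENTITY_LEN))
            if end != -1:
                repl = table.get(bytes(buf[read:end + 1]))
                # Only safe when the replacement fits in the space consumed.
                if repl is not None and len(repl) <= end + 1 - read:
                    buf[write:write + len(repl)] = repl
                    write += len(repl)
                    read = end + 1
                    continue
        buf[write] = buf[read]     # shift an ordinary byte down to close the gap
        write += 1
        read += 1
    del buf[write:]                # drop the leftover tail
    return bytes(buf)

out = replace_entities_inplace(b"caf&eacute; &lt;b&gt; &amp;lt; &bogus; &")
```

Because this is a single forward scan, the "&" produced from "&amp;" is
never re-examined.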

If the input is unicode, I might do the same with array.array (which is 
where bytearray came from). Or I might use the standard idiom of 
constructing a list of pieces of the original, with replacements, and 
''.join() at the end.
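
The list-of-pieces idiom might look like this for unicode input. Again
only a sketch: STR_ENTITIES and the function name are hypothetical, and
the table is a tiny subset.

```python
# Hypothetical subset of the entity table, for illustration only.
STR_ENTITIES = {
    "&lt;": "<",
    "&gt;": ">",
    "&amp;": "&",
    "&eacute;": "\u00e9",
}

def replace_entities_join(text, table=STR_ENTITIES):
    pieces = []                    # untouched slices and replacement chars
    pos = 0
    while True:
        amp = text.find("&", pos)
        if amp == -1:
            break
        semi = text.find(";", amp + 1)
        char = table.get(text[amp:semi + 1]) if semi != -1 else None
        if char is None:
            # Not a known entity: keep the "&" and resume just after it.
            pieces.append(text[pos:amp + 1])
            pos = amp + 1
        else:
            pieces.append(text[pos:amp])
            pieces.append(char)
            pos = semi + 1
    pieces.append(text[pos:])      # whatever follows the last "&"
    return "".join(pieces)

joined = replace_entities_join("caf&eacute; &lt;b&gt; &amp;lt; &bogus; &")
```

Unlike the in-place version, this does not depend on the replacement
being shorter than the entity.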

-- 
Terry Jan Reedy