question about speed of sequential string replacement vs regex or
Terry Reedy
tjreedy at udel.edu
Wed Sep 28 17:23:50 EDT 2011
On 9/28/2011 5:28 AM, Xah Lee wrote:
> curious question.
>
> suppose you have 300 different strings and they need all be replaced
> to say "aaa".
>
> is it faster to replace each one sequentially (i.e. replace first
> string to aaa, then do the 2nd, 3rd,...)
> , or is it faster to use a regex with “or” them all and do replace one
> shot? (i.e. "1ststr|2ndstr|3rdstr|..." -> aaa)
Here the problem is to replace multiple random substrings with one random
substring, where the replacement could itself create new matches. I would
start with the re 'or' solution.
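
A minimal sketch of the re 'or' approach, assuming the targets are plain
strings rather than patterns. The function name replace_all and the sample
targets are made up for illustration. Sorting longest-first is a design
choice of this sketch, so that a target that is a prefix of another does
not shadow it.

```python
import re

def replace_all(text, targets, replacement="aaa"):
    """Replace every occurrence of any target with `replacement` in one pass."""
    if not targets:
        return text
    # Longest first: regex alternation takes the first alternative that matches.
    ordered = sorted(targets, key=len, reverse=True)
    # re.escape so that characters like '.' or '|' in a target are literal.
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    # A function as replacement keeps backslashes in `replacement` literal.
    return pattern.sub(lambda m: replacement, text)

print(replace_all("foo bar foobar", ["foo", "foobar", "bar"]))  # aaa aaa aaa
```

Because the text is scanned once, inserted "aaa" is never rescanned, which
is where this can differ from doing 300 sequential replace calls.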
> btw, the origin of this question is about writing a emacs lisp
> function that replace ~250 html named entities to unicode char.
As you noted, this is a different problem in that there is a different
replacement for each. Also, the substrings being searched for are not
random but have a distinct and easily recognized structure. The
replacement cannot create a new match. So the multiple scan approach
*could* work.
Unspecified is whether the input is unicode or ascii bytes. If the
latter, I might copy to a bytearray (mutable), scan forward, replace
entity defs with the utf-8 encoding of the corresponding unicode char
(with a dict lookup, and which I assume is *always* shorter than the
entity), and shift other chars to close any gaps created.
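
A rough sketch of that bytearray idea, under some assumptions: the entity
table is taken from html.entities.name2codepoint in the Python 3 standard
library (the HTML 4 named entities), entities have the form &name; and the
function name replace_entities_inplace is made up here. It relies on the
utf-8 replacement never being longer than the entity it replaces, which
holds for this table (at most 3 bytes versus at least 4).

```python
from html.entities import name2codepoint

# Entity name (as bytes) -> utf-8 encoding of the corresponding character.
ENTITY_BYTES = {name.encode("ascii"): chr(cp).encode("utf-8")
                for name, cp in name2codepoint.items()}
MAX_NAME = max(len(name) for name in ENTITY_BYTES)

def replace_entities_inplace(buf):
    """Rewrite `buf` (a bytearray) in place, closing gaps as it goes."""
    read = write = 0
    n = len(buf)
    while read < n:
        if buf[read] == 0x26:  # ord('&')
            # Look for ';' only as far as the longest known name allows.
            end = buf.find(b";", read + 1, min(n, read + MAX_NAME + 2))
            if end != -1:
                repl = ENTITY_BYTES.get(bytes(buf[read + 1:end]))
                if repl is not None:
                    # Same-length slice assignment: no resizing, and
                    # write never overtakes read because repl is shorter.
                    buf[write:write + len(repl)] = repl
                    write += len(repl)
                    read = end + 1
                    continue
        # Ordinary byte (or unknown entity): shift it down to the write slot.
        buf[write] = buf[read]
        write += 1
        read += 1
    del buf[write:]  # drop the leftover tail

data = bytearray(b"a &lt; b &amp; caf&eacute; &bogus; x")
replace_entities_inplace(data)
print(data.decode("utf-8"))
```

Unknown names such as &bogus; are left untouched, and replaced text is
never rescanned.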
If the input is unicode, I might do the same with array.array (which is
where bytearray came from). Or I might use the standard idiom of
constructing a list of pieces of the original, with replacements, and
''.join() at the end.
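
The list-of-pieces idiom might look like the following for unicode input.
This is only a sketch: it again assumes the &name; form and borrows the
table from html.entities.name2codepoint, and the function name
replace_entities is made up for the example.

```python
from html.entities import name2codepoint

def replace_entities(text):
    """Return `text` with known &name; entities replaced by their characters."""
    pieces = []
    pos = 0
    while True:
        amp = text.find("&", pos)
        if amp == -1:
            break
        semi = text.find(";", amp + 1)
        if semi == -1:
            break
        codepoint = name2codepoint.get(text[amp + 1:semi])
        if codepoint is None:
            # Not a known entity: keep the '&' and resume right after it.
            pieces.append(text[pos:amp + 1])
            pos = amp + 1
        else:
            pieces.append(text[pos:amp])    # untouched text before the entity
            pieces.append(chr(codepoint))   # the replacement character
            pos = semi + 1
    pieces.append(text[pos:])               # whatever is left after the last match
    return "".join(pieces)

print(replace_entities("caf&eacute; &amp; &bogus; &lt;tag&gt;"))
```

Each piece is appended once and joined once at the end, which avoids
building a new full-length string for every replacement.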
--
Terry Jan Reedy