efficient mega-replacements

Alex Martelli aleax at aleax.it
Thu Jan 24 09:54:00 EST 2002


"Clark C . Evans" <cce at clarkevans.com> wrote in message
news:mailman.1011813577.11632.python-list at python.org...
> Is there an efficient way to do multiple replacements?
>
>             val = val.replace("'","''")\
>                      .replace("\\","\\\\\\\\\\\\\\\\")\
>                      .replace(chr(13)+chr(10),"\\\\\\\\n")\
>                      .replace(chr(10),"\\\\\\\\n")\
>                      .replace(chr(13),"\\\\\\\\n")\
>                      .replace('"','\\\\"')
>
> Or is this the best way to do it?

I suspect the best way may be something like (note this may
NOT work for your specific example, see later):


def replace_many(string_val, replacements_dictionary):

    def substitutor(match, replacements_dictionary=replacements_dictionary):
        return replacements_dictionary[match.group(0)]

    orem = re.compile('|'.join(map(re.escape,
replacements_dictionary.keys())))

    return orem.sub(substitutor, string_val)

[In Python 2.2, you can simply def substitutor as
    def substitutor(match):
and it will work just the same.]


The basic idea: pass in the needed substitutions as a dictionary where
both keys and values are strings.  Get the strings to substitute
(the keys of the dictionary), re.escape each so it will only match
literally, and join them all into a string with | (the re "or") as
the separator.  re.compile of this "or'ed" string gives us a re
object that matches any one of the strings to substitute.

Now, we use the sub method of that re object, passing it a function
as the first argument.  The sub method calls the function for each
non-overlapping match found in the string, each time passing the
match object corresponding to the match (so .group(0) gives the
matched substring); whatever the function returns is used in lieu
of the matched substring.


The only problem here is that dictionaries have no inherent order,
so this ONLY works reliably if there is no overlap at all between
the strings to substitute!  In your example, you need to locate
both chr(13)+chr(10), AND just chr(13), so this won't reliably
work -- the "or'ed string" might happen to match chr(13) rather
than chr(13)+chr(10).

The fix is not hard -- you need to pass the strings in the order
you want them substituted.  E.g., use a sequence of pairs rather
than a dictionary, and build the dictionary in replace_many (e.g.
in Python 2.2, type dict can be called with an argument that is
a sequence of pairs and returns the corresponding dict object.


In this worth it, particularly using re rather than calling the
replace method many times?  Maybe, for MEGA replacements indeed,
i.e., LOTS of substrings to replace over one huge string.  But
what is surely wortwhile is *encapsulating* the task.  Once
you've decided on the interface needed, e.g. a function whose
first argument is the target string, second argument the
pair of key/value for substitutions, you can perfectly well
start off with the simplest implementation you can think of:

def replacer(astring, pairs):
    result = astring
    for old, new in pairs:
        result = result.replace(old, new)
    return result

and then revisit its performance only IF, and when, performance
concerns arise.  Encapsulate, encapsulate: create useful and
reusable abstractions, implement them as simply as you can
get away with, and your programs will be easier to write, to
read, and to maintain!


Alex






More information about the Python-list mailing list