substitution
Anthra Norell
anthra.norell at bluewin.ch
Fri Jan 22 02:29:10 EST 2010
Iain King wrote:
> On Jan 21, 2:18 pm, Wilbert Berendsen <wbs... at xs4all.nl> wrote:
>
>> On Monday 18 January 2010, Adi wrote:
>>
>>
>>> keys = [(len(key), key) for key in mapping.keys()]
>>> keys.sort(reverse=True)
>>> keys = [key for (_, key) in keys]
>>>
>>> pattern = "(%s)" % "|".join(keys)
>>> repl = lambda x : mapping[x.group(1)]
>>> s = "fooxxxbazyyyquuux"
>>>
>>> re.subn(pattern, repl, s)
>>>
>> I managed to make it even shorter, using the key argument for sorted, not
>> putting the whole regexp inside parentheses and pre-compiling the regular
>> expression:
>>
>> import re
>>
>> mapping = {
>> "foo" : "bar",
>> "baz" : "quux",
>> "quuux" : "foo"
>>
>> }
>>
>> # sort the keys, longest first, so 'aa' gets matched before 'a', because
>> # in Python regexps the first match (going from left to right) in a
>> # |-separated group is taken
>> keys = sorted(mapping.keys(), key=len, reverse=True)
>>
>> rx = re.compile("|".join(keys))
>> repl = lambda x: mapping[x.group()]
>> s = "fooxxxbazyyyquuux"
>> rx.sub(repl, s)
>>
>> One thing remaining: if the replacement keys could contain non-alphanumeric
>> characters, they should be escaped using re.escape:
>>
>> rx = re.compile("|".join(re.escape(key) for key in keys))
>>
>> Kind regards,
>> Wilbert Berendsen
>>
>> --http://www.wilbertberendsen.nl/
>> "You must be the change you wish to see in the world."
>> -- Mahatma Gandhi
>>
>
> Sorting it isn't the right solution: easier to hold the subs as tuple
> pairs and by doing so let the user specify order. Think of the
> following subs:
>
> "fooxx" -> "baz"
> "oxxx" -> "bar"
>
> does the user want "bazxbazyyyquuux" or "fobarbazyyyquuux"?
>
> Iain
>
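Iain's two outcomes can be reproduced with a short sketch. It assumes he means applying the pairs one after another in the order given, which his message does not spell out; apply_in_order is a hypothetical helper, not code from this thread.

```python
def apply_in_order(pairs, text):
    """Apply each (target, replacement) pair to the whole text, in order.

    Hypothetical helper: sequential application is an assumption about
    what user-specified order means here.
    """
    for target, replacement in pairs:
        text = text.replace(target, replacement)
    return text


s = "fooxxxbazyyyquuux"

# "fooxx" first: it consumes the characters that "oxxx" would need.
first = apply_in_order([("fooxx", "baz"), ("oxxx", "bar")], s)

# "oxxx" first: afterwards "fooxx" no longer occurs in the text.
second = apply_in_order([("oxxx", "bar"), ("fooxx", "baz")], s)

print(first)   # bazxbazyyyquuux
print(second)  # fobarbazyyyquuux
```

A single left-to-right regex pass would always produce the first result, because "fooxx" starts further upstream than "oxxx".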
There is no way you can automate a user's choice. If he wants the second
choice (oxxx->bar) he would have to add a third pattern: fooxxx ->
fobar. In general, the rules 'upstream over downstream' and 'long over
short' make sense in practically all cases. With all but simple
substitution runs whose functionality is obvious, the result needs to be
checked for unintended hits. Here is an example from my SE manual, which
runs a (whimsical) text through a set of substitutions with a
concentration of overlapping targets:
>>> substitutions = [['be', 'BE'], ['being', 'BEING'], ['been', 'BEEN'],
...                  ['bee', 'BEE'], ['belong', 'BELONG'], ['long', 'LONG'],
...                  ['longer', 'LONGER']]
>>> # Translator: code further up in this thread, handling precedence
>>> # by the two rules mentioned
>>> T = Translator (substitutions)
>>> text = ("There was a bee named Mabel belonging to hive nine longing "
...         "to be a beetle and thinking that being a bee was okay, but she "
...         "had been a bee long enough and wouldn't be one much longer.")
>>> print T (text)
There was a BEE named MaBEl BELONGing to hive nine LONGing to BE a
BEEtle and thinking that BEING a BEE was okay, but she had BEEN a BEE
LONG enough and wouldn't BE one much LONGER.
All word-length substitutions resolve correctly. There are four
unintended translations, though: MaBEl, BELONGing, LONGing and BEEtle.
Adding the substitution Mabel->Mabel would prevent the first miss. The
others could be taken care of similarly by replacing the target with
itself. With large substitution sets and extensive data, this amounts to
an iterative process of running, checking and fixing, many times over.
That just isn't practical and may have to be abandoned when the
substitutions catalog grows out of reasonable bounds. Runs are dependable
only where the targets are predictably unique, such as long id numbers
that cannot possibly match anything but id numbers.
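
The self-replacement fix can be sketched as follows. The Translator code from earlier in the thread is not reproduced here; make_translator is a hypothetical stand-in built on the regex approach quoted above (longest target first, one left-to-right pass), and the text and substitution list are shortened for illustration.

```python
import re


def make_translator(substitutions):
    """Return a function applying all substitutions in a single pass.

    Hypothetical stand-in for the thread's Translator. 'Long over short'
    comes from sorting targets longest first; 'upstream over downstream'
    comes from the regex engine scanning left to right.
    """
    mapping = dict(substitutions)
    targets = sorted(mapping, key=len, reverse=True)
    rx = re.compile("|".join(re.escape(t) for t in targets))
    return lambda text: rx.sub(lambda m: mapping[m.group()], text)


substitutions = [("be", "BE"), ("bee", "BEE"), ("long", "LONG")]

# Guards: targets replaced with themselves to block unintended hits.
guards = [("Mabel", "Mabel"), ("beetle", "beetle"), ("longing", "longing")]

text = "Mabel the bee was longing to be a beetle for a long time."

plain = make_translator(substitutions)(text)
guarded = make_translator(substitutions + guards)(text)

print(plain)    # MaBEl the BEE was LONGing to BE a BEEtle for a LONG time.
print(guarded)  # Mabel the BEE was longing to BE a beetle for a LONG time.
```

Each guard has to be found by inspecting the output, which is the iterative run, check and fix cycle described above.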
Frederic