in str.replace(old, new), allow 'old' to accept a tuple

Hi, I find it quite useful that 'prefix' in str.startswith(prefix) accepts a tuple, because one can match more than one pattern at a time without ugliness. Would it be a good idea to do the same for str.replace(old, new)?

before

    >>> 'foo bar baz'.replace('foo', 'baz').replace('bar', 'baz')
    'baz baz baz'

after

    >>> 'foo bar baz'.replace(('foo', 'bar'), 'baz')
    'baz baz baz'

Tshepang Lekhonkhobe wrote on Wed, 11 Apr 2012, at 22:35:54 +0200:
The usual current solution is to use `re.sub`:

    >>> re.sub("foo|bar", "baz", "foo bar baz")
    'baz baz baz'

or, for a general iterable of patterns

    re.sub("|".join(map(re.escape, patterns)), repl, string)

Cheers, Sven
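As a minimal sketch of the second form, the helper below wraps that `re.sub` call. The name `replace_any` is made up for illustration and is not part of the standard library. It shows why `re.escape` matters as soon as a pattern contains a regex metacharacter such as `.`; it assumes at least one non-empty pattern.

```python
import re

def replace_any(text, patterns, repl):
    """Replace every occurrence of any literal pattern with repl in one pass.

    Hypothetical helper: re.escape makes each pattern match literally.
    """
    regex = "|".join(map(re.escape, patterns))
    # A function replacement keeps backslashes in repl from being interpreted.
    return re.sub(regex, lambda m: repl, text)

print(replace_any("foo bar baz", ["foo", "bar"], "baz"))  # baz baz baz
print(replace_any("a.b axb", ["a.b"], "X"))               # X axb
# Without escaping, '.' matches any character:
print(re.sub("a.b", "X", "a.b axb"))                      # X X
```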

Tshepang Lekhonkhobe <tshepang@gmail.com> writes:
> >>> 'foo bar baz'.replace(('foo', 'bar'), 'baz')
> 'baz baz baz'

How about:

    'foo bar baz'.replace(('foo', 'bar'), 'foobar')

You can't replace multiple matches “at the same time”, as you're implying. The order of replacements is important, since it will affect the outcome in many cases.

Do you think it's important to allow a set as the first argument to str.replace()?

    search_strings = set(['foo', 'bar'])
    'foo bar baz'.replace(search_strings, 'foobar')

I think that would be at least as desirable as your proposal; but what would be the order of replacements?

-- Ben Finney
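To make the order dependence concrete, here is a small sketch using only existing calls (`str.replace` and `re.sub`). It compares the two possible chained orders with a single left-to-right pass for the 'foobar' example; the variable names are illustrative only.

```python
import re

s = "foo bar baz"

# Chained calls: the second replace also sees text inserted by the first.
foo_first = s.replace("foo", "foobar").replace("bar", "foobar")
bar_first = s.replace("bar", "foobar").replace("foo", "foobar")

# Single left-to-right pass: inserted text is never rescanned.
one_pass = re.sub("foo|bar", "foobar", s)

print(foo_first)  # foofoobar foobar baz
print(bar_first)  # foobar foobarbar baz
print(one_pass)   # foobar foobar baz
```

All three results differ, so a tuple-accepting replace would have to pick one of these behaviours.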

On Apr 11, 2012, at 3:47 PM, Ben Finney wrote:
Can't you say the same about 'a b c'.replace("a", "aa")?

I think the case of the needles overlapping is more to your point though.

    "abc".replace( ("ab", "bc"), "b")

What should that produce? "bc"? "b"? "ab" even (if we ignore the order of the tuple)?
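For reference, this is what today's tools produce for the overlapping case. It is a sketch with existing calls only; none of it is the proposed API.

```python
import re

s = "abc"

# Chained replace, in both orders.
chained_ab_first = s.replace("ab", "b").replace("bc", "b")
chained_bc_first = s.replace("bc", "b").replace("ab", "b")

# One left-to-right pass: "ab" matches at index 0, leaving only "c" to scan.
single_pass = re.sub("ab|bc", "b", s)

print(chained_ab_first)  # b
print(chained_bc_first)  # b
print(single_pass)       # bc
```

Here the chained form gives "b" in either order, while the single pass gives "bc".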

"Carl M. Johnson" <cmjohnson.mailinglist@gmail.com> writes:
Not the same thing. The matches *can* be all “at the same time”, in every case, since only a single pattern is being matched. Then, once all those matches are found, they're all replaced. So it's not a problem. I'm pointing out that, if distinct patterns are being matched and replaced, then the order of replacement matters.
Yes, these and other cases make it problematic to think in terms of “replace them all at the same time”. The replacements should be done in an order predictable by the person reading the code. And if they should be done in order, then that order should be explicit. I think the existing solution helps with that.

-- Ben Finney

On 12/04/2012 03:46, Ben Finney wrote:
> And if they should be done in order, then that order should be explicit. I think the existing solution helps with that.
Something along the lines of
Or can this be simplified with the Python Swiss Army Knife aka the itertools module? :) -- Cheers. Mark Lawrence.

Ben Finney wrote:
> Tshepang Lekhonkhobe <tshepang@gmail.com> writes:
>> 'foo bar baz'.replace(('foo', 'bar'), 'baz')
> You can't replace multiple matches “at the same time”, as you're implying.

An obvious thing to do is to try them in the order they appear in the sequence. That would argue against allowing an unordered collection.

Not quite so obvious is whether the replacements should be considered as candidates for further replacements. I would say not, because it complicates the algorithm and in my experience is rarely needed. If you want that, you would just have to do multiple replace calls like you do now.

And how about allowing a sequence of (old, new) pairs instead of just a single replacement? That would be even more useful.

-- Greg
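One way to model the (old, new) pairs idea with existing tools is sketched below. `replace_pairs` is a hypothetical name, the implementation leans on `re.sub` rather than any proposed `str.replace` change, and it assumes every `old` is a non-empty string. Alternatives in a regex are tried in the order written, which matches the "order they appear in the sequence" rule, and replaced text is not rescanned.

```python
import re

def replace_pairs(text, pairs):
    """Hypothetical helper: apply (old, new) pairs in a single left-to-right pass.

    At each position the olds are tried in the order given; replaced text
    is not rescanned.
    """
    pairs = list(pairs)
    mapping = {}
    for old, new in pairs:
        mapping.setdefault(old, new)  # first pair wins on duplicate olds
    regex = "|".join(re.escape(old) for old, _ in pairs)
    return re.sub(regex, lambda m: mapping[m.group()], text)

swap = [("foo", "bar"), ("bar", "foo")]
print(replace_pairs("foo bar baz", swap))                        # bar foo baz
# Chained replace cannot swap, because the second call rescans the result:
print("foo bar baz".replace("foo", "bar").replace("bar", "foo")) # foo foo baz
```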

On 12Apr2012 14:47, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
| Ben Finney wrote:
| > Tshepang Lekhonkhobe <tshepang@gmail.com> writes:
| >> >>> 'foo bar baz'.replace(('foo', 'bar'), 'baz')
|
| > You can't replace multiple matches “at the same time”, as you're
| > implying.
|
| An obvious thing to do is to try them in the order they
| appear in the sequence. That would argue against allowing
| an unordered collection.

And likewise with Ben's set() suggestion.

I for one would allow it. If the order matters, the caller can produce a sequence with the required order. If the order doesn't matter (you know no replacement overlaps, and no replacement introduces text that itself should get replaced), then why not allow a set?

I vote for any iterable if this goes ahead. The specification should say that replacements happen in the order items come from the iterable, leaving the choice of control up to the caller but providing predictable behaviour if the caller provides a predictable sequence.

| Not quite so obvious is whether the replacements should
| be considered as candidates for further replacements.
| I would say not, because it complicates the algorithm
| and in my experience is rarely needed.

Not to mention recursion!

| If you want that,
| you would just have to do multiple replace calls like
| you do now.
|
| And how about allowing a sequence of (old, new) pairs
| instead of just a single replacement? That would be even
| more useful.

Sure. But doesn't that break the function signature? I suppose we're already there though. Do you want to special case the single string replacement or require callers to use zip(repls, [ "foo" for s in repls ])?

Personally, I would require the zip; the, um, flexibility of the %-format operator with string-vs-list has long bothered me, to the point of always providing a sequence, even a single element tuple.
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

Greg Ewing <greg.ewing@canterbury.ac.nz> writes:
> An obvious thing to do is to try them in the order they appear in the sequence. That would argue against allowing an unordered collection.

For that reason, I'm −0.5 on the proposal. If we're to specify multiple match patterns and do them all in a single operation, I'd prefer to specify them in e.g. a set or some other efficient non-ordered collection.

-- Ben Finney

On 12Apr2012 11:47, Ben Finney <ben+python@benfinney.id.au> wrote:
| Tshepang Lekhonkhobe <tshepang@gmail.com> writes:
|
| > >>> 'foo bar baz'.replace(('foo', 'bar'), 'baz')
| > baz baz baz
|
| How about:
|
|     'foo bar baz'.replace(('foo', 'bar'), 'foobar')
|
| You can't replace multiple matches “at the same time”, as you're
| implying. The order of replacements is important, since it will affect
| the outcome in many cases.

"At the same time" might imply something equivalent to the cited "re.sub('foo|bar',...)" suggestion. And that is different to an iterated "replace foo, then replace bar" if the possible matches overlap.

Just a thought about what semantics the OP may have envisaged.

Personally, given re.sub and the ease of running replace a few times in a loop, I'm -0.3 on the suggestion itself.

Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

Cameron Simpson <cs@zip.com.au> writes:
Yes, it is; but the OP presented a proposal as though it were to have the same semantics as a sequence of replace operations. If the OP wants to specify different semantics, let's hear it.

-- Ben Finney

On Thu, Apr 12, 2012 at 2:37 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
> If the OP wants to specify different semantics, let's hear it.

Whatever semantics were chosen, they would end up being confusing to *someone*. With prefix and suffix matching, the implicit OR is simple and obvious. The same can't be said for the replacement command, particularly if it can be used with unordered collections.

Far better to leave this task to re.sub (which uses regex syntax to avoid ambiguity) or to explicit flow control and multiple invocations of replace().

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 12/04/2012 05:56, Nick Coghlan wrote:
I rather like this proposal. The semantics for s.replace(strings, replacementString) could be:

'strings', if not a string, must be a tuple, for consistency with str.startswith (although I don't see why a list shouldn't be allowed for both).

Scan s from left to right; whenever a match is found with any member of 'strings' (tested in the order specified by 'strings'), do the replacement.

The replaced text is not eligible for further replacement.

But the real value for such proposals is not the complicated cases where the precise semantics matter, but the convenience in simple cases (almost any language feature CAN be used in an obscure way), e.g.

    def dequote(s):
        singlequote = "'"
        doublequote = '"'
        return s.replace((singlequote, doublequote), '')

+0.8
Rob Cliffe
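The semantics described here can be modelled in a few lines of pure Python. This is only a sketch: `replace_tuple` is a hypothetical stand-in for the proposed `str.replace` behaviour, and skipping empty strings in the tuple is an assumption the post does not address.

```python
def replace_tuple(s, olds, new):
    """Hypothetical pure-Python model of the semantics described above."""
    if isinstance(olds, str):
        return s.replace(olds, new)
    out = []
    i = 0
    while i < len(s):
        for old in olds:
            # Empty strings are skipped so the scan always advances.
            if old and s.startswith(old, i):
                out.append(new)   # replaced text is never rescanned
                i += len(old)
                break
        else:
            out.append(s[i])      # no member matched at this position
            i += 1
    return "".join(out)

def dequote(s):
    return replace_tuple(s, ("'", '"'), "")

print(dequote("""He said "it's fine"."""))      # He said its fine.
print(replace_tuple("abc", ("ab", "bc"), "b"))  # bc
```

Under this model the overlapping example raised earlier in the thread yields "bc".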

On Thu, Apr 12, 2012 at 06:37, Ben Finney <ben+python@benfinney.id.au> wrote:
You guys are thinking more deeply about this than I was. I don't even see a difference between the 2:
    >>> 'foo bar baz'.replace('foo', 'baz').replace('bar', 'baz') == re.sub('foo|bar', 'baz', 'foo bar baz')
    True

I was not even thinking about ordering, but I think it would help to have it, to avoid confusion. The example I gave was just the closest I could think of.

I want multiple replacements at once. For example, HTML escaping would look like:

    >>> "<>&".replace('<', '&lt;', '>', '&gt;', '&', '&amp;')
    '&lt;&gt;&amp;'

or

    >>> "<>&".replace( ('<', '&lt;'), ('>', '&gt;'), ('&', '&amp;') )
    '&lt;&gt;&amp;'
On Thu, Apr 12, 2012 at 6:39 PM, Tshepang Lekhonkhobe <tshepang@gmail.com> wrote:
-- INADA Naoki <songofacandy@gmail.com>

INADA Naoki wrote on Thu, 12 Apr 2012, at 20:32:45 +0900:
> >>> "<>&".replace( ('<', '&lt;'), ('>', '&gt;'), ('&', '&amp;') )
> '&lt;&gt;&amp;'

In current Python, it's

    >>> t = str.maketrans({"<": "&lt;", ">": "&gt;", "&": "&amp;"})
    >>> "<>&".translate(t)
    '&lt;&gt;&amp;'

Looks good enough for me.

Cheers, Sven
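A short sketch of why the single-pass `translate` form is attractive for escaping: with chained `replace` calls the result depends on where `&` is handled. The function names below are illustrative only.

```python
table = str.maketrans({"<": "&lt;", ">": "&gt;", "&": "&amp;"})

def escape_translate(s):
    # One pass over the input; inserted entities are not rescanned.
    return s.translate(table)

def escape_chained(s):
    # '&' handled first, so ampersands inserted later are left alone.
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def escape_chained_wrong(s):
    # '&' handled last: the ampersands just inserted get escaped again.
    return s.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;")

print(escape_translate("<>&"))      # &lt;&gt;&amp;
print(escape_chained("<>&"))        # &lt;&gt;&amp;
print(escape_chained_wrong("<>&"))  # &amp;lt;&amp;gt;&amp;
```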

INADA Naoki wrote on Thu, 12 Apr 2012, at 22:17:30 +0900:
> Oh, I didn't know that. Thank you. But what about unescape? str.translate accepts only one character key.

You'd currently need to use the `re` module:

    >>> d = {"&amp;": "&", "&gt;": ">", "&lt;": "<"}
    >>> re.sub("|".join(d), lambda m: d[m.group()], "&lt;&gt;&amp;")
    '<>&'

Cheers, Sven
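The same approach, written as a small function with the keys passed through `re.escape` and the pattern compiled once. This is a sketch; the names `UNESCAPE` and `unescape` are illustrative. The last line shows the double-unescape that a chained replace can produce when `&amp;` is handled first.

```python
import re

UNESCAPE = {"&amp;": "&", "&gt;": ">", "&lt;": "<"}
_pattern = re.compile("|".join(map(re.escape, UNESCAPE)))

def unescape(s):
    # One pass: text produced by a replacement is not scanned again.
    return _pattern.sub(lambda m: UNESCAPE[m.group()], s)

print(unescape("&lt;&gt;&amp;"))  # <>&
print(unescape("&amp;lt;"))       # &lt;
# Chained replace with '&amp;' first unescapes twice:
print("&amp;lt;".replace("&amp;", "&").replace("&lt;", "<"))  # <
```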

INADA Naoki, 12.04.2012 18:32:
Simpler, maybe, at least at the API level. But faster? Not necessarily. It could use Aho-Corasick, but that means it needs to construct the search graph on each call, which is fairly expensive. And str.replace() isn't the right interface for anything but a one-shot operation if the intention is to pass in a sequence of keywords.

Stefan

Stefan Behnel wrote:
So maybe a better approach would be to enhance maketrans so that both keys and replacements can be more than one character long? Behind the scenes, it could build a DFA or whatever is needed to do it efficiently. -- Greg
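A rough sketch of the "build once, apply many times" shape suggested here. `make_replacer` is a hypothetical name, and it uses a compiled `re` pattern as a stand-in, not a DFA or Aho-Corasick automaton. It assumes a non-empty mapping with non-empty keys.

```python
import re

def make_replacer(mapping):
    """Hypothetical multi-character analogue of str.maketrans + str.translate.

    Builds the matcher once; the returned function can be reused many times.
    """
    mapping = dict(mapping)
    # Longer keys first, so a key is not shadowed by a shorter key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, keys)))

    def replacer(text):
        return pattern.sub(lambda m: mapping[m.group()], text)

    return replacer

unescape = make_replacer({"&amp;": "&", "&gt;": ">", "&lt;": "<"})
print(unescape("a &lt; b &amp;&amp; b &gt; c"))  # a < b && b > c
```

The construction cost is paid once in `make_replacer`, which is the point raised about one-shot interfaces.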

You're right. But in simple situations, the overhead of making a match object and calling a callback is more expensive (e.g. https://gist.github.com/2369648 ). I think chaining replace is not so bad for such simple cases. So the problem is that there is no "one obvious way" to replace multiple keywords.

On Fri, Apr 13, 2012 at 2:08 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
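A rough way to check this claim on one's own machine is sketched below. It is not the content of the linked gist; the sample string, repeat count, and function names are assumptions chosen for illustration, and the timings will vary by interpreter and input.

```python
import re
import timeit

ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}
pattern = re.compile("[<>&]")

def escape_chained(s):
    # Three plain passes over the string, no Python-level callback.
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def escape_re(s):
    # One pass, but a match object and a callback call per replacement.
    return pattern.sub(lambda m: ESCAPES[m.group()], s)

sample = "<a href='x'>1 & 2</a> " * 100

# Both approaches must agree before their speed is compared.
assert escape_chained(sample) == escape_re(sample)
for fn in (escape_chained, escape_re):
    seconds = timeit.timeit(lambda: fn(sample), number=200)
    print(f"{fn.__name__}: {seconds:.4f}s")
```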
-- INADA Naoki <songofacandy@gmail.com>

On Apr 11, 2012, at 4:35 PM, Tshepang Lekhonkhobe wrote:
It seems to me that it is a rare use case to want to replace many things with a single replacement string. I can't remember a single case of ever needing this. The only thing that comes to mind is automated redaction.

What I have needed and have seen others need is a dictionary based replace: {'customer': 'client', 'headquarters': 'office', 'now': 'soon'}.

Even that case is fraught with peril -- I would want "now" to change to "soon" but not have "snow" change to "ssoon".

In the end, I think what people want is to have the power and control afforded by re.sub() but without having to learn regular expressions.

Raymond
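A minimal sketch of the dictionary-based replace with the "snow" pitfall handled. It uses `re.sub` with `\b` word boundaries; `replace_words` is a hypothetical name, and it assumes the keys are non-empty and consist of word characters, so that `\b` behaves as intended.

```python
import re

def replace_words(text, mapping):
    """Hypothetical helper: dictionary-based replacement of whole words only."""
    alternation = "|".join(map(re.escape, mapping))
    # \b anchors keep "now" from matching inside "snow".
    pattern = re.compile(r"\b(?:%s)\b" % alternation)
    return pattern.sub(lambda m: mapping[m.group()], text)

words = {"customer": "client", "headquarters": "office", "now": "soon"}
print(replace_words("the customer is at headquarters now, in the snow", words))
# the client is at office soon, in the snow
print("snow".replace("now", "soon"))  # ssoon
```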

On Fri, Apr 13, 2012 at 11:03 AM, Raymond Hettinger <raymond.hettinger@gmail.com> wrote:
There is one very attractive special case, however, which is an invertible translation like URL-escaping (or HTML-escaping), where at least one side of the transform is single characters. Then there is no ambiguity. Nevertheless, I think that case is special enough that it may as well be done in the modules that deal with URLs and HTML respectively.

participants (14)
- Ben Finney
- Cameron Simpson
- Carl M. Johnson
- Greg Ewing
- Gregory P. Smith
- INADA Naoki
- Mark Lawrence
- Nick Coghlan
- Raymond Hettinger
- Rob Cliffe
- Stefan Behnel
- Stephen J. Turnbull
- Sven Marnach
- Tshepang Lekhonkhobe