Mailman 3 str.find() and friends support a lists of inputs - Python-ideas

str.find() and friends support a lists of inputs

Alex Rodrigues

17 Apr 2014

6:52 p.m.

It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters. Currently the go-to solutions are:

A:
    clean_number = formatted_phone_number.replace('-', '').replace('(', '').replace(')', '').replace(' ', '')

B:
    get_rid_of = ["-", "(", ")", " "]
    clean_number = formatted_phone_number
    for ch in get_rid_of:
        clean_number = clean_number.replace(ch, '')

C:
    import re
    clean_number = re.sub(r'[-\(\) ]', '', formatted_phone_number)

While none of these is especially terrible, they're also far from nice or clean. And whenever I'm faced with this kind of problem my automatic reaction is to type:

    clean_number = formatted_phone_number.replace(["-", "(", ")", " "], "")

That is what I intuitively want to do, and it is the syntax other people often use when trying to describe that they want to replace multiple characters. I think this is because its semantics follow very logically from the original replace statement you learn. Instead of saying "replace this with that" it's saying "replace these with that".

In the case of split() it gets even worse to try and split on multiple delimiters, and you almost have to resort to using re. However, for such simple cases re is serious overkill. You have to teach people about an entire new module, explain what regular expressions are and explain what new syntax like "(abc|def)" and "[abcdef]" means, when you could just use the string functions and list syntax you already understand.

While re is an absolute life saver in certain situations, it is very non-performant for little one-off operations because it still has to compile a whole regular expression. Below is a quick test in IPython, intentionally bypassing the cache:

    In [1]: a = "a"*100+"b"

    In [2]: %timeit -n 1 -r 1 a.find('b')
    1 loops, best of 1: 3.31 µs per loop

    In [3]: import re

    In [4]: %%timeit -n 1 -r 1 re.purge()
       ...: re.search('[b]', 'a')
       ...:
    1 loops, best of 1: 132 µs per loop

So for all those reasons, this is what I propose: making .find() support lists of targets, .split() support lists of delimiters and .replace() support lists of targets.

The objective of this is not to support all possible permutations of string operations. I expect there are many cases that this will not solve; however, it is meant to make the built-in string operations support a slightly larger set of very common operations which fit intuitively with the existing syntax.

I'd also like to note what my own concerns were with this idea:

My first concern was that this might break existing code. But a quick check shows that passing a list currently raises a TypeError, so it doesn't affect backwards compatibility at all.

My second concern was with handling the possibility of collisions within the list (i.e. "snowflake".replace(['snow', 'snowflake'])). This could be ameliorated by explicitly deciding that whichever match begins earlier will be applied before considering the others, and if they start at the same position the one earlier in the list will be resolved first. However, I'd argue that if you really need explicit control over the match order of words which contain each other, that's a pretty good time to start considering regular expressions.

Below is a sampling of questions from Stack Overflow which would have benefited from the existence of this syntax.

http://stackoverflow.com/questions/21859203/how-to-replace-multiple-characte...
http://stackoverflow.com/questions/4998629/python-split-string-with-multiple...
http://stackoverflow.com/questions/10017147/python-replace-characters-in-str...
http://stackoverflow.com/questions/14215338/python-remove-multiple-character...

Cheers,
- Alex
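The collision rule described in this message (the match that starts earliest wins; when several start at the same position, the target listed first wins) can be illustrated with a small pure-Python helper. This is only a sketch under that assumption: `replace_any` is a hypothetical name rather than an existing str method, and rejecting empty targets is a choice made here to keep the loop finite.

```python
def replace_any(text, targets, replacement):
    """Hypothetical helper: replace any of `targets` in one left-to-right pass.

    Assumed rule: the earliest-starting match wins; on a tie, the target
    that appears first in `targets` wins.
    """
    targets = list(targets)
    if any(not target for target in targets):
        raise ValueError("empty target")

    pieces = []
    pos = 0
    while True:
        best_index, best_target = -1, None
        for target in targets:
            index = text.find(target, pos)
            # Strict '<' keeps the earlier-listed target when indexes tie.
            if index != -1 and (best_index == -1 or index < best_index):
                best_index, best_target = index, target
        if best_target is None:
            pieces.append(text[pos:])
            break
        pieces.append(text[pos:best_index])
        pieces.append(replacement)
        pos = best_index + len(best_target)
    return "".join(pieces)


print(replace_any("(800) 555-1234", ["-", "(", ")", " "], ""))  # 8005551234
print(replace_any("snowflake", ["snow", "snowflake"], "X"))     # Xflake
```

With the targets listed the other way round, `["snowflake", "snow"]`, the same rule replaces the whole word instead, which is the ordering question the message raises.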

Andrew Barnert

17 Apr

8:14 p.m.

On Apr 17, 2014, at 11:52, Alex Rodrigues wrote:

> It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters.

I like your solution, except for one thing. Explicitly requiring a list of arguments rather than, say, a tuple or an iterator, seems unnecessarily restrictive. However, allowing any iterable of strings doesn't work because a string is itself an iterable of strings.

There are a few cases where Python deals with this problem by treating tuples specially (e.g., % formatting), but I don't think anyone wants to extend that solution.

You could almost handle these problems with *args (replace replaces any of args[:-1] with args[-1]), except that all of them have optional parameters at the end.

You could have a keyword-only argument to specify an iterable of strings (in which case you can't use any positional arguments), but that's a pretty weird interface.

Or you could just add some new methods: split_any, replace_any, etc. But str already has a lot of methods; do we really want more?

I'd love to see an answer that works here, because I agree that it would make a lot of code simpler, and especially code that novices want to write.

Ryan Hiebert

8:28 p.m.

I like the idea. I agree with the assertion that has been discussed previously that a string really shouldn't be iterable. Because of that, I think that explicitly checking if it is str, and if not, using it as an iterator, would be appropriate.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Andrew Barnert

9:49 p.m.

On Apr 17, 2014, at 13:28, Ryan Hiebert wrote:

> I like the idea. I agree with the assertion that has been discussed previously that a string really shouldn't be iterable.

I don't want to reopen that whole argument here, except to say that there are strong cases for both sides and nobody's managed to convince everyone to their side. At any rate, even if you don't think a string should be an iterable, it definitely _is_ one in the language as it is today.

> Because of that, I think that explicitly checking if it is str, and if not, using it as an iterator, would be appropriate.

You mean as an iterable, not as an iterator, right?

But anyway, that would be very odd. Is there anywhere else in the stdlib that strings are treated specially and not iterated over? (As I mentioned before, there are cases where tuples are treated specially, but I don't think anyone considers that a good thing, or wants to take it any further.)

Also, you have to be careful about what you mean by "string". Does that mean str, str or any subclass, anything that quacks like a string in a certain context, ...? And then, what's the parallel definition for bytes and bytearray methods? (Maybe there should be ABCs for String, MutableString, ByteString, and MutableByteString to provide a better answer to those questions?)

Ryan Hiebert

10:21 p.m.

On Thu, Apr 17, 2014 at 4:49 PM, Andrew Barnert wrote:

> On Apr 17, 2014, at 13:28, Ryan Hiebert wrote:
>
>> Because of that, I think that explicitly checking if it is str, and if not, using it as an iterator, would be appropriate.
>
> You mean as an iterable, not as an iterator, right?

Yes.

> But anyway, that would be very odd. Is there anywhere else in the stdlib that strings are treated specially and not iterated over? (As I mentioned before, there are cases where tuples are treated specially, but I don't think anyone considers that a good thing, or wants to take it any further.)
>
> Also, you have to be careful about what you mean by "string". Does that mean str, str or any subclass, anything that quacks like a string in a certain context, ...? And then, what's the parallel definition for bytes and bytearray methods? (Maybe there should be ABCs for String, MutableString, ByteString, and MutableByteString to provide a better answer to those questions?)

How about defining it as "anything that currently works in those contexts"? Currently, it allows str and subclasses, but not bytes, because they cannot be converted implicitly to str. So, try it how it works now, and if that fails, try using it as an iterable. Are there similar methods on bytes which might make that definition confusing?
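The "try it how it works now, and if that fails, try using it as an iterable" rule could look roughly like the sketch below. `find_any` is a hypothetical helper, and returning the lowest index among all targets is an assumption about the intended semantics, not something the thread settles.

```python
def find_any(text, targets):
    """Hypothetical helper: lowest index of any target in text, or -1."""
    try:
        # Current behaviour: accepted for str and str subclasses.
        return text.find(targets)
    except TypeError:
        # Not usable as a string, so treat it as an iterable of strings.
        hits = [i for i in (text.find(t) for t in targets) if i != -1]
        return min(hits, default=-1)


# A str is itself an iterable of one-character strings, which is why a
# plain "is it iterable?" check cannot tell the two cases apart.
print(list("lo"))                                   # ['l', 'o']

print(find_any("hello world", "lo"))                # 3
print(find_any("hello world", ["wor", "lo"]))       # 3
print(find_any("hello world", iter(["z", "w"])))    # 6
print(find_any("hello world", ()))                  # -1
```

Because the string path is tried first, `"lo"` is searched for as a substring and never split into `"l"` and `"o"`; a bytes argument still ends in a TypeError, raised from the fallback branch.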

Chris Angelico

11:05 p.m.

On Fri, Apr 18, 2014 at 4:52 AM, Alex Rodrigues wrote:

> Below is a quick test in IPython, intentionally bypassing the cache:
>
> In [1]: a = "a"*100+"b"
>
> In [2]: %timeit -n 1 -r 1 a.find('b')
> 1 loops, best of 1: 3.31 µs per loop
>
> In [3]: import re
>
> In [4]: %%timeit -n 1 -r 1 re.purge()
>    ...: re.search('[b]', 'a')
>    ...:
> 1 loops, best of 1: 132 µs per loop

I'm always dubious of micro-benchmarks, especially when caches have to be deliberately bypassed. How does the time compare if you *don't* purge the cache? After all, compiling an RE once and using it lots of times is exactly how they're meant to be used.

Yes, it would be potentially cleaner to offer a list of strings to .find(); but maybe reaching for a regex is the right thing to do. Last night I wanted to rename a whole bunch of files thus: "DoYouWannaBuildASnowman.mkv" -> "Frozen - Do You Wanna Build A Snowman.mkv". Constant text at the beginning, then add a space before every capital letter. Heretical or not, I went regex. :)

ChrisA
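The rename described in this message (constant prefix, then a space before every capital letter) might be done with a regex along the following lines. The message does not show the pattern that was actually used, so the pattern and the helper names `add_spaces_before_capitals` and `frozen_title` are assumptions made for illustration.

```python
import os
import re


def add_spaces_before_capitals(name):
    """Insert a space before every capital letter except a leading one."""
    # (?<!^)    : not at the very start of the string
    # (?=[A-Z]) : immediately before an ASCII capital letter
    return re.sub(r"(?<!^)(?=[A-Z])", " ", name)


def frozen_title(filename, prefix="Frozen - "):
    """Hypothetical helper: build the new file name, keeping the extension."""
    stem, ext = os.path.splitext(filename)
    return prefix + add_spaces_before_capitals(stem) + ext


print(frozen_title("DoYouWannaBuildASnowman.mkv"))
# Frozen - Do You Wanna Build A Snowman.mkv
```

The pattern matches empty positions rather than characters, so nothing is consumed and only spaces are inserted.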

Alex Rodrigues

11:14 p.m.

Currently .endswith() and .startswith() accept a str, unicode, or tuple and use the tuple exactly in the same way this would. That might not be a bad place to start when thinking about which types to support.

MRAB

11:16 p.m.

On 2014-04-17 21:14, Andrew Barnert wrote:

> On Apr 17, 2014, at 11:52, Alex Rodrigues wrote:
>
>> It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters.
>
> I like your solution, except for one thing. Explicitly requiring a list of arguments rather than, say, a tuple or an iterator, seems unnecessarily restrictive. However, allowing any iterable of strings doesn't work because a string is itself an iterable of strings.
>
> There are a few cases where Python deals with this problem by treating tuples specially (e.g., % formatting), but I don't think anyone wants to extend that solution.

str.startswith and str.endswith accept either a string or a tuple of strings, but not a list of strings nor any other iterable, so does it matter if str.find, etc, accepted a tuple but not a list?

Greg Ewing

18 Apr

12:44 a.m.

Andrew Barnert wrote:

> There are a few cases where Python deals with this problem by treating tuples specially (e.g., % formatting), but I don't think anyone wants to extend that solution.

The startswith() and endswith() methods already accept a tuple in place of a string, and require it to be a tuple. So I think it would be entirely reasonable to do the same for replace().

--
Greg

Terry Reedy

1:31 a.m.

On 4/17/2014 2:52 PM, Alex Rodrigues wrote:

> It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters. Currently the go-to solutions are:

For replace, you left out the actual solution.

    >>> telnum = '(800), 555-1234'
    >>> telnum.translate(str.maketrans('','','(- ,)'))
    '8005551234'

For finding any of multiple chars, a loop that does just what you want is easy.

    targets = '(- ,)'
    for i, c in enumerate(s):
        if c in targets:
            break
    else:
        <whatever you want for not found>
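For anyone who wants to run the loop above, here it is wrapped in a function. `find_any_char` is a hypothetical name, and returning -1 for "not found" is an assumption chosen to mirror str.find(); the message leaves that case open.

```python
def find_any_char(s, targets):
    """Hypothetical helper: index of the first character of s found in targets."""
    for i, c in enumerate(s):
        if c in targets:
            return i
    return -1  # assumed convention, mirroring str.find()


telnum = '(800), 555-1234'
targets = '(- ,)'

print(find_any_char(telnum, targets))          # 0
print(find_any_char('8005551234', targets))    # -1
print(find_any_char('800-555', targets))       # 3

# Deleting every such character in one pass, as in the translate() example.
print(telnum.translate(str.maketrans('', '', targets)))   # 8005551234
```

Both approaches work only for single-character targets; multi-character substrings need a different technique.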

Terry Reedy

1:46 a.m.

On 4/17/2014 4:28 PM, Ryan Hiebert wrote:

> I like the idea.

The idea depends on being able to iterate through strings.

> I agree with the assertion that has been discussed previously that a string really shouldn't be iterable.

...except when they should be. This assertion is off-topic for python-ideas. If you bring it up on python-list, I will say more.

--
Terry Jan Reedy

Steven D'Aprano

1:49 a.m.

On Thu, Apr 17, 2014 at 01:14:23PM -0700, Andrew Barnert wrote:

> On Apr 17, 2014, at 11:52, Alex Rodrigues wrote:
>
>> It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters.
>
> I like your solution, except for one thing. Explicitly requiring a list of arguments rather than, say, a tuple or an iterator, seems unnecessarily restrictive. However, allowing any iterable of strings doesn't work because a string is itself an iterable of strings.
>
> There are a few cases where Python deals with this problem by treating tuples specially (e.g., % formatting), but I don't think anyone wants to extend that solution.

I do! That makes the decision really simple: if the argument is a tuple, it is treated as multiple values, otherwise it is treated as a single value. That's how other string methods operate:

    py> 'abcd'.startswith(('xyz', 'abc'))
    True
    py> 'abcd'.startswith(['xyz', 'abc'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: startswith first arg must be str or a tuple of str, not list

so it's quite easy to learn, with no concerns about whether or not the argument will accept a set or a dict or iterators...

--
Steven

Steven D'Aprano

2:22 a.m.

On Thu, Apr 17, 2014 at 09:31:24PM -0400, Terry Reedy wrote:

> On 4/17/2014 2:52 PM, Alex Rodrigues wrote:
>
>> It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters. Currently the go-to solutions are:
>
> For replace, you left out the actual solution.
>
> >>> telnum = '(800), 555-1234'
> >>> telnum.translate(str.maketrans('','','(- ,)'))
> '8005551234'

That solution only works when you want to replace single characters. It doesn't help to replace generic substrings:

    "That's one small step for a man, one giant leap for mankind.".replace('man', 'person')

Naively looping over your input may not work.

    for term in (a, b, c):
        s = s.replace(term, x)

is not always the same as doing the replacements in a single pass. It seems like it ought to be the same, until you run into a situation like this:

    py> recipe = "Add one capsicum to the stew..."
    py> for term in ("capsicum", "chilli pepper", "pepper"):
    ...     recipe = recipe.replace(term, "bell pepper")
    ...
    py> print(recipe)
    Add one bell bell pepper to the stew...

Oops! If the search terms are not known until runtime, you may have a lot of difficulty doing the replacements in an order that doesn't cause problems like this. There are ways around this problem, but they're tricky to get right.

Although this is easily done using a regex, it does require the user learn about regexes, which may be overkill. They have a steep learning curve and can be intimidating to beginners.

In order of preference, I'd prefer:

+1 allow str.find, .index and .replace to take tuple arguments to search for multiple substrings in a single pass;

+0.5 to add helper functions in the string module findany, replaceany. (No need for an indexany.)

-0 tell the user to "just use a regex".

(Perhaps we could give a couple of regex recipes in the Python FAQ or the re docs?)

--
Steven

Terry Reedy

4:44 a.m.

On 4/17/2014 10:22 PM, Steven D'Aprano wrote:

> On Thu, Apr 17, 2014 at 09:31:24PM -0400, Terry Reedy wrote:
>
>> On 4/17/2014 2:52 PM, Alex Rodrigues wrote:
>>
>>> It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters. Currently the go-to solutions are:
>>
>> For replace, you left out the actual solution.
>>
>> >>> telnum = '(800), 555-1234'
>> >>> telnum.translate(str.maketrans('','','(- ,)'))
>> '8005551234'
>
> That solution only works when you want to replace single characters.

That was the problem Alex presented and solved 3 ways other than the above.

> It doesn't help to replace generic substrings:

This is definitely a harder problem. The easiest solution would be

    import re

    def replace_multiple_targets(string, target_tuple, replacement):
        # Escape each target so regex metacharacters are matched literally.
        pattern = '|'.join(map(re.escape, target_tuple))
        return re.sub(pattern, replacement, string)

    print(replace_multiple_targets("Add one capsicum to the stew...",
          ("capsicum", "chilli pepper", "pepper"), "bell pepper"))
    # Add one bell pepper to the stew...

> Although this is easily done using a regex, it does require the user learn about regexes, which may be overkill. They have a steep learning curve and can be intimidating to beginners.

One of the problems is the mismatch of apis.

    str.replace(string, pattern, repl[, count])
    # versus
    re.sub(pattern, repl, string, count=0, flags=0)

The function/method name is different, the parameter order is different (partly for good reason), and the count default is different (at least as presented). Ugh. I can never remember this and use help each time, at least for the re version.

The other reason Alex does not like re is that it is 'slow'. In this case, the medium difficulty problem (see below), matching any of multiple strings, has medium speed solutions.

https://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
https://en.wikipedia.org/wiki/Rabin-Karp_algorithm
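The API mismatch is easiest to see with the same replacement written both ways. This is a minimal illustration using only the documented signatures; the sample string and the variable names are arbitrary.

```python
import re

text = "a-b-c-d"

# str.replace: the subject string comes first (it is `self`), then old, new, count.
all_via_str = text.replace("-", "+")
two_via_str = text.replace("-", "+", 2)
unbound_form = str.replace(text, "-", "+", 2)   # same call, written as in the message

# re.sub: pattern and replacement come first, the subject string third.
# count=0 (the default) means "replace all", like omitting count in str.replace.
all_via_re = re.sub("-", "+", text)
two_via_re = re.sub("-", "+", text, count=2)

print(all_via_str, all_via_re)   # a+b+c+d a+b+c+d
print(two_via_str, two_via_re)   # a+b+c-d a+b+c-d
```

Passing `count` by keyword to re.sub avoids confusing it with `flags`, which follows it in the signature.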

> In order of preference, I'd prefer:
>
> +1 allow str.find, .index and .replace to take tuple arguments to search for multiple substrings in a single pass;

+ delta

> +0.5 to add helper functions in the string module findany, replaceany. (No need for an indexany.)

- something

> -0 tell the user to "just use a regex".

+ epsilon

> (Perhaps we could give a couple of regex recipes in the Python FAQ or the re docs?)

Like the above?

My worry about the request is this. Python has many pairs of functions with one being a slow generic version and the other being a fast special case version. The fast and slow paths within the interpreter are one example. The str versus re pairs are another. There are often in-between, more generic but not fully, medium speed versions, perhaps more than one. How many of these should we provide? My worry is making the language overall harder to maintain and/or use by providing a multitude of in-between functions. In this case, the ease of simply allowing tuples as an alternative input, with a fairly obvious meaning, is a plus.

An addition to your list is this. Add a new msm (multiple string match) module that would expose one (or both?) of the algorithms above in a match function and provide methods or functions like those in the str and re modules. I would be tempted to use the str versions of the apis where possible, and omit a compile function and the corresponding methods. (The hidden cache should be enough to avoid constant recompiles.)

--
Terry Jan Reedy

Ron Adam

19 Apr

1:40 a.m.

On 04/17/2014 10:22 PM, Steven D'Aprano wrote:

> On Thu, Apr 17, 2014 at 09:31:24PM -0400, Terry Reedy wrote:
>
>> On 4/17/2014 2:52 PM, Alex Rodrigues wrote:
>>
>>> It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters. Currently the go-to solutions are:
>>
>> For replace, you left out the actual solution.
>>
>> >>> telnum = '(800), 555-1234'
>> >>> telnum.translate(str.maketrans('','','(- ,)'))
>> '8005551234'
>
> That solution only works when you want to replace single characters. It doesn't help to replace generic substrings:
>
>     "That's one small step for a man, one giant leap for mankind.".replace('man', 'person')
>
> Naively looping over your input may not work.
>
>     for term in (a, b, c):
>         s = s.replace(term, x)
>
> is not always the same as doing the replacements in a single pass. It seems like it ought to be the same, until you run into a situation like this:
>
>     py> recipe = "Add one capsicum to the stew..."
>     py> for term in ("capsicum", "chilli pepper", "pepper"):
>     ...     recipe = recipe.replace(term, "bell pepper")
>     ...
>     py> print(recipe)
>     Add one bell bell pepper to the stew...
>
> Oops! If the search terms are not known until runtime, you may have a lot of difficulty doing the replacements in an order that doesn't cause problems like this. There are ways around this problem, but they're tricky to get right.

Possibly start by allowing the sep argument of str.partition to accept a tuple.

    recipe, _recipe = "", recipe
    while _recipe:
        head, sep, _recipe = _recipe.partition(
            ("capsicum", "chilli pepper", "pepper"))
        if sep:
            recipe = "".join([recipe, head, "bell pepper"])
        else:
            recipe = "".join([recipe, head])

Then maybe the other methods can use str.partition to do the work.

Cheers,
   Ron
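str.partition does not accept a tuple today, so the loop above cannot be run as written. The sketch below emulates the suggestion with a hypothetical `partition_any` function that splits at the earliest occurrence of any separator (ties go to the separator listed first, which is an assumption), and then builds a single-pass replacement on top of it in the same way as the loop above.

```python
def partition_any(text, seps):
    """Hypothetical stand-in for str.partition with a tuple of separators."""
    best_index, best_sep = -1, ""
    for sep in seps:
        if not sep:
            raise ValueError("empty separator")  # same rule as str.partition
        index = text.find(sep)
        # Strict '<' keeps the earlier-listed separator when indexes tie.
        if index != -1 and (best_index == -1 or index < best_index):
            best_index, best_sep = index, sep
    if best_index == -1:
        return text, "", ""
    return text[:best_index], best_sep, text[best_index + len(best_sep):]


def replace_via_partition(text, targets, replacement):
    """Replace any of the targets in one pass, using partition_any."""
    pieces = []
    rest = text
    while rest:
        head, sep, rest = partition_any(rest, targets)
        pieces.append(head)
        if sep:
            pieces.append(replacement)
    return "".join(pieces)


recipe = "Add one capsicum to the stew..."
terms = ("capsicum", "chilli pepper", "pepper")
print(replace_via_partition(recipe, terms, "bell pepper"))
# Add one bell pepper to the stew...
```

Because the text already replaced is never scanned again, the inserted "bell pepper" is not itself matched by "pepper", which avoids the "bell bell pepper" result of the naive loop.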
