[Python-ideas] str.find() and friends support a lists of inputs
Alex Rodrigues
lemiant at hotmail.com
Thu Apr 17 20:52:00 CEST 2014
It's a fairly common problem to want to .find() or .replace() or .split() any one of multiple characters. Currently the go-to solutions are:
A:
clean_number = formatted_phone_number.replace('-', '').replace('(', '').replace(')','').replace(' ','')
B:
get_rid_of = ["-","(",")"," "]
clean_number = formatted_phone_number
for ch in get_rid_of:
clean_number = clean_number.replace(ch,'')
C:
import re
clean_number = re.sub(r'[-() ]', '', formatted_phone_number)
While none of these is especially terrible, they're also far from nice or clean. Whenever I'm faced with this kind of problem, my automatic reaction is to type:
clean_number = formatted_phone_number.replace(["-","(",")"," "],"")
That is what I intuitively want to do, and it is the syntax other people often reach for when describing how to replace multiple characters. I think this is because its semantics follow very logically from the basic replace() call you learn first.
Instead of saying "replace this with that" it's saying "replace these with that".
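The proposed semantics could be prototyped today as a small helper. This is just a sketch of the intended behavior (multi_replace is a hypothetical name, not part of the proposal):

```python
def multi_replace(s, targets, replacement):
    """Replace every occurrence of each target string with replacement."""
    for target in targets:
        s = s.replace(target, replacement)
    return s

formatted_phone_number = "(555) 123-4567"
clean_number = multi_replace(formatted_phone_number, ["-", "(", ")", " "], "")
print(clean_number)  # 5551234567
```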
In the case of split() it gets even worse: to split on multiple delimiters you almost have to resort to re. For such simple cases, though, re is serious overkill. You have to teach people about an entire new module, explain what regular expressions are, and explain what new syntax like "(abc|def)" and "[abcdef]" means, when they could just use the string methods and list syntax they already understand.
While re is an absolute life-saver in certain situations, it performs poorly for little one-off operations because it still has to compile a whole regular expression. Below is a quick test in IPython, intentionally bypassing the cache:
In [1]: a = "a"*100+"b"
In [2]: %timeit -n 1 -r 1 a.find('b')
1 loops, best of 1: 3.31 µs per loop
In [3]: import re
In [4]: %%timeit -n 1 -r 1 re.purge()
...: re.search('[b]', 'a')
...:
1 loops, best of 1: 132 µs per loop
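The same comparison can be reproduced outside IPython with the standard timeit module (absolute numbers will of course vary by machine):

```python
import re
import timeit

haystack = "a" * 100 + "b"

# Plain string method: no compilation step at all.
plain = timeit.timeit(lambda: haystack.find('b'), number=1000)

# re.purge() clears the compiled-pattern cache before each search,
# so every call pays the compilation cost, as in the IPython test above.
regex = timeit.timeit(
    "re.purge(); re.search('[b]', haystack)",
    globals={'re': re, 'haystack': haystack},
    number=1000,
)
print(f"str.find: {plain:.4f}s  re.search (uncached): {regex:.4f}s")
```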
So, for all those reasons, here is what I propose: make .find() support lists of targets, .split() support lists of delimiters, and .replace() support lists of targets. The objective is not to support all possible permutations of string operations; I expect there are many cases this will not solve. Rather, it is meant to make the built-in string operations support a slightly larger set of very common operations which fit intuitively with the existing syntax.
I'd also like to note what my own concerns were with this idea:
My first concern was that this might break existing code. But a quick check shows that passing a list currently raises a TypeError, so it doesn't affect backwards compatibility at all.
My second concern was with handling the possibility of collisions within the list (e.g. "snowflake".replace(['snow', 'snowflake'], '')). This could be resolved by explicitly specifying that whichever match begins earlier is applied before considering the others, and that if two matches start at the same position, the one earlier in the list wins. However, I'd argue that if you really need explicit control over the match order of words which contain each other, that's a pretty good time to start considering regular expressions.
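That tie-breaking rule can be made precise with a sketch: scan the string left to right, and at each position try the targets in list order (multi_replace_ordered is a hypothetical name used only for illustration):

```python
def multi_replace_ordered(s, targets, replacement):
    """Replace targets in s; an earlier-starting match wins, and at
    equal start positions the target listed first wins."""
    result = []
    i = 0
    while i < len(s):
        for t in targets:
            # First target that matches at this position is applied.
            if t and s.startswith(t, i):
                result.append(replacement)
                i += len(t)
                break
        else:
            # No target matched here; keep the character and move on.
            result.append(s[i])
            i += 1
    return ''.join(result)

print(multi_replace_ordered("snowflake", ["snow", "snowflake"], "X"))
# Xflake -- "snow" is listed first, so it wins at position 0
```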
Below is a sampling of questions from Stack Overflow which would have benefited from the existence of this syntax:
http://stackoverflow.com/questions/21859203/how-to-replace-multiple-characters-in-a-string
http://stackoverflow.com/questions/4998629/python-split-string-with-multiple-delimiters
http://stackoverflow.com/questions/10017147/python-replace-characters-in-string
http://stackoverflow.com/questions/14215338/python-remove-multiple-character-in-list-of-string
Cheers,
- Alex