[Python-ideas] str.find() and friends support a lists of inputs

Terry Reedy tjreedy at udel.edu
Fri Apr 18 06:44:46 CEST 2014


On 4/17/2014 10:22 PM, Steven D'Aprano wrote:
> On Thu, Apr 17, 2014 at 09:31:24PM -0400, Terry Reedy wrote:
>> On 4/17/2014 2:52 PM, Alex Rodrigues wrote:
>>> It's a fairly common problem to want to .find() or .replace() or .split()
>>> any one of multiple characters.
>>> Currently the go-to solutions are:
>>
>> For replace, you left out the actual solution.
>>
>>>>> telnum = '(800), 555-1234'
>>>>> telnum.translate(str.maketrans('','','(- ,)'))
>> '8005551234'
>
> That solution only works when you want to replace single characters.

That was the problem Alex presented and solved in three ways other than 
the one above.

 > It doesn't help to replace generic substrings:

This is definitely a harder problem. The easiest solution would be:

import re

def replace_multiple_targets(string, target_tuple, replacement):
    # Escape each target so that literal substrings containing regex
    # metacharacters are still matched verbatim.
    pattern = '|'.join(map(re.escape, target_tuple))
    return re.sub(pattern, replacement, string)

print(replace_multiple_targets("Add one capsicum to the stew...",
                               ("capsicum", "chilli pepper", "pepper"),
                               "bell pepper"))
# Add one bell pepper to the stew...


> Although this is easily done using a regex, it does require the user
> learn about regexes, which may be overkill. They have a steep learning
> curve and can be intimidating to beginners.

One of the problems is the mismatch of APIs.

str.replace(string, pattern, repl[, count])  # versus
re.sub(pattern, repl, string, count=0, flags=0)

The function/method name is different, the parameter order is different 
(partly for good reason), and the count default is different (at least 
as presented). Ugh. I can never remember this and use help each time, at 
least for the re version.
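
To make the mismatch concrete, here is the same single-substring 
replacement spelled both ways (a small added illustration; the calls 
and defaults are the documented ones, and the comments show what each 
call prints):

import re

s = 'spam, spam, spam'

# str.replace: the string is the object, old comes before new,
# and omitting count means "replace all".
print(s.replace('spam', 'eggs'))           # eggs, eggs, eggs
print(s.replace('spam', 'eggs', 1))        # eggs, spam, spam

# re.sub: the replacement comes *before* the string, and the
# default count=0 also means "replace all".
print(re.sub('spam', 'eggs', s))           # eggs, eggs, eggs
print(re.sub('spam', 'eggs', s, count=1))  # eggs, spam, spam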

The other reason Alex does not like re is that it is 'slow'. In this 
case, the medium-difficulty problem (see below), matching any of 
multiple strings, has medium-speed solutions:
https://en.wikipedia.org/wiki/Aho-Corasick_string_matching_algorithm
https://en.wikipedia.org/wiki/Rabin-Karp_algorithm
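
For the simple cases in question, even a plain-Python helper is a 
reasonable middle ground. The sketch below is only an illustration: 
findany is the name from Steven's suggested helpers, but the 
(index, target) return convention is my own, and it does one str.find 
per target rather than a true single-pass scan like Aho-Corasick.

def findany(s, targets, start=0):
    """Return (index, target) for the leftmost match, or (-1, None)."""
    best_i, best_t = -1, None
    for t in targets:
        i = s.find(t, start)
        if i != -1 and (best_i == -1 or i < best_i):
            best_i, best_t = i, t
    return best_i, best_t

print(findany("Add one capsicum to the stew...",
              ("chilli pepper", "capsicum", "pepper")))
# (8, 'capsicum')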

> In order of preference, I'd prefer:
>
> +1 allow str.find, .index and .replace to take tuple arguments to search
> for multiple substrings in a single pass;

+ delta

> +0.5 to add helper functions in the string module findany replaceany.
> (No need for an indexany.)

- something

> -0 tell the user to "just use a regex".

+ epsilon

> (Perhaps we could give a couple of regex recipes in the Python FAQ or
> the re docs?)

Like the above?

My worry about the request is this. Python has many pairs of functions 
with one being a slow generic version and the other being a fast special 
case version. The fast and slow paths within the interpreter are one 
example.  The str versus re pairs are another. There are often 
in-between versions, more generic than the fast one but not fully 
generic, and of medium speed, perhaps more than one. How many of these 
should we provide? My worry is making the language overall harder to 
maintain and/or use by providing a multitude of in-between functions.

In this case, the ease of simply allowing tuples as an alternative 
input, with a fairly obvious meaning, is a plus.
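
str.startswith and str.endswith already accept a tuple of substrings, 
which is the precedent for that 'fairly obvious meaning':

print('spam.py'.endswith(('.py', '.pyw')))   # True

A rough sketch of what a tuple-accepting replace would mean, written as 
a helper using Steven's suggested name replaceany (note that looping 
once per target, as here, is not quite the proposed single-pass 
semantics when a replacement can create new matches):

def replaceany(s, olds, new):
    """Replace every occurrence of each substring in olds with new."""
    for old in olds:
        s = s.replace(old, new)
    return s

print(replaceany('(800), 555-1234', ('(', ')', '-', ' ', ','), ''))
# 8005551234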

An addition to your list is this: add a new msm (multiple string 
match) module that would expose one (or both?) of the algorithms above 
in a match function and provide methods or functions like those in the 
str and re modules. I would be tempted to follow the str versions of 
the APIs as far as possible, and omit a compile function and the 
corresponding methods. (The hidden cache should be enough to avoid 
constant recompiles.)
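
Purely to make that concrete, the surface of such a module might look 
roughly like the sketch below. Every name here is illustrative, and re 
is standing in for a real Aho-Corasick automaton; the lru_cache plays 
the role of the hidden cache, keyed on the target tuple.

from functools import lru_cache
import re

@lru_cache(maxsize=None)
def _matcher(targets):
    # Stand-in for building the multi-string matcher once per
    # distinct target tuple.
    return re.compile('|'.join(map(re.escape, targets)))

def find(s, targets, start=0):
    m = _matcher(targets).search(s, start)
    return m.start() if m else -1

def replace(s, targets, repl):
    return _matcher(targets).sub(repl, s)

print(find('add capsicum or chilli', ('chilli', 'capsicum')))   # 4
print(replace('add capsicum or chilli', ('chilli', 'capsicum'),
              'pepper'))
# add pepper or pepper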

-- 
Terry Jan Reedy


