[Python-ideas] Complicate str methods

Sat Feb 3 18:43:19 EST 2018

On 2/3/2018 5:04 PM, Franklin? Lee wrote:
> Let s be a str. I propose to allow these existing str methods to take 
> params in new forms.

Thanks for the honest title.  As you sort of indicate, these can all be 
done with the re module.  However, you imply loops are needed besides, which 
is mostly not true.  Your complications mostly translate to existing 
calls and hence are not needed.

Perhaps 'Regular Expression HOWTO' could use more examples, or even a 
section on generalizing string methods.  Perhaps the string method doc 
needs a suggestion to use re for multiple string args and references to 
the re howto.  Please consider taking a look at both.

 >>> import re

> s.replace(old, new):
>      Allow passing in a collection of olds.

 >>> re.sub('Franklin|Lee', 'user', 'Franklin? Lee')
'user? user'

Remembering the name change (str.replace versus re.sub) is a nuisance.

>      Allow passing in a single argument, a mapping of olds to news.

This needs to be a separate function, say 'dictsub', that joins the keys 
with '|' and calls re.sub with a function that does the lookup as the 
2nd parameter.  This would be a nice example for the howto.

As you noted, this is generalization of str.translate, and might be 
proposed as a new re module function.
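
A minimal sketch of such a function follows.  The name 'dictsub', its 
argument order, and the longest-key-first ordering are my own choices for 
illustration, not an existing re API.  Keys are passed through re.escape 
so that they are matched as literal text.

```python
import re

def dictsub(mapping, s):
    """Replace each key of mapping found in s with its value, in one pass."""
    if not mapping:
        # An empty alternation would match the empty string everywhere.
        return s
    # Try longer keys first so 'Franklin' wins over 'Frank' at the same spot.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(map(re.escape, keys)))
    # The replacement function looks up whichever key actually matched.
    return pattern.sub(lambda m: mapping[m.group(0)], s)

swapped = dictsub({'Franklin': 'Lee', 'Lee': 'Franklin'}, 'Franklin? Lee')
print(swapped)  # Lee? Franklin
```

Because the text is scanned once, the 'Lee' produced by the first 
replacement is not replaced again.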

>      Allow the olds in the mapping to be tuples of strings.

A minor addition to dictsub.
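
One way that addition might look, as a sketch only: flatten the tuple 
keys into one entry per string before building the pattern.  The helper 
name 'expand_tuple_keys' is hypothetical.

```python
import re

def expand_tuple_keys(mapping):
    """Return a flat dict with one entry per string in each tuple key."""
    flat = {}
    for olds, new in mapping.items():
        # A plain str key is treated as a group of one.
        for old in ((olds,) if isinstance(olds, str) else olds):
            flat[old] = new
    return flat

flat = expand_tuple_keys({('Franklin', 'Lee'): 'user', '?': '!'})
# Longest keys first, escaped so they match literally.
pattern = '|'.join(re.escape(k) for k in sorted(flat, key=len, reverse=True))
result = re.sub(pattern, lambda m: flat[m.group(0)], 'Franklin? Lee')
print(result)  # user! user
```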

> s.split(sep), s.rsplit, s.partition:
>      Allow sep to be a collection of separators.

re.split is already more flexible than non-whitespace str.split and 
str.partition combined.

 >>> re.split('a|e|i|o|u', 'Franklin? Lee')
['Fr', 'nkl', 'n? L', '', '']
 >>> re.split('(a|e|i|o|u)', 'Franklin? Lee')  # multiple partition
['Fr', 'a', 'nkl', 'i', 'n? L', 'e', '', 'e', '']
 >>> re.split('(a|e|i|o|u)', 'Franklin? Lee', 1) # single partition
['Fr', 'a', 'nklin? Lee']

re.split, and hence str.rsplit(collection) are very sensible.

> s.startswith, s.endswith:
>      Allow argument to be a collection of strings.

bool(re.match('|'.join(strings), s)) does exactly the proposed s.startswith 
(apply re.escape to each string first if any contain regex special 
characters), with the advantage that the actual match is available, and 
I think that one would nearly always want to know that match.

 >>> re.match('a|e|i|o|u', 'Franklin? Lee')
 >>> re.match('f|F', 'Franklin? Lee')
<re.Match object; span=(0, 1), match='F'>

re.search with '^' at the beginning or '$' at the end covers both 
proposals, with the added flexibility of using MULTILINE mode to match 
at the beginning or end of lines within the string.
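
As a sketch, the two checks could be wrapped as below.  The function 
names are made up for this example, and both assume a non-empty 
collection (an empty alternation matches the empty string, unlike 
s.startswith(()), which is False).  I use \Z rather than '$' for the 
suffix test because '$' also matches just before a trailing newline.

```python
import re

def starts_with_any(s, prefixes):
    """Return the match for the prefix that begins s, or None."""
    return re.match('|'.join(map(re.escape, prefixes)), s)

def ends_with_any(s, suffixes):
    """Return the match for the suffix that ends s, or None."""
    alternation = '|'.join(map(re.escape, suffixes))
    # The non-capturing group keeps \Z attached to every alternative.
    return re.search('(?:' + alternation + r')\Z', s)

m = starts_with_any('Franklin? Lee', ['Lee', 'Frank'])
print(m.group())  # Frank
m = ends_with_any('Franklin? Lee', ['Frank', 'Lee'])
print(m.group(), m.start())  # Lee 10
```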

> s.find, s.index, s.count, x in s:
>      Similar.
>      These methods are also in `list`, which can't distinguish between 
> items, subsequences, and subsets. However, `str` is already inconsistent 
> with `list` here: list.M looks for an item, while str.M looks for a 
> subsequence.

Comments above apply.  re.search tells you which string matched as well 
as where.  bool(re.search(...)) is 'x in s'.  re.findall and re.finditer 
give much more info than merely a count 
('sum(1 for m in re.finditer(...))').
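
To make that concrete, here is a small sketch of the find / index / 
count / 'in' equivalents for a collection of strings.  The sample 
strings and variable names are only for illustration.  Note that the 
count is of non-overlapping matches.

```python
import re

text = 'Franklin? Lee'
pattern = re.compile('|'.join(map(re.escape, ['Lee', 'in', '?'])))

first = pattern.search(text)
found = first is not None                  # like 'x in s' for any x
where = first.start() if first else -1     # like s.find
which = first.group(0) if first else None  # which string matched first
count = sum(1 for _ in pattern.finditer(text))
spans = [(m.group(0), m.start()) for m in pattern.finditer(text)]

print(found, where, which, count)  # True 6 in 3
print(spans)  # [('in', 6), ('?', 8), ('Lee', 10)]
```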

> s.[r|l]strip:
>      Sadly, these functions already interpret their str arguments as 
> collections of characters.

To avoid this, use re.sub with ^ or $ anchor and '' replacement.

 >>> re.sub('(Frank|Lee)$', '', 'Franklin? Lee')
'Franklin? '
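
A sketch that handles both ends and treats each argument as a whole 
string rather than a set of characters.  The name 'strip_affixes' is 
hypothetical, and unlike str.strip it removes at most one occurrence 
per end.

```python
import re

def strip_affixes(s, affixes):
    """Remove one leading and one trailing affix, matched as whole strings."""
    # Longest first, escaped so the affixes match literally.
    ordered = sorted(affixes, key=len, reverse=True)
    alternation = '|'.join(map(re.escape, ordered))
    s = re.sub(r'\A(?:' + alternation + ')', '', s)
    return re.sub('(?:' + alternation + r')\Z', '', s)

print(repr('keen Frank'.strip('Frank')))             # 'een '
print(repr(strip_affixes('keen Frank', ['Frank'])))  # 'keen '
```

str.strip eats the leading 'k' because 'k' is one of the characters in 
'Frank'; the regex version leaves it alone.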

> These new forms can be optimized internally, as a search for multiple 
> candidate substrings can be more efficient than searching for one at a 
> time.

This is what re does with 's1|s2|...|sn' patterns.

> https://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-string-matches
> 
> The most significant change is on .replace. The others are simple enough 
> to simulate with a loop or something.

No loops needed.

> It is harder to make multiple 
> simultaneous replacements using one .replace at a time, because previous 
> replacements can form new things that look like replaceables.

This problem exists for single string replacement also.  The standard 
solution is to not backtrack and not do overlapping replacements.
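
A short illustration of both points, using made-up sample strings: 
chained .replace calls rescan text that earlier calls produced, while a 
single re.sub pass matches each position of the original once.

```python
import re

text = 'ab'

# Chained calls: the 'b' made by the first call is replaced by the second.
chained = text.replace('a', 'b').replace('b', 'a')

# One pass: every match is taken from the original text.
swap = {'a': 'b', 'b': 'a'}
one_pass = re.sub('a|b', lambda m: swap[m.group(0)], text)

print(chained, one_pass)  # aa ba

# Single-string replacement is likewise left to right and non-overlapping.
print('aaa'.replace('aa', 'b'))  # ba
```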

> The easiest Python solution is to use regex

My claim above is that this is sufficient for all but one case, which 
should be a new function anyway.

> or install some package, which 
> uses (if you're lucky) regex or (if unlucky) doesn't simulate 
> simultaneous replacements. (If possible, just use str.translate.)
> 
> I suppose .split on multiple separators is also annoying to simulate. 
> The two-argument form of .split may be even more of a burden, though I 
> don't know when a limited multiple-separator split is useful. The 
> current best solution is, like before, to use regex, or install a 
> package and hope for the best.

-- 
Terry Jan Reedy