[Python-ideas] Complicate str methods
Terry Reedy
tjreedy at udel.edu
Sat Feb 3 18:43:19 EST 2018
On 2/3/2018 5:04 PM, Franklin? Lee wrote:
> Let s be a str. I propose to allow these existing str methods to take
> params in new forms.
Thanks for the honest title. As you sort of indicate, these can all be
done with re module. However, you imply loops are needed besides, which
is mostly not true. Your complications mostly translate to existing
calls and hence are not needed.
Perhaps 'Regular Expression HOWTO' could use more examples, or even a
section on generalizing string methods. Perhaps the string method doc
needs a suggestion to use re for multiple string args and references to
the re howto. Please consider taking a look at both.
>>> import re
> s.replace(old, new):
> Allow passing in a collection of olds.
>>> re.sub('Franklin|Lee', 'user', 'Franklin? Lee')
'user? user'
Remembering the name change (replace to sub) is a nuisance.
> Allow passing in a single argument, a mapping of olds to news.
This needs to be a separate function, say 'dictsub', that joins the keys
with '|' and calls re.sub with a function that does the lookup as the
2nd parameter. This would be a nice example for the howto.
As you noted, this is a generalization of str.translate, and might be
proposed as a new re module function.
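
A minimal sketch of such a function follows. The name 'dictsub' comes from
the paragraph above; the signature, the use of re.escape, and the
longest-key-first ordering are my assumptions, not an existing re API.

```python
import re

def dictsub(mapping, s):
    """Replace each key of `mapping` found in `s` with its value, in one pass.

    Hypothetical helper: not part of the re module.
    """
    if not mapping:
        return s  # an empty alternation would match the empty string everywhere
    # Try longer keys first so 'Franklin' wins over 'Frank' in the alternation.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes keys such as '?' or '|' match literally.
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # The replacement function looks up whatever text actually matched.
    return pattern.sub(lambda m: mapping[m.group(0)], s)

print(dictsub({'Franklin': 'user', 'Lee': 'name'}, 'Franklin? Lee'))
# user? name
```

Because the scan is a single left-to-right pass, replaced text is never
rescanned, so swapping two strings ({'a': 'b', 'b': 'a'}) works as expected.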
> Allow the olds in the mapping to be tuples of strings.
A minor addition to dictsub.
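
One way that addition might look, as a self-contained sketch: tuple keys are
flattened into single-string keys before the pattern is built. The name
'dictsub_tuples' and its signature are hypothetical.

```python
import re

def dictsub_tuples(mapping, s):
    """Like a mapping-based sub, but a key may be a str or a tuple of strs.

    Hypothetical helper: not part of the re module.
    """
    # Flatten: every string in a tuple key maps to the same replacement.
    flat = {}
    for olds, new in mapping.items():
        if isinstance(olds, str):
            olds = (olds,)
        for old in olds:
            flat[old] = new
    if not flat:
        return s
    # Longest alternatives first, escaped so they match literally.
    keys = sorted(flat, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: flat[m.group(0)], s)

print(dictsub_tuples({('Franklin', 'Lee'): 'user', '?': '!'}, 'Franklin? Lee'))
# user! user
```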
> s.split(sep), s.rsplit, s.partition:
> Allow sep to be a collection of separators.
re.split is already more flexible than non-whitespace str.split and
str.partition combined.
>>> re.split('a|e|i|o|u', 'Franklin? Lee')
['Fr', 'nkl', 'n? L', '', '']
>>> re.split('(a|e|i|o|u)', 'Franklin? Lee') # multiple partition
['Fr', 'a', 'nkl', 'i', 'n? L', 'e', '', 'e', '']
>>> re.split('(a|e|i|o|u)', 'Franklin? Lee', 1) # single partition
['Fr', 'a', 'nklin? Lee']
re.split, and hence str.rsplit(collection) are very sensible.
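
The patterns above are written by hand. For an arbitrary collection of
literal separators, a small wrapper along these lines would be needed; the
name 'multisplit' is hypothetical, and the escaping step is my addition,
since separators like '?' or '|' are regex metacharacters.

```python
import re

def multisplit(s, seps, maxsplit=0):
    """Split `s` on any of the literal separators in `seps`.

    Hypothetical helper. maxsplit=0 means no limit, as in re.split.
    """
    # Longest separators first so ', ' is preferred over ','.
    ordered = sorted(seps, key=len, reverse=True)
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    return re.split(pattern, s, maxsplit=maxsplit)

print(multisplit('a, b;c', [', ', ';']))          # ['a', 'b', 'c']
print(multisplit('a.b|c.d', ['.', '|'], 1))       # ['a', 'b|c.d']
```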
> s.startswith, s.endswith:
> Allow argument to be a collection of strings.
bool(re.match('|'.join(strings), s)) does exactly the proposed s.startswith,
with the advantage that the actual match is available, and I think that
one would nearly always want to know that match.
>>> re.match('a|e|i|o|u', 'Franklin? Lee')
>>> re.match('f|F', 'Franklin? Lee')
<re.Match object; span=(0, 1), match='F'>
re.search with '^' at the beginning or '$' at the end covers both
proposals, with the added flexibility of using MULTILINE mode to match
at the beginning or end of lines within the string.
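
A sketch of both checks, assuming the candidates are literal strings (hence
the re.escape calls). The helper names are hypothetical; each returns the
matched string rather than a bare bool.

```python
import re

def startswith_any(s, prefixes):
    """Return the prefix of `s` that matched, or None (hypothetical helper)."""
    m = re.match('|'.join(re.escape(p) for p in prefixes), s)
    return m.group(0) if m else None

def endswith_any(s, suffixes):
    """Return the suffix of `s` that matched, or None (hypothetical helper)."""
    alternatives = '|'.join(re.escape(x) for x in suffixes)
    # '$' also matches just before a trailing newline; use r'\Z' to be strict.
    m = re.search('(?:' + alternatives + ')$', s)
    return m.group(0) if m else None

print(startswith_any('Franklin? Lee', ['f', 'F']))       # F
print(endswith_any('Franklin? Lee', ['Frank', 'Lee']))   # Lee

# MULTILINE lets '^' match at the start of every line, not only the string.
line_starts = re.findall('^(?:a|b)', 'apple\nbanana\ncherry', re.MULTILINE)
print(line_starts)                                       # ['a', 'b']
```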
> s.find, s.index, s.count, x in s:
> Similar.
> These methods are also in `list`, which can't distinguish between
> items, subsequences, and subsets. However, `str` is already inconsistent
> with `list` here: list.M looks for an item, while str.M looks for a
> subsequence.
Comments above apply. re.search tells you which string matched as well
as where. bool(re.search(...)) is 'x in s'. re.findall and re.finditer give
much more info than merely a count ('sum(1 for m in re.finditer(...))').
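
For illustration, here is how each of those lookups might be spelled with a
compiled pattern. The variable names are mine; the count is of
non-overlapping matches, which is also what str.count reports.

```python
import re

pattern = re.compile('Frank|Lee')
s = 'Franklin? Lee'

first = pattern.search(s)
found = first is not None                 # like `x in s`
index = first.start() if first else -1    # like s.find
which = first.group(0) if first else None # which candidate matched

# Every non-overlapping match, with its position.
matches = [(m.start(), m.group(0)) for m in pattern.finditer(s)]
count = len(matches)                      # like s.count

print(found, index, which)   # True 0 Frank
print(matches)               # [(0, 'Frank'), (10, 'Lee')]
print(count)                 # 2
```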
> s.[r|l]strip:
> Sadly, these functions already interpret their str arguments as
> collections of characters.
To avoid this, use re.sub with ^ or $ anchor and '' replacement.
>>> re.sub('(Frank|Lee)$', '', 'Franklin? Lee')
'Franklin? '
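
The difference from the character-set behaviour shows up on a string like
'banana'. The helper names below are hypothetical, and escaping the
candidates is my assumption that they are meant literally.

```python
import re

def rstrip_suffix(s, suffixes):
    """Remove one whole-string suffix from the end of `s` (hypothetical helper)."""
    alternatives = '|'.join(re.escape(x) for x in suffixes)
    return re.sub('(?:' + alternatives + ')$', '', s)

def lstrip_prefix(s, prefixes):
    """Remove one whole-string prefix from the start of `s` (hypothetical helper)."""
    alternatives = '|'.join(re.escape(x) for x in prefixes)
    return re.sub('^(?:' + alternatives + ')', '', s)

print('banana'.rstrip('na'))             # b     (strips any of 'n', 'a')
print(rstrip_suffix('banana', ['na']))   # bana  (strips the string 'na' once)
print(lstrip_prefix('Franklin? Lee', ['Frank', 'Lee']))   # lin? Lee
```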
> These new forms can be optimized internally, as a search for multiple
> candidate substrings can be more efficient than searching for one at a
> time.
This is what re does with 's1|s2|...|sn' patterns.
> https://stackoverflow.com/questions/3260962/algorithm-to-find-multiple-string-matches
>
> The most significant change is on .replace. The others are simple enough
> to simulate with a loop or something.
No loops needed.
> It is harder to make multiple
> simultaneous replacements using one .replace at a time, because previous
> replacements can form new things that look like replaceables.
This problem exists for single string replacement also. The standard
solution is to not backtrack and not do overlapping replacements.
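
A short demonstration of both points, using throwaway strings of my own
choosing: chained .replace calls rescan earlier output, while a single
re.sub pass does not, and both str.replace and re.sub skip overlapping
matches.

```python
import re

s = 'ab'
swap = {'a': 'b', 'b': 'a'}

# Chained replaces: the 'b' produced by the first call is replaced again.
sequential = s.replace('a', 'b').replace('b', 'a')

# One pass: each position is matched once and the output is never rescanned.
simultaneous = re.sub('a|b', lambda m: swap[m.group(0)], s)

print(sequential)     # aa
print(simultaneous)   # ba

# Non-overlapping, left to right: only the first 'aa' in 'aaa' is replaced.
print('aaa'.replace('aa', 'b'))   # ba
print(re.sub('aa', 'b', 'aaa'))   # ba
```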
> The easiest Python solution is to use regex
My claim above is that this is sufficient for all but one case, which
should be a new function anyway.
> or install some package, which
> uses (if you're lucky) regex or (if unlucky) doesn't simulate
> simultaneous replacements. (If possible, just use str.translate.)
>
> I suppose .split on multiple separators is also annoying to simulate.
> The two-argument form of .split may be even more of a burden, though I
> don't know when a limited multiple-separator split is useful. The
> current best solution is, like before, to use regex, or install a
> package and hope for the best.
--
Terry Jan Reedy