[Python-ideas] New explicit methods to trim strings
steve at pearwood.info
Sun Mar 31 00:43:25 EDT 2019
On Sun, Mar 31, 2019 at 03:05:59AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
> > The correct solution is a verbose statement:
> > if string.startswith("spam"):
> > string = string[:len("spam")]
> This is harder to write than I thought! (The slice should be
> 'len("spam"):'.) But s/The/A/:
> string = re.sub("^spam", "", string)
Indeed, you're right that there can be other solutions, but whether they
are "correct" depends on how one defines correct :-)
I don't consider something that pulls in the heavy bulldozer of regexes
to crack this peanut to be the right way to solve the problem, but YMMV.
But for what it's worth, a regex solution is likely to be significantly
slower -- see below.
> And a slightly incorrect solution (unless you really do want to remove
> all spam, which most people do, but might not apply to "tooth"):
> string = string.replace("spam", "")
Sorry, that's not "slightly" incorrect, that is completely incorrect,
for precisely the reason you state: it replaces *all* matching
substrings, not just the leading prefix.
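To make the difference concrete, here is a minimal sketch. The sample string is my own illustration, not something from the thread; the guarded slice is the corrected form of the verbose statement discussed earlier.

```python
string = "spam, eggs and spam"

# str.replace() removes *every* occurrence of the substring.
replaced = string.replace("spam", "")

# A prefix cut only touches the leading occurrence.
if string.startswith("spam"):
    trimmed = string[len("spam"):]
else:
    trimmed = string

print(repr(replaced))  # ', eggs and '
print(repr(trimmed))   # ', eggs and spam'
```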
I don't see a way to easily use replace to implement a prefix cut. I
suppose one might do:
string = string[:-len(suffix)] + string[-len(suffix):].replace(suffix, '')
but I haven't tried it and it sure isn't what I would call easy or obvious.
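For anyone who wants to try the one-liner above, here is a small sketch that wraps it in a function. The name cut_suffix_via_replace and the sample inputs are hypothetical, chosen only for illustration.

```python
def cut_suffix_via_replace(string, suffix):
    # The replace-based one-liner from the message, unchanged.
    # The tail slice holds at most len(suffix) characters, so replace()
    # can only remove it when the tail equals the suffix exactly.
    return string[:-len(suffix)] + string[-len(suffix):].replace(suffix, '')

print(cut_suffix_via_replace("spam eggs cheese", "eese"))  # spam eggs ch
print(cut_suffix_via_replace("cheese cheese", "cheese"))   # only the trailing one goes
print(cut_suffix_via_replace("spam", "eggs"))              # no match: unchanged
```

In these cases it seems to behave as a suffix cut, but it takes some staring at the slices to see why.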
> > A pair of "cut" methods (cut prefix, cut suffix) fills a real need,
> But do they, really? Do we really need multiple new methods to
> replace a dual-use one-liner, which also handles
> outfile = re.sub("\\.bmp$", ".jpg", infile)
Solutions based on regexes are far less discoverable:
- all those people who have reported "bugs" in lstrip() and rstrip()
could have thought of using a regex instead but didn't;
- they involve reading what is effectively another programming language
which uses cryptic symbols like "$" instead of words like "suffix".
We aren't the Perl community where regexes are the first hammer we reach
for every time we need to drive a screw :-)
I had to read your re.sub() call twice before I convinced myself that it
only replaced a suffix. And we also have to deal with the case where
we want to delete a substring containing metacharacters:
re.sub(r'\\\.\$$', '', string) # cut literal \.$ suffix
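One way to avoid escaping by hand is re.escape(). The following is only a sketch: the helper name cut_suffix_re is hypothetical, and I have used \Z rather than $ as the anchor, since $ also matches just before a trailing newline.

```python
import re

def cut_suffix_re(string, suffix):
    # re.escape() quotes any regex metacharacters in the suffix so it is
    # matched literally; \Z anchors the match at the very end of the string.
    return re.sub(re.escape(suffix) + r"\Z", "", string)

print(cut_suffix_re(r"path\.$", r"\.$"))  # path
print(cut_suffix_re("photo.bmp", ".bmp"))  # photo
print(cut_suffix_re("photoxbmp", ".bmp"))  # unchanged: '.' is literal here
```

That works, but the reader now has to know about re.escape() and \Z as well as re.sub(), which rather supports the discoverability point.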
Additionally, a regex solution is likely to be slower than even a
pure-Python solution, let alone a string method. On my computer, regexes
are three times slower than a Python function:
$ python3.5 -m timeit -s "import re" "re.sub('eese$', '', 'spam eggs cheese')"
100000 loops, best of 3: 3.75 usec per loop
$ python3.5 -m timeit -s "def rcut(suff, s): return s[:-len(suff)] if s.endswith(suff) else s" "rcut('eese', 'spam eggs cheese')"
1000000 loops, best of 3: 1.22 usec per loop
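The same comparison can be run from a script rather than the command line. This is a sketch only: the loop count is arbitrary, and the absolute numbers will differ between machines and Python versions.

```python
import re
import timeit

def rcut(suff, s):
    # Same pure-Python helper as in the timing run above.
    return s[:-len(suff)] if s.endswith(suff) else s

# Time 100,000 calls of each approach on the same input.
regex_time = timeit.timeit(
    lambda: re.sub('eese$', '', 'spam eggs cheese'), number=100000)
rcut_time = timeit.timeit(
    lambda: rcut('eese', 'spam eggs cheese'), number=100000)

print("re.sub: %.3fs  rcut: %.3fs" % (regex_time, rcut_time))
```

One caveat about rcut as written: with an empty suffix, s.endswith('') is true and s[:-0] is the same as s[:0], so it returns an empty string. A production helper would want to guard against that.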
> in one line? I concede that the same argument was made against
> startswith/endswith, and they cleared the bar. Python is a lot more
> complex now, though, and I think the predicates are more frequently
> > and will avoid a lot of mistaken bug reports/questions.
> That depends on analogies to other languages.
I don't think it matters that much.
Of course it doesn't help if you come to Python from a language where
strip() deletes a prefix or suffix, but even if you don't, as I don't,
there's something about the pattern:
string = string.lstrip("spam")
which looks like it ought to remove a prefix rather than a set of
characters. I've fallen for that error myself.
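A short illustration of the trap, using sample strings of my own choosing:

```python
# lstrip() treats its argument as a *set* of characters, not a prefix.
text = "spamalot"
stripped = text.lstrip("spam")  # strips leading 's', 'p', 'a', 'm' in any order

# The same trap with rstrip() and a file extension: sometimes it
# happens to give the expected answer, sometimes it eats too much.
lucky = "image.jpg".rstrip(".jpg")
unlucky = "ping.jpg".rstrip(".jpg")

print(stripped)  # lot
print(lucky)     # image
print(unlucky)   # pin
```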
> Coming from Emacs, I'm
> not at all surprised that .strip takes a character class as an
> argument and strips until it runs into a character not in the class.
And neither am I... until I forget, and get surprised that it doesn't
work that way.
This post is already too long, so in the interest of brevity and my
dignity I'll skip the anecdote about the time I too blundered publicly
about the "bug" in [lr]strip.
> Evidently others have different intuition. If that's from English,
> and they know about cutprefix/cutsuffix, yeah, they won't make the
> mistake. If it's from another programming language they know, or they
> don't know about cutprefix, they may just write "string.strip('.jpg')"
> without thinking about it and it (sometimes) works, then they report a
> bug when it doesn't. Remember, these folks are not understanding the
> docs, and very likely not reading them.
It's not reasonable to hold the failure of the proposed new methods
to prevent *all* erroneous uses of [lr]strip against them.
Short of breaking backwards compatibility and changing strip() to
there's always going to be *someone* who makes a mistake. But with an
easily discoverable alternative available, the number of such errors
should plummet as people gradually migrate to 3.8 or above.
> > As for the disruption,
> The word is "complexity". Where do you get "disruption" from?
If you had read the text I quoted before trimming it, you would have
seen that it was from Chris Barker:
On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
> This proposal would provide a minor gain for an even more minor disruption.
I try very hard to provide enough context that my comments are
understandable, and I don't always succeed, but the reader has to meet
me part way by at least skimming the quoted text for context before
questioning me :-)
> > code is a cost, but there is also the uncounted opportunity cost of
> > *not* adding this useful battery.
> Obviously some people think it's useful. Nobody denies that.
Well, further on you do question whether it meets a real need, so there
is at least one :-)
> problem is *measuring* the opportunity cost of not having the battery,
> or the "usefulness" of the battery, as well as measuring the cost of
We have never objectively measured these things before, because they
can't be. We don't even have a good, objective measurement of complexity
of the language -- but if we did, I'm pretty sure that adding a pair of
fairly simple, self-explanatory methods to the str class would not
increase it by much.
We're on steadier ground if we talk about complexity of the user's code.
In that case, whether we measure the complexity of a program by lines of
code or number of functions or some other more complicated measurement,
it ought to be self-evident that being able to replace a helper function
with a built-in will slightly reduce complexity.
For the sake of the argument, if we can decrease the complexity of a
thousand user programs by 1 LOC each, at the cost of increasing the
complexity of the interpreter by 100 LOC, isn't that a cost worth
paying? I think it is.
> Please stop caricaturing those who oppose the change as
That's a grossly unjust misrepresentation of my arguments.
Nothing I have said can be fairly read as a caricature of the opposing
point, let alone as attacks on others for being Luddites. On the
contrary: *twice* I have acknowledged that a level of caution about
adding new features is justified.
My argument is that in *this* case, the cost-benefit analysis falls
firmly on the "benefit" side, not that any opposition is misguided.
Whereas your attack on me comes perilously close to poisoning the
well: "oh, pay no attention to his arguments, he is the sort of
person who caricatures those who disagree as Luddites".
> > I can only think of one scenario where this change might
> > break someone's code:
> Again, who claimed it would break code?
Any addition of a new feature has the risk of breaking code, and we
ought to consider that possibility.
> It's not obvious to me from the names that the startswith/endswith
> test is included in the method, although on reflection it would be
> weird if it wasn't.
Agreed. We can't be completely explicit about everything, it isn't
> Still, I wouldn't be surprised to see
> if string.startswith("spam"):
> in a new user's code.
That's the sort of inefficient code newbies often write, and the fix for
that is experience and education. I'm not worried about that, just as
I'm not worried about newbies writing:
if string.startswith(" "):
string = string.lstrip(" ")
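The guard is redundant in both cases because the test is part of the operation. Here is a sketch of how the proposed methods might behave, written as plain functions. The names follow the cutprefix/cutsuffix spelling from this thread, but the exact semantics (return the string unchanged when there is no match) are my assumption, not a settled specification.

```python
def cutprefix(string, prefix):
    # Assumed semantics: remove `prefix` once if present, else no change.
    if prefix and string.startswith(prefix):
        return string[len(prefix):]
    return string

def cutsuffix(string, suffix):
    # The emptiness check avoids string[:-0], which would return ''.
    if suffix and string.endswith(suffix):
        return string[:-len(suffix)]
    return string

print(cutprefix("spam and eggs", "spam"))  # ' and eggs'
print(cutprefix("eggs and spam", "spam"))  # unchanged, no guard needed
print(cutsuffix("photo.jpg", ".jpg"))      # 'photo'
```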
> You're wrong about "no significant downsides," in the sense that
> that's the wrong criterion. The right criterion is "if we add a slew
> of features that clear the same bar, does the total added benefit from
> that set exceed the cost?" The answer to that question is not a
> trivial extrapolation from the question you did ask, because the
> benefits will increase approximately linearly in the number of such
> features, but the cost of additional complexity is generally
I disagree that the benefits of new features scale linearly.
There's a certain benefit to having (say) str.strip, and a certain
benefit of having (say) string slicing, and a certain benefit of
having (say) str.upper, but being able to do *all three* is
much more powerful than just being able to do one or another.
And I have no idea about the "additional complexity" (of what? the
language? the interpreter?) because we don't really have a good way of
measuring complexity of a language.
> I also disagree they meet a real need, as explained above. They're
> merely convenient.
I don't understand how you can question whether or not people need to
cut prefixes and suffixes in the face of people writing code to cut
prefixes and suffixes. (Sometimes *wrong* code.)
We have had a few people on this list explicitly state that they cut
prefixes and suffixes, there's the evidence of the dozens of people who
misused strip() to cut prefixes and suffixes, and there's history of
people asking how to do it:
This same question comes up time and time again, and you're questioning
whether people need to do it.
Contrast to a hypothetical suggested feature which doesn't meet a real
need (or at least nobody has suggested one as yet): Jonathon Fine's
suggestion that we define a generalised "string subtraction" operator.
Jonathon explained that this is well-defined within the bounds of free
groups and category theory. That's great, but being well-defined is
only the first step. What would we use a generalised string subtraction
for? What need does it meet?
There are easy cases:
"abcd" - "d" # remove a suffix
-"a" + "abcd" # remove a prefix
but in the full generality, it isn't clear what "abcd" - "z" would be
useful for. Lacking a use-case for full string subtraction, we can
reject adding it as a builtin feature or even a stdlib module.
> And the bikeshedding isn't hard. In the list above, cutprefix/
> cutsuffix are far and away the best.
Well I'm glad we agree on that, even if nothing else :-)