On Sun, Mar 31, 2019 at 03:05:59AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>> The correct solution is a verbose statement:
>>     if string.startswith("spam"): string = string[:len("spam")]
> This is harder to write than I thought! (The slice should be
> 'len("spam"):'.) But s/The/A/:
>     string = re.sub("^spam", "", string)
Indeed, you're right that there can be other solutions, but whether they are "correct" depends on how one defines correct :-) I don't consider something that pulls in the heavy bulldozer of regexes to crack this peanut to be the right way to solve the problem, but YMMV. But for what it's worth, a regex solution is likely to be significantly slower -- see below.
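For what it's worth, even the regex one-liner stops being a one-liner once the prefix isn't a fixed literal: an arbitrary prefix has to go through re.escape() first, while the startswith() version needs no such care. A quick sketch (cut_prefix and cut_prefix_re are made-up helper names, not existing APIs):

```python
import re

def cut_prefix(string, prefix):
    # Plain-string version: explicit test, then slice off the prefix.
    if string.startswith(prefix):
        return string[len(prefix):]
    return string

def cut_prefix_re(string, prefix):
    # Regex version: the prefix must go through re.escape(), otherwise
    # metacharacters like "." or "$" change the pattern's meaning.
    return re.sub("^" + re.escape(prefix), "", string)

print(cut_prefix("spam eggs", "spam"))   # prints " eggs"
print(cut_prefix_re("1.0-beta", "1.0"))  # prints "-beta"
```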
> And a slightly incorrect solution (unless you really do want to remove all
> spam, which most people do, but might not apply to "tooth"):
>     string = string.replace("spam", "")
Sorry, that's not "slightly" incorrect, that is completely incorrect, for precisely the reason you state: it replaces *all* matching substrings, not just the leading prefix. I don't see a way to easily use replace() to implement a prefix cut. I suppose one might do:

    string = string[:-len(suffix)] + string[-len(suffix):].replace(suffix, '')

but I haven't tried it, and it sure isn't what I would call easy or obvious.
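For the record, that replace()-on-a-slice trick does appear to behave as a suffix cut when sketched out, though I wouldn't defend it as readable (the helper name below is made up for illustration):

```python
def cut_suffix_via_replace(string, suffix):
    # Restrict replace() to the last len(suffix) characters, so only
    # a trailing occurrence of the suffix can be removed.
    return string[:-len(suffix)] + string[-len(suffix):].replace(suffix, "")

print(cut_suffix_via_replace("photo.bmp", ".bmp"))  # prints "photo"
print(cut_suffix_via_replace("bmp.photo", ".bmp"))  # unchanged: "bmp.photo"
```

The tail window is exactly len(suffix) characters long, so any match inside it must be the suffix itself at the very end of the string; that is why earlier occurrences survive.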
>> A pair of "cut" methods (cut prefix, cut suffix) fills a real need,
> But do they, really? Do we really need multiple new methods to replace
> a dual-use one-liner, which also handles
>     outfile = re.sub("\\.bmp$", ".jpg", infile)
Solutions based on regexes are far less discoverable:

- all those people who have reported "bugs" in lstrip() and rstrip() could have thought of using a regex instead but didn't;

- they involve reading what is effectively another programming language which uses cryptic symbols like "$" instead of words like "suffix".

We aren't the Perl community where regexes are the first hammer we reach for every time we need to drive a screw :-)

I had to read your re.sub() call twice before I convinced myself that it only replaced a suffix. And we also have to deal with the case where we want to delete a substring containing metacharacters:

    # Ouch!
    re.sub(r'\\\.\$$', '', string)  # cut literal \.$ suffix

Additionally, a regex solution is likely to be slower than even a pure-Python solution, let alone a string method. On my computer, regexes are three times slower than a Python function:

    $ python3.5 -m timeit -s "import re" "re.sub('eese$', '', 'spam eggs cheese')"
    100000 loops, best of 3: 3.75 usec per loop

    $ python3.5 -m timeit -s "def rcut(suff, s): return s[:-len(suff)] if s.endswith(suff) else s" "rcut('eese', 'spam eggs cheese')"
    1000000 loops, best of 3: 1.22 usec per loop
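Spelled out in full, the pure-Python helpers being compared above look like this (a sketch; cutprefix and cutsuffix are the *proposed* names, not methods that exist today):

```python
def cutprefix(s, prefix):
    # Remove prefix from the start of s if present, else return s unchanged.
    return s[len(prefix):] if s.startswith(prefix) else s

def cutsuffix(s, suffix):
    # Remove suffix from the end of s if present, else return s unchanged.
    # The explicit guard avoids s[:-0], which would wrongly return "".
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

print(cutsuffix("spam eggs cheese", "eese"))  # prints "spam eggs ch"
print(cutprefix("spam eggs", "spam "))        # prints "eggs"
```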
> in one line? I concede that the same argument was made against
> startswith/endswith, and they cleared the bar. Python is a lot more
> complex now, though, and I think the predicates are more frequently useful.
>> and will avoid a lot of mistaken bug reports/questions.
> That depends on analogies to other languages.
I don't think it matters that much. Of course it doesn't help if you come to Python from a language where strip() deletes a prefix or suffix, but even if you don't, as I don't, there's something about the pattern:

    string = string.lstrip("spam")

which looks like it ought to remove a prefix rather than a set of characters. I've fallen for that error myself.
> Coming from Emacs, I'm not at all surprised that .strip takes a character
> class as an argument and strips until it runs into a character not in the class.
And neither am I... until I forget, and get surprised that it doesn't work that way. This post is already too long, so in the interest of brevity and my dignity I'll skip the anecdote about the time I too blundered publicly about the "bug" in [lr]strip.
> Evidently others have different intuition. If that's from English, and
> they know about cutprefix/cutsuffix, yeah, they won't make the mistake.
> If it's from another programming language they know, or they don't know
> about cutprefix, they may just write "string.strip('.jpg')" without
> thinking about it and it (sometimes) works, then they report a bug when
> it doesn't. Remember, these folks are not understanding the docs, and
> very likely not reading them.
It's not reasonable to hold the failure of the proposed new methods to prevent *all* erroneous uses of [lr]strip against them. Short of breaking backwards compatibility and changing strip() to remove_characters_from_a_set_not_a_substring_read_the_docs_before_reporting_any_bugs(), there's always going to be *someone* who makes a mistake. But with an easily discoverable alternative available, the number of such errors should plummet as people gradually migrate to 3.8 or above.
>> As for the disruption,
> The word is "complexity". Where do you get "disruption" from?
If you had read the text I quoted before trimming it, you would have seen that it was from Chris Barker:

On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
> This proposal would provide a minor gain for an even more minor disruption.
I try very hard to provide enough context that my comments are understandable, and I don't always succeed, but the reader has to meet me part way by at least skimming the quoted text for context before questioning me :-)
>> code is a cost, but there is also the uncounted opportunity cost of
>> *not* adding this useful battery.
> Obviously some people think it's useful. Nobody denies that.
Well, further on you do question whether it meets a real need, so there is at least one :-)
> The problem is *measuring* the opportunity cost of not having the battery,
> or the "usefulness" of the battery, as well as measuring the cost of complexity.
We have never objectively measured these things before, because they can't be. We don't even have a good, objective measurement of complexity of the language -- but if we did, I'm pretty sure that adding a pair of fairly simple, self-explanatory methods to the str class would not increase it by much.

We're on steadier ground if we talk about complexity of the user's code. In that case, whether we measure the complexity of a program by lines of code or number of functions or some other more complicated measurement, it ought to be self-evident that being able to replace a helper function with a built-in will slightly reduce complexity.

For the sake of the argument, if we can decrease the complexity of a thousand user programs by 1 LOC each, at the cost of increasing the complexity of the interpreter by 100 LOC, isn't that a cost worth paying? I think it is.
> Please stop caricaturing those who oppose the change as Luddites.
That's a grossly unjust misrepresentation of my arguments. Nothing I have said can be fairly read as a caricature of the opposing position, let alone as an attack on others for being Luddites. On the contrary: *twice* I have acknowledged that a level of caution about adding new features is justified.

My argument is that in *this* case, the cost-benefit analysis falls firmly on the "benefit" side, not that any opposition is misguided. Whereas your attack on me comes perilously close to poisoning the well: "oh, pay no attention to his arguments, he is the sort of person who caricatures those who disagree as Luddites".
>> I can only think of one scenario where this change might break someone's code:
> Again, who claimed it would break code?
Any addition of a new feature has the risk of breaking code, and we ought to consider that possibility. [...]
> It's not obvious to me from the names that the startswith/endswith test
> is included in the method, although on reflection it would be weird if it wasn't.
Agreed. We can't be completely explicit about everything, it isn't practical:

    math.trigonometric_sine_where_the_angle_is_measured_in_radians(x)
> Still, I wouldn't be surprised to see
>     if string.startswith("spam"): string.cutprefix("spam")
> in a new user's code.
That's the sort of inefficient code newbies often write, and the fix for that is experience and education. I'm not worried about that, just as I'm not worried about newbies writing:

    if string.startswith(" "):
        string = string.lstrip(" ")
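There are actually two newbie traps in that quoted snippet: the redundant guard, and the discarded return value (strings are immutable, so a cutprefix method would have to *return* the new string, not modify in place). A sketch using a stand-in function, since no such str method exists yet:

```python
def cutprefix(s, prefix):
    # A no-op when the prefix is absent, so no startswith() guard is needed.
    return s[len(prefix):] if s.startswith(prefix) else s

string = "spam and eggs"
cutprefix(string, "spam")           # wrong: result discarded, string unchanged
string = cutprefix(string, "spam")  # right: rebind the name to the result
print(string)                       # prints " and eggs"
```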
> You're wrong about "no significant downsides," in the sense that that's
> the wrong criterion. The right criterion is "if we add a slew of features
> that clear the same bar, does the total added benefit from that set exceed
> the cost?" The answer to that question is not a trivial extrapolation from
> the question you did ask, because the benefits will increase approximately
> linearly in the number of such features, but the cost of additional
> complexity is generally superlinear.
I disagree that the benefits of new features scale linearly. There's a certain benefit to having (say) str.strip, and a certain benefit of having (say) string slicing, and a certain benefit of having (say) str.upper, but being able to do *all three* is much more powerful than just being able to do one or another. And I have no idea about the "additional complexity" (of what? the language? the interpreter?) because we don't really have a good way of measuring complexity of a language.
> I also disagree they meet a real need, as explained above. They're merely convenient.
I don't understand how you can question whether or not people need to cut prefixes and suffixes in the face of people writing code to cut prefixes and suffixes. (Sometimes *wrong* code.) We have had a few people on this list explicitly state that they cut prefixes and suffixes, there's the evidence of the dozens of people who misused strip() to cut prefixes and suffixes, and there's a history of people asking how to do it:

https://stackoverflow.com/questions/599953/how-to-remove-the-left-part-of-a-...
https://stackoverflow.com/questions/16891340/remove-a-prefix-from-a-string
https://stackoverflow.com/questions/1038824/how-do-i-remove-a-substring-from...
https://codereview.stackexchange.com/questions/33817/remove-prefix-and-remov...
https://www.quora.com/Whats-the-best-way-to-remove-a-suffix-of-a-string-in-P...
https://stackoverflow.com/questions/3663450/python-remove-substring-only-at-...

This same question comes up time and time again, and you're questioning whether people need to do it.

Contrast to a hypothetical suggested feature which doesn't meet a real need (or at least nobody has suggested one yet): Jonathon Fine's suggestion that we define a generalised "string subtraction" operator. Jonathon explained that this is well-defined within the bounds of free groups and category theory. That's great, but being well-defined is only the first step. What would we use a generalised string subtraction for? What need does it meet? There are easy cases:

    "abcd" - "d"   # remove a suffix
    -"a" + "abcd"  # remove a prefix

but in full generality, it isn't clear what "abcd" - "z" would be useful for. Lacking a use-case for full string subtraction, we can reject adding it as a builtin feature or even a stdlib module.
> And the bikeshedding isn't hard. In the list above, cutprefix/
> cutsuffix are far and away the best.
Well I'm glad we agree on that, even if nothing else :-)

-- 
Steven