[Python-ideas] New explicit methods to trim strings

Sun Mar 31 00:43:25 EDT 2019

On Sun, Mar 31, 2019 at 03:05:59AM +0900, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
> 
>  > The correct solution is a verbose statement:
>  > 
>  >     if string.startswith("spam"):
>  >         string = string[:len("spam")]
> 
> This is harder to write than I thought!  (The slice should be
> 'len("spam"):'.)  But s/The/A/:
>
>     string = re.sub("^spam", "", string)

Indeed, you're right that there can be other solutions, but whether they 
are "correct" depends on how one defines correct :-)

I don't consider something that pulls in the heavy bulldozer of regexes 
to crack this peanut to be the right way to solve the problem, but YMMV.

But for what it's worth, a regex solution is likely to be significantly 
slower -- see below.

> And a slightly incorrect solution (unless you really do want to remove
> all spam, which most people do, but might not apply to "tooth"):
> 
>     string = string.replace("spam", "")

Sorry, that's not "slightly" incorrect, that is completely incorrect, 
for precisely the reason you state: it replaces *all* matching 
substrings, not just the leading prefix.

I don't see a way to easily use replace to implement a prefix cut. For 
a suffix, I suppose one might do:

    string = string[:-len(suffix)] + string[-len(suffix):].replace(suffix, '')

but I haven't tried it and it sure isn't what I would call easy or 
obvious.
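
For comparison, here is a minimal sketch of the verbose approach wrapped 
in a pair of helper functions. The names cut_prefix and cut_suffix are 
purely illustrative (they are not existing str methods), and the 
empty-affix guard is my own assumption about the desired behaviour:

```python
def cut_prefix(string, prefix):
    """Return string without a leading prefix, if it has one."""
    # startswith("") is always True, so guard against an empty prefix.
    if prefix and string.startswith(prefix):
        return string[len(prefix):]
    return string


def cut_suffix(string, suffix):
    """Return string without a trailing suffix, if it has one."""
    # The guard matters here: string[:-0] would be the empty string.
    if suffix and string.endswith(suffix):
        return string[:-len(suffix)]
    return string


print(repr(cut_prefix("spam and eggs", "spam")))
print(repr(cut_suffix("spam eggs cheese", "eese")))
```

Only one occurrence is removed, and only at the relevant end of the 
string, which is what distinguishes this from replace().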

>  > A pair of "cut" methods (cut prefix, cut suffix) fills a real need,
> 
> But do they, really?  Do we really need multiple new methods to
> replace a dual-use one-liner, which also handles
> 
>     outfile = re.sub("\\.bmp$", ".jpg", infile)

Solutions based on regexes are far less discoverable:

- all those people who have reported "bugs" in lstrip() and rstrip() 
  could have thought of using a regex instead but didn't;

- they involve reading what is effectively another programming language
  which uses cryptic symbols like "$" instead of words like "suffix".

We aren't the Perl community where regexes are the first hammer we reach 
for every time we need to drive a screw :-)
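
The extension swap quoted above does not need a regex either. A rough 
sketch, assuming a hypothetical helper called replace_suffix (not an 
existing function anywhere that I know of):

```python
def replace_suffix(name, old, new):
    """Swap a trailing `old` for `new`; leave other strings alone."""
    # Only the very end of the string is examined, so an `old` that
    # appears elsewhere in the name is left untouched.
    if old and name.endswith(old):
        return name[:-len(old)] + new
    return name


infile = "holiday.bmp"
outfile = replace_suffix(infile, ".bmp", ".jpg")
print(outfile)
```

It is longer than the re.sub() call, but every step is spelled out in 
words rather than symbols.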

I had to read your re.sub() call twice before I convinced myself that it 
only replaced a suffix. And we also have to deal with the case where 
we want to delete a substring containing metacharacters:

    # Ouch!
    re.sub(r'\\\.\$$', '', string)  # cut literal \.$ suffix
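
If one insists on a regex, the escaping can at least be delegated to 
re.escape() instead of being done by hand. This is only a sketch: the 
function name regex_cut_suffix is made up for illustration, and I have 
used \Z rather than $ on the assumption that a trailing newline should 
not be skipped over ($ also matches just before a final newline):

```python
import re


def regex_cut_suffix(string, suffix):
    """Cut a literal suffix using a regex built with re.escape()."""
    # re.escape() backslash-escapes the metacharacters in `suffix`,
    # and \Z anchors the match at the very end of the string.
    pattern = re.escape(suffix) + r"\Z"
    return re.sub(pattern, "", string)


print(repr(regex_cut_suffix(r"price\.$", r"\.$")))
```

That removes the "Ouch!", but it is still two library calls and an 
anchor to get right.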

Additionally, a regex solution is likely to be slower than even a 
pure-Python solution, let alone a string method. On my computer, the 
regex is about three times slower than a Python function:

$ python3.5 -m timeit -s "import re" "re.sub('eese$', '', 'spam eggs cheese')"
100000 loops, best of 3: 3.75 usec per loop

$ python3.5 -m timeit -s "def rcut(suff, s): return s[:-len(suff)] if s.endswith(suff) else s" "rcut('eese', 'spam eggs cheese')"
1000000 loops, best of 3: 1.22 usec per loop
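
The same comparison can be run as a script using the timeit module. 
This is a sketch only: regex_cut is a name I have made up here, the 
lambda wrappers add a little overhead of their own, and the numbers 
will differ from machine to machine and version to version:

```python
import re
import timeit


def rcut(suff, s):
    # Same logic as the one-liner timed above; assumes suff is non-empty.
    return s[:-len(suff)] if s.endswith(suff) else s


def regex_cut(s):
    return re.sub('eese$', '', s)


text = 'spam eggs cheese'
n = 100000

# Total seconds for n calls of each version.
t_regex = timeit.timeit(lambda: regex_cut(text), number=n)
t_slice = timeit.timeit(lambda: rcut('eese', text), number=n)

print("regex: %.2f usec per loop" % (t_regex / n * 1e6))
print("slice: %.2f usec per loop" % (t_slice / n * 1e6))
```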

> in one line?  I concede that the same argument was made against
> startswith/endswith, and they cleared the bar.  Python is a lot more
> complex now, though, and I think the predicates are more frequently
> useful.
> 
>  > and will avoid a lot of mistaken bug reports/questions.
> 
> That depends on analogies to other languages.

I don't think it matters that much.

Of course it doesn't help if you come to Python from a language where 
strip() deletes a prefix or suffix, but even if you don't, as I don't, 
there's something about the pattern:

    string = string.lstrip("spam")

which looks like it ought to remove a prefix rather than a set of 
characters. I've fallen for that error myself.
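
A few examples of how the character-set behaviour bites. The sample 
strings are my own, chosen for illustration:

```python
# The argument is a set of characters to strip, not a prefix or suffix.
looks_right = "spam and eggs".lstrip("spam")
surprise = "maps.txt".lstrip("spam")
both_ends = "pig.jpg".strip(".jpg")

print(repr(looks_right))  # ' and eggs' -- happens to look like a prefix cut
print(repr(surprise))     # '.txt'      -- "maps" is made of the same letters
print(repr(both_ends))    # 'i'         -- characters eaten from both ends
```

The first case is what makes the mistake so easy: it gives the expected 
answer, so the misunderstanding goes unnoticed until different data 
comes along.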

> Coming from Emacs, I'm
> not at all surprised that .strip takes a character class as an
> argument and strips until it runs into a character not in the class.

And neither am I... until I forget, and get surprised that it doesn't 
work that way.

This post is already too long, so in the interest of brevity and my 
dignity I'll skip the anecdote about the time I too blundered publicly 
about the "bug" in [lr]strip.

> Evidently others have different intuition.  If that's from English,
> and they know about cutprefix/cutsuffix, yeah, they won't make the
> mistake.  If it's from another programming language they know, or they
> don't know about cutprefix, they may just write "string.strip('.jpg')"
> without thinking about it and it (sometimes) works, then they report a
> bug when it doesn't.  Remember, these folks are not understanding the
> docs, and very likely not reading them.

It's not reasonable to hold the failure of the proposed new methods 
to prevent *all* erroneous uses of [lr]strip against them.

Short of breaking backwards compatibility and changing strip() to 
remove_characters_from_a_set_not_a_substring_read_the_docs_before_reporting_any_bugs() 
there's always going to be *someone* who makes a mistake. But with an 
easily discoverable alternative available, the number of such errors 
should plummet as people gradually migrate to 3.8 or above.

>  > As for the disruption,
> 
> The word is "complexity".  Where do you get "disruption" from?

If you had read the text I quoted before trimming it, you would have 
seen that it was from Chris Barker:

 On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
 > This proposal would provide a minor gain for an even more minor disruption.

I try very hard to provide enough context that my comments are 
understandable, and I don't always succeed, but the reader has to meet 
me part way by at least skimming the quoted text for context before 
questioning me :-)

>  > code is a cost, but there is also the uncounted opportunity cost of 
>  > *not* adding this useful battery.
> 
> Obviously some people think it's useful.  Nobody denies that.

Well, further on you do question whether it meets a real need, so there 
is at least one :-)

> The
> problem is *measuring* the opportunity cost of not having the battery,
> or the "usefulness" of the battery, as well as measuring the cost of
> complexity.

We have never objectively measured these things before, because they 
can't be. We don't even have a good, objective measurement of complexity 
of the language -- but if we did, I'm pretty sure that adding a pair of 
fairly simple, self-explanatory methods to the str class would not 
increase it by much.

We're on steadier ground if we talk about complexity of the user's code. 
In that case, whether we measure the complexity of a program by lines of 
code or number of functions or some other more complicated measurement, 
it ought to be self-evident that being able to replace a helper function 
with a built-in will slightly reduce complexity.

For the sake of the argument, if we can decrease the complexity of a 
thousand user programs by 1 LOC each, at the cost of increasing the 
complexity of the interpreter by 100 LOC, isn't that a cost worth 
paying? I think it is.

> Please stop caricaturing those who oppose the change as
> Luddites.

That's a grossly unjust misrepresentation of my arguments.

Nothing I have said can be fairly read as a caricature of the opposing 
point, let alone as attacks on others for being Luddites. On the 
contrary: *twice* I have acknowledged that a level of caution about 
adding new features is justified.

My argument is that in *this* case, the cost-benefit analysis falls 
firmly on the "benefit" side, not that any opposition is misguided.

Whereas your attack on me comes perilously close to poisoning the 
well: "oh, pay no attention to his arguments, he is the sort of 
person who caricatures those who disagree as Luddites".

>  > I can only think of one scenario where this change might 
>  > break someone's code:
> 
> Again, who claimed it would break code?

Any addition of a new feature has the risk of breaking code, and we 
ought to consider that possibility.

[...]
> It's not obvious to me from the names that the startswith/endswith
> test is included in the method, although on reflection it would be
> weird if it wasn't.

Agreed. We can't be completely explicit about everything, it isn't 
practical:

    math.trigonometric_sine_where_the_angle_is_measured_in_radians(x)

> Still, I wouldn't be surprised to see
> 
>     if string.startswith("spam"):
>         string.cutprefix("spam")
> 
> in a new user's code.

That's the sort of inefficient code newbies often write, and the fix for 
that is experience and education. I'm not worried about that, just as 
I'm not worried about newbies writing:

    if string.startswith(" "):
        string = string.lstrip(" ")

> You're wrong about "no significant downsides," in the sense that
> that's the wrong criterion.  The right criterion is "if we add a slew
> of features that clear the same bar, does the total added benefit from
> that set exceed the cost?"  The answer to that question is not a
> trivial extrapolation from the question you did ask, because the
> benefits will increase approximately linearly in the number of such
> features, but the cost of additional complexity is generally
> superlinear.

I disagree that the benefits of new features scale linearly.

There's a certain benefit to having (say) str.strip, and a certain 
benefit of having (say) string slicing, and a certain benefit of 
having (say) str.upper, but being able to do *all three* is 
much more powerful than just being able to do one or another.

And I have no idea about the "additional complexity" (of what? the 
language? the interpreter?) because we don't really have a good way of 
measuring complexity of a language.

> I also disagree they meet a real need, as explained above.  They're
> merely convenient.

I don't understand how you can question whether or not people need to 
cut prefixes and suffixes in the face of people writing code to cut 
prefixes and suffixes. (Sometimes *wrong* code.)

We have had a few people on this list explicitly state that they cut 
prefixes and suffixes, there's the evidence of the dozens of people who 
misused strip() to cut prefixes and suffixes, and there's history of 
people asking how to do it:

https://stackoverflow.com/questions/599953/how-to-remove-the-left-part-of-a-string
https://stackoverflow.com/questions/16891340/remove-a-prefix-from-a-string
https://stackoverflow.com/questions/1038824/how-do-i-remove-a-substring-from-the-end-of-a-string-in-python
https://codereview.stackexchange.com/questions/33817/remove-prefix-and-remove-suffix-functions
https://www.quora.com/Whats-the-best-way-to-remove-a-suffix-of-a-string-in-Python
https://stackoverflow.com/questions/3663450/python-remove-substring-only-at-the-end-of-string

This same question comes up time and time again, and you're questioning 
whether people need to do it.

Contrast that with a hypothetical suggested feature which doesn't meet a 
real need (or at least nobody has suggested one as yet): Jonathon Fine's 
suggestion that we define a generalised "string subtraction" operator.

Jonathon explained that this is well-defined within the bounds of free 
groups and category theory. That's great, but being well-defined is 
only the first step. What would we use a generalised string subtraction 
for? What need does it meet?

There are easy cases:

    "abcd" - "d"   # remove a suffix
    -"a" + "abcd"  # remove a prefix

but in the full generality, it isn't clear what "abcd" - "z" would be 
useful for. Lacking a use-case for full string subtraction, we can 
reject adding it as a builtin feature or even a stdlib module.

> And the bikeshedding isn't hard.  In the list above, cutprefix/
> cutsuffix are far and away the best.

Well I'm glad we agree on that, even if nothing else :-)

-- 
Steven