[Python-ideas] New explicit methods to trim strings

Sun Mar 31 23:29:44 EDT 2019

On Mon, Apr 1, 2019 at 12:35 PM Steven D'Aprano <steve at pearwood.info> wrote:
>
> On Sun, Mar 31, 2019 at 08:23:05PM -0400, David Mertz wrote:
> > On Sun, Mar 31, 2019, 8:11 PM Steven D'Aprano <steve at pearwood.info> wrote:
> >
> > > Regarding later proposals to add support for multiple affixes, to
> > > recursively delete the affix repeatedly, and to take an additional
> > > argument to limit how many affixes will be removed: YAGNI.
> > >
> >
> > That's simply not true, and I think it's clearly illustrated by the example
> > I gave a few times. Not just conceivably, but FREQUENTLY I write code to
> > accomplish the effect of the suggested:
> >
> >   basename = fname.rstrip(('.jpg', '.gif', '.png'))
> >
> > I probably do this MORE OFTEN than removing a single suffix.
>
> Okay.
>
> Yesterday, you stated that you didn't care what the behaviour was for
> the multiple affix case. You made it clear that "any" semantics would be
> okay with you so long as it was documented. You seemed to feel so
> strongly about your indifference that you mentioned it in two separate
> emails.

The multiple affix case has exactly two forms:

1) Tearing multiple affixes off (eg stripping "asdf.jpg.png" down to
just "asdf"), which most people are saying "no, don't do that, it
doesn't make sense and isn't needed"
2) Removing one of several options, which implies that one option is a
strict subpiece of another (eg stripping off "test" and "st")

If anyone is advocating for #1, I would agree with saying YAGNI. But
#2 is an extremely unlikely edge case, and whatever semantics are
chosen for it, *normal* usage will not be affected. In the example
that David gave, there is no way for "first wins" or "longest wins" or
anything like that to make any difference, because it's impossible for
there to be multiple candidates.

Since this would be going into the language as a feature, the
semantics have to be clearly defined (with "first match wins",
"longest match wins", and "raise exception" being probably the most
plausible options), but most of us aren't going to care which one is
picked.
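
As a rough sketch of why the choice rarely matters (the helper names
strip_suffix_first and strip_suffix_longest are hypothetical, made up
for illustration, and are not a proposed API), compare two of the
candidate policies:

```python
def strip_suffix_first(s, suffixes):
    """Remove the first listed suffix that matches (hypothetical helper)."""
    for suffix in suffixes:
        if suffix and s.endswith(suffix):
            return s[:-len(suffix)]
    return s


def strip_suffix_longest(s, suffixes):
    """Remove the longest matching suffix (hypothetical helper)."""
    matches = [suffix for suffix in suffixes if suffix and s.endswith(suffix)]
    if not matches:
        return s
    return s[:-len(max(matches, key=len))]


exts = ('.jpg', '.gif', '.png')

# No extension here is a suffix of another, so both policies agree.
print(strip_suffix_first('photo.png', exts))
print(strip_suffix_longest('photo.png', exts))

# The policies differ only when one candidate is a suffix of another.
print(strip_suffix_first('contest', ('st', 'test')))    # removes 'st'
print(strip_suffix_longest('contest', ('st', 'test')))  # removes 'test'
```

Both helpers remove at most one affix, so "asdf.jpg.png" would only
lose its ".png".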

> That doesn't sound like someone who has a clear use-case in mind. If
> you're doing this frequently, then surely one of the following two
> alternatives apply:
>
> (1) One specific behaviour makes sense for all or a majority of your
> use-cases, in which case you would prefer that behaviour rather than
> something that you can't use.
>
> (2) Or there is no single useful behaviour that you want, perhaps all or
> a majority of your use-cases are different, and you'll usually need to
> write your own helper function to suit your own usage, no matter what
> the builtin behaviour is. Hence you don't care what the builtin
> behaviour is.

Or all the behaviours actually do the same thing anyway.

> Lacking a good set of semantics for removing multiple affixes at once,
> we shouldn't rush to guess what people want. You don't even know what
> behaviour YOU want, let alone what the community as a whole needs.

We're basically debating collision semantics here. It's on par with
asking "how should statistics.mode() cope with multiple modes?".
Should the introduction of statistics.mode() have been delayed pending
a thorough review of use-cases, or is it okay to make it do what most
people want, and then be prepared to revisit its edge-case handling?

(For those who don't know, mode() was changed in 3.8 to return the
first mode encountered, in contrast to previous behaviour where it
would raise an exception.)
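
A small illustration of that edge case, assuming Python 3.8 or later
(the sample data is made up):

```python
import statistics

data = [1, 1, 2, 2, 3]  # two values tie for most common

# On 3.8+, mode() returns the first mode encountered instead of raising.
print(statistics.mode(data))

# multimode() (also added in 3.8) returns every mode, in first-seen order.
print(statistics.multimode(data))
```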

> For the use-case of stripping a single file extension out of a set of
> such extensions, while leaving all others, there's an obvious solution:
>
>     if fname.endswith(('.jpg', '.png', '.gif')):
>         basename = os.path.splitext(fname)[0]
>     else:
>         # Any other extension stays with the base.
>         # (Presumably to be handled separately?)
>         basename = fname

Sure, but I've often wanted to do something like "strip off a prefix
of http:// or https://", or something else that doesn't have a
semantic that's known to the stdlib. Also, this is still fairly
verbose, and a lot of people are going to reach for a regex, just
because it can be done in one line of code.
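
For instance, here is a minimal sketch of the two approaches people
end up writing today. The strip_prefix helper is hypothetical, and the
URL is just a placeholder:

```python
import re


def strip_prefix(s, prefixes):
    """Remove at most one matching prefix (hypothetical helper)."""
    for prefix in prefixes:
        if prefix and s.startswith(prefix):
            return s[len(prefix):]
    return s


url = 'https://example.com/index.html'  # placeholder URL

# Explicit helper: check each candidate prefix in turn.
print(strip_prefix(url, ('http://', 'https://')))

# The one-line regex that many people reach for instead.
print(re.sub(r'^https?://', '', url))
```

Neither prefix is a prefix of the other, so the order of the tuple
makes no difference here.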

> I posted links to prior art. Unless I missed something, not one of those
> languages or libraries supports multiple affixes in the one call.

And they don't support multiple affixes in startswith/endswith either,
but we're very happy to have that in Python.

The parallel is strong. You ask if it has a prefix, you remove the
prefix. You ask if it has multiple prefixes, you remove any one of
those prefixes. We don't have to worry about edge cases that are
unlikely to come up in real-world code, just as long as the semantics
ARE defined somewhere.

ChrisA