[Python-ideas] New explicit methods to trim strings
Steven D'Aprano
steve at pearwood.info
Mon Apr 1 20:52:52 EDT 2019
On Mon, Apr 01, 2019 at 02:29:44PM +1100, Chris Angelico wrote:
> The multiple affix case has exactly two forms:
>
> 1) Tearing multiple affixes off (eg stripping "asdf.jpg.png" down to
> just "asdf"), which most people are saying "no, don't do that, it
> doesn't make sense and isn't needed"
Perhaps I've missed something obvious (it's been a long thread, and I'm
badly distracted with hardware issues that are causing me some
considerable grief), but I haven't seen anyone say "don't do that".
But I have seen David Mertz say that this was the best behaviour:
[quote]
fname = 'silly.jpg.png.gif.png.jpg.gif.jpg'
I'm honestly not sure what behavior would be useful most often for
this oddball case. For the suffixes, I think "remove them all" is
probably the best
[end quote]
I'd also like to point out that this is not an oddball case. There are
two popular platforms where file extensions are advisory not mandatory
(Linux and Mac), but even on Windows it is possible to get files with
multiple, meaningful, extensions (foo.tar.gz for example) as well as
periods used in place of spaces (a.funny.cat.video.mp4).
> 2) Removing one of several options, which implies that one option is a
> strict subpiece of another (eg stripping off "test" and "st")
I take it you're only referring to the problematic cases, because
there's also a third option, where none of the affixes to be removed clash:
spam.cut_suffix(("ed", "ing"))
But that's pretty uninteresting and a simple loop or repeated call to
the method will work fine:
spam.cut_suffix("ed").cut_suffix("ing")
just as we do with replace:
spam.replace(",", "").replace(" ", "")
If you only have a few affixes to work with, this is fine. If you have a
lot, you may want a helper function, but that's okay.
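For example, a minimal sketch of such a helper using only existing
string methods (the name cut_suffixes is hypothetical, and it removes
at most one suffix, taking the first match):

    def cut_suffixes(s, suffixes):
        # Remove the first matching suffix from s, if any.
        for suffix in suffixes:
            if suffix and s.endswith(suffix):
                return s[:-len(suffix)]
        return s

    cut_suffixes("walked", ("ed", "ing"))   # -> "walk"
    cut_suffixes("walking", ("ed", "ing"))  # -> "walk"
    cut_suffixes("spam", ("ed", "ing"))     # -> "spam"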
> If anyone is advocating for #1, I would agree with saying YAGNI.
David Mertz did.
> But #2 is an extremely unlikely edge case, and whatever semantics are
> chosen for it, *normal* usage will not be affected.
Not just unlikely, but "extremely" unlikely?
Presumably you didn't just pluck that statement out of thin air, but
have based it on an objective and statistically representative review of
existing code and projections of future uses of these new methods. How
could I possibly argue with that?
Except to say that I think it is recklessly irresponsible for people
engaged in language design to dismiss edge cases which will cause users
real bugs and real pain so easily. We're not designing for our personal
toolbox, we're designing for hundreds of thousands of other people with
widely varying needs.
It might be rare for you, but for somebody it will be happening ten
times a day. And for somebody else, it will only happen once a year, but
when it does, their code won't raise an exception; it will just silently
do the wrong thing.
This is why replace does not take a set of multiple targets to replace.
The user, who knows their own use-case and what behaviour they want, can
write their own multiple-replace function, and we don't have to guess
what they want.
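For instance, one possible shape for such a function (the name
multi_replace is hypothetical; another user might just as reasonably
want single-pass or longest-match semantics instead):

    def multi_replace(s, targets, replacement=""):
        # Apply str.replace once per target, in the caller's chosen order.
        for target in targets:
            s = s.replace(target, replacement)
        return s

    multi_replace("spam, eggs, and ham", (",", " "))  # -> "spameggsandham"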
The point I am making is not that we must not ever support multiple
affixes, but that we shouldn't rush that decision. Let's pick the
low-hanging fruit, and get some real-world experience with the function
before deciding how to handle the multiple affix case.
[...]
> Or all the behaviours actually do the same thing anyway.
In this thread, I keep hearing this message:
"My own personal use-case will never be affected by clashing affixes, so
I don't care what behaviour we build into the language, so long as we
pick something RIGHT NOW and don't give the people actually affected
time to use the method and decide what works best in practice for them."
As with the str.replace method, the final answer might be "there is no
best behaviour and we should refuse to choose".
Why are we rushing to permanently enshrine one specific behaviour into
the builtins before any of the users of the feature have a chance to use
it and decide for themselves which suits them best?
Now is better than never.
Although never is often better than *right* now.
Somebody (I won't name names, but they know who they are) wrote to me
off-list some time ago and accused me of being arrogant and thinking I
know more than everyone else. Well perhaps I am, but I'm not so arrogant
as to think that I can choose the right behaviour for clashing affixes
for other people when my own use-cases don't have clashing affixes.
[...]
> Sure, but I've often wanted to do something like "strip off a prefix
> of http:// or https://", or something else that doesn't have a
> semantic that's known to the stdlib.
I presume there's a reason you aren't using urllib.parse and you just
need a string without the leading scheme. If you're doing further
parsing, the stdlib has the right batteries for that.
(Aside: perhaps urllib.parse.ParseResult should get an attribute to
return the URL minus the scheme? That seems like it would be useful.)
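A rough sketch of what that could look like with urllib.parse as it
stands today (strip_scheme is purely illustrative, not an existing
stdlib function):

    from urllib.parse import urlsplit, urlunsplit

    def strip_scheme(url):
        # Rebuild the URL with an empty scheme, then drop the leading "//".
        parts = urlsplit(url)
        return urlunsplit(("",) + parts[1:]).lstrip("/")

    strip_scheme("https://example.com/cat.png")  # -> "example.com/cat.png"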
> Also, this is still fairly
> verbose, and a lot of people are going to reach for a regex, just
> because it can be done in one line of code.
Okay, they will use a regex. Is this a problem? We're not planning on
banning regexes, are we? If they're happy using regexes, and don't care
that it will be perhaps 3 times slower, let them.
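Presumably something along these lines (an illustrative sketch, not
anyone's actual code):

    import re

    url = "https://example.com/cat.png"
    # Drop an optional http:// or https:// prefix in one line.
    bare = re.sub(r"^https?://", "", url)  # -> "example.com/cat.png"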
> > I posted links to prior art. Unless I missed something, not one of those
> > languages or libraries supports multiple affixes in the one call.
>
> And they don't support multiple affixes in startswith/endswith either,
> but we're very happy to have that in Python.
But not until we had a couple of releases of experience with them:
https://docs.python.org/2.7/library/stdtypes.html#str.endswith
And .replace still only takes a single target to be replaced.
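For reference, the tuple form that startswith/endswith eventually
gained is unambiguous even when the affixes clash, because the result
is only a boolean:

    fname = "asdf.jpg.png"
    fname.endswith((".jpg", ".png"))  # True -- which one matched doesn't matter
    fname.startswith(("as", "asdf"))  # True -- the overlap is harmless here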
[...]
> We don't have to worry about edge cases that are
> unlikely to come up in real-world code,
And you are making that pronouncement on the basis of what? Your gut
feeling? Perhaps you're thinking too narrowly.
Here's a partial list of English prefixes that somebody doing text
processing might want to remove to get at the root word:
a an ante anti auto circum co com con contra contro de dis
en ex extra hyper il im in ir inter intra intro macro micro
mono non omni post pre pro sub sym syn tele un uni up
I count fourteen clashes:
a: an ante anti
an: ante anti
co: com con contra contro
ex: extra
in: inter intra intro
un: uni
(That's over a third of this admittedly incomplete list of prefixes.)
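To make the ambiguity concrete, take one of those clashing pairs (a
sketch with plain slicing, since the proposed method doesn't exist yet):

    word = "interact"
    # Cutting the prefixes ("in", "inter") could plausibly mean either:
    word[len("in"):]     # "teract" -- first/shortest match wins
    word[len("inter"):]  # "act"    -- longest match wins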
I can think of at least one English suffix pair that clashes: -ify, -fy.
How about other languages? How comfortable are you saying that nobody
doing text processing in German or Hindi will ever need to deal with
clashing affixes?
--
Steven