[Python-ideas] New explicit methods to trim strings
cs at cskk.id.au
Tue Apr 2 18:58:07 EDT 2019
On 02Apr2019 12:23, Paul Moore <p.f.moore at gmail.com> wrote:
>On Tue, 2 Apr 2019 at 12:07, Rhodri James <rhodri at kynesim.co.uk> wrote:
>> So far we have two slightly dubious use-cases.
>> 1. Stripping file extensions. Personally I find that treating filenames
>> like filenames (i.e. using os.path or (nowadays) pathlib) results in me
>> thinking more appropriately about what I'm doing.
>I'd go further and say that filename manipulation is a great example
>of a place where generic string functions should definitely *not* be
Filename manipulation on a path _component_ is generally pretty reliable
(yes one can break things by, say, inserting os.sep).
I do a fair bit of filename fiddling using string functions, and these
fall into 3 categories off the top of my head:
- file extensions, and here I do use splitext()
- trimming extensions (only barely a second case), and it turns out the
only case I could easily find using the endswith/[:-offset]
incantation would probably go just as well with splitext()
- normalising pathnames; as an example, for the home media library I
routinely downcase filenames, convert whitespace into a dash, separate
fields with "--" (eg episode designator vs title) and convert _ into a
colon (hello Mac Finder and UI file save dialogues, a holdover
compatibility mode from OS9)
None of these seem to benefit directly from having a cutprefix/cutsuffix
method. But splitext aside, I'm generally fiddling a pathname component
(and usually a basename), and in that domain the general string
functions are very handy and well used.
So I think "filename" (basename) fiddling with str methods is actually
pretty reasonable. It is _pathname_ fiddling that is hazardous, because
the path separators often need to be treated specially.
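To make the basename/pathname distinction concrete, here is a minimal sketch of the sort of normalisation described above. The function names are invented for illustration, and the "--" field separation is left out because it depends on knowing where the fields are. The str methods only ever see the basename; os.path deals with the separators.

```python
import os.path


def normalise_basename(basename):
    """Normalise a single path component (assumed to contain no os.sep)."""
    root, ext = os.path.splitext(basename)
    root = root.lower()               # downcase
    root = "-".join(root.split())     # each run of whitespace becomes one dash
    root = root.replace("_", ":")     # underscore becomes a colon
    return root + ext.lower()


def normalise_path(path):
    """Split off the basename with os.path, and fiddle only that part."""
    dirname, basename = os.path.split(path)
    return os.path.join(dirname, normalise_basename(basename))
```

The directory part is passed through untouched, which is where the hazards of treating a whole pathname as a plain string are avoided.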
>> 2. Stripping prefixes and suffixes to get to root words. Python has
>> been used for natural language work for over a decade, and I don't think
>> I've heard any great call from linguists for the functionality. English
>> isn't a girl who puts out like that on a first date :-) There are too
>> many common exception cases for such a straightforward approach not to
>> cause confusion.
>Agreed, using prefix/suffix stripping on natural language is at best a
Yeah. I was looking at the prefix list from a related article and seeing
"intra" and thinking "intractable". Hacky indeed. _Unless_ the word has
already been qualified as suitable for the action. And once it is, a
cutprefix method would indeed be handy.
>> 3. My most common use case (not very common at that) is for stripping
>> annoying prompts off text-based APIs. I'm happy using .startswith() and
>> string slicing for that, though your point about the repeated use of the
>> string to be stripped off (or worse, hard-coding its length) is well made.
In some ways the verbosity and bugproneness are my personal use case for
cutprefix/cutsuffix (however spelt):
- repeating the string is wordy and requires human eyeballing whenever I
read it (to check for correctness); the same applies whenever I write
such a piece of code - personally I'm quite prone to off-by-one errors
when hand writing variations on this
- a well named method is more readable and expresses intent better (the
same argument holds for a standalone function, though a method is a
- the anecdotally not uncommon misuse of .strip() where .cutsuffix()
would be correct
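As a small illustration of the .strip() misuse and of the longhand form it should have been, using a made-up filename:

```python
filename = "happy.py"

# rstrip() takes a set of characters, not a suffix: it keeps removing
# any trailing ".", "p" or "y" until it meets some other character
misused = filename.rstrip(".py")

# the longhand suffix trim has to mention the suffix (or its length) twice
suffix = ".py"
trimmed = filename[:-len(suffix)] if filename.endswith(suffix) else filename
```

Here misused comes out as "ha" while trimmed is "happy". The misuse is easy to miss because it happens to work for many filenames.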
I confess being a little surprised at how few examples which could use
cutsuffix I found in my own code, where I had expected it to be common.
I find several bits like this:
# parsing text which may have \r\n line endings
line = line[:-1]
# parsing a UNIX network interface listing from ifconfig,
# which varies platform to platform
ifname = ifname[:-1]
Here I DO NOT want rstrip() because I want to strip only one character,
rather than as many as there are. So: the optional trailing marker in
some input. But doing this for single character markers is much easier
to get right than the broader case with longer suffixes, so I think this
is not a very strong case.
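The difference from rstrip() only shows when the marker is repeated. A short sketch with an invented input line ending in two carriage returns:

```python
line = "data\r\r"

# rstrip() removes every trailing "\r", however many there are
all_removed = line.rstrip("\r")

# the endswith/slice form removes at most one
one_removed = line[:-1] if line.endswith("\r") else line
```

Here all_removed is "data" and one_removed is "data\r", which is what I want when the marker is optional but the rest of the text must be left alone.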
Fiddling the domain suffix on an email address:
if not addr.endswith(old_domain):
    raise ValueError('addr does not end in old_domain')
addr2 = addr[:-len(old_domain)] + new_domain
which would be a good fit, _except_ for the sanity check. However, that
sanity check is just one of a few preceding the change, so in fact this
is a good fit.
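Wrapped up as a function, the longhand version looks roughly like this. The name swap_domain is made up for the example, and the empty-suffix check is my addition: with an empty old_domain the slice becomes addr[:-0], which is addr[:0], i.e. the empty string.

```python
def swap_domain(addr, old_domain, new_domain):
    """Replace a trailing old_domain on addr with new_domain."""
    if not old_domain:
        # addr[:-0] would silently discard the whole address
        raise ValueError('old_domain may not be empty')
    if not addr.endswith(old_domain):
        raise ValueError('addr does not end in old_domain')
    return addr[:-len(old_domain)] + new_domain
```

For example, swap_domain('cs@example.com', '@example.com', '@example.org') gives 'cs@example.org'.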
I have a few classes which annotate their instances with some magic
attributes. Here's a snippet from a class' __getattr__ for a db schema:
# *_table ==> table "*"
nickname = attr[:-6]
if nickname in self.table_by_nickname:
There's a little suite of "match attribute suffix, trim and do something
specific with what's left" if statements. However, they are almost all
of the form above, so rewriting it like this:
# *_table ==> table "*"
nickname = attr.cutsuffix('_table')
if nickname in self.table_by_nickname:
is a small improvement. Every magic number (the "6" above) is an
opportunity for bugs.
>> I am beginning to worry slightly that actually there are usually more
>> appropriate things to do than simply cutting off affixes, and that in
>> providing these particular batteries we might be encouraging poor practise.
>It would be really helpful if someone could go through the various use
>cases presented in this thread and classify them - filename
>manipulation, natural language uses, and "other".
Surprisingly for me, the big subjective win is avoiding misuse of
lstrip/rstrip by having obvious better named alternatives for affix
Short summary: in my own code I find opportunities for an affix trim
method less common than I had expected. But I still like the "might find
it useful" argument.
I think I find "might find it useful" more compelling than many do. Let
I think a _well_ _defined_ battery is worth including in the kit (str
- the operation is simple and well defined: people won't be confused by
its purpose, and when they want it there is a reliable debugged method
sitting there ready for use
- variations on this get written _all the time_, and writing those
variations using the method is more readable and more reliable
- the existing .strip battery is misused for this purpose by accident
I have in the past found myself arguing for adding little tools like
this in agile teams, and getting a lot of resistance. The resistance
tended to take these forms:
- YAGNI. While the tiny battery _can_ be written longhand, every time
that happens makes for needlessly verbose code, is an opportunity for
stupid bugs, and makes code whose purpose must be _deduced_ rather
than doing what it says on the tin
- not in this ticket: this leads to a starvation issue - the battery
never goes in with any ticket, and a ticket just for the battery never
gets chosen for a sprint
- we've already got this other battery; subtext "not needed" or "we
don't want 2 ways to do this", my subtext "does it worse, or does
something which only _looks_ like this purpose". Classic example from
the codebase I was in at the time was SQL parameter insertion.
Eventually I said "... this" and wrote the battery anyway.
My position on cut*affix is that (a) it is easy to implement (b) it can
thus be debugged once (c) it makes code clearer when used (d) it reduces
the likelihood of .strip() misuse.
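To back up point (a), here is a minimal sketch written as standalone functions, since str itself cannot be extended from pure Python. The names follow the spelling used in this thread; returning the string unchanged when the affix is absent is my assumption about the semantics, not a settled part of the proposal.

```python
def cutprefix(s, prefix):
    """Return s without a leading prefix, or s unchanged if it is absent."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


def cutsuffix(s, suffix):
    """Return s without a trailing suffix, or s unchanged if it is absent."""
    # the emptiness check matters: s[:-0] would be the empty string
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s
```

With these, cutsuffix('users_table', '_table') gives 'users', and a string lacking the suffix comes back as is. The length arithmetic and the empty-affix corner case live in one place, which is point (b).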
Cameron Simpson <cs at cskk.id.au>