Re: [Python-ideas] New explicit methods to trim strings

2 Apr 2019

      On 02Apr2019 12:23, Paul Moore <p.f.moore@gmail.com> wrote:
...
On Tue, 2 Apr 2019 at 12:07, Rhodri James <rhodri@kynesim.co.uk> wrote:
...
So far we have two slightly dubious use-cases.
1. Stripping file extensions.  Personally I find that treating filenames
like filenames (i.e. using os.path or (nowadays) pathlib) results in me
thinking more appropriately about what I'm doing.
I'd go further and say that filename manipulation is a great example
of a place where generic string functions should definitely *not* be
used.
Filename manipulation on a path _component_ is generally pretty reliable 
(yes one can break things by, say, inserting os.sep).

I do a fair bit of filename fiddling using string functions, and these 
fall into 3 categories off the top of my head:

- file extensions, and here I do use splitext()

- trimming extensions (only barely a second case), and it turns out the 
  only case I could easily find using the endswith/[:-offset] 
  incantation would probably go just as well with splitext()

- normalising pathnames; as an example, for the home media library I 
  routinely downcase filenames, convert whitespace into a dash, separate 
  fields with "--" (eg episode designator vs title) and convert _ into a 
  colon (hello Mac Finder and UI file save dialogues, a holdover 
  compatibility mode from OS9)

None of these seem to benefit directly from having a cutprefix/cutsuffix 
method.  But splitext aside, I'm generally fiddling a pathname component 
(and usually a basename), and in that domain the general string 
functions are very handy and well used.

So I think "filename" (basename) fiddling with str methods is actually 
pretty reasonable. It is _pathname_ fiddling that is hazardous, because 
the path separators often need to be treated specially.
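
A rough sketch of that distinction, assuming a made-up helper name (normalise_basename) and only the downcase and whitespace-to-dash steps from my list above: the path is split with os.path first, and the plain str/re operations only ever see the basename.

```python
import os.path
import re

def normalise_basename(path):
    # Hypothetical helper, for illustration only.
    # Split off the final component so the directory part and its
    # separators are never touched by the string fiddling below.
    dirpath, base = os.path.split(path)
    stem, ext = os.path.splitext(base)
    # Generic string operations, applied to the basename stem only.
    stem = re.sub(r"\s+", "-", stem.lower())
    return os.path.join(dirpath, stem + ext)
```

The directory name keeps its case and spaces; only the last component is rewritten.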
...
...
2. Stripping prefixes and suffixes to get to root words.  Python has
been used for natural language work for over a decade, and I don't think
I've heard any great call from linguists for the functionality.  English
isn't a girl who puts out like that on a first date :-)  There are too
many common exception cases for such a straightforward approach not to
cause confusion.
Agreed, using prefix/suffix stripping on natural language is at best a
"quick hack".
Yeah. I was looking at the prefix list from a related article and seeing 
"intra" and thinking "intractable". Hacky indeed. _Unless_ the word has 
already been qualified as suitable for the action. And once it is, a 
cutprefix method would indeed be handy.
...
...
3. My most common use case (not very common at that) is for stripping
annoying prompts off text-based APIs.  I'm happy using .startswith() and
string slicing for that, though your point about the repeated use of the
string to be stripped off (or worse, hard-coding its length) is well made.
In some ways the verbosity and bugproneness is my personal use case for 
cutprefix/cutsuffix (however spelt):

- repeating the string is wordy and requires human eyeballing whenever I 
  read it (to check for correctness); the same applies whenever I write 
  such a piece of code - personally I'm quite prone to off-by-one errors 
  when hand writing variations on this

- a well named method is more readable and expresses intent better (the 
  same argument holds for a standalone function, though a method is a 
  bit better)

- the anecdotally not uncommon misuse of .strip() where .cutsuffix() 
  would be correct
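
To make the last point concrete, here is a small illustration of the misuse (the filename is invented for the example). str.rstrip() treats its argument as a set of characters, not as a suffix, so it can eat more than intended:

```python
name = "report.txt"

# Misuse: strips any trailing run of the characters '.', 't', 'x',
# which also removes the final "t" of "report".
stripped = name.rstrip(".txt")

# Longhand suffix trim: removes the suffix exactly once, if present.
suffix = ".txt"
trimmed = name[:-len(suffix)] if name.endswith(suffix) else name

print(repr(stripped))  # 'repor'
print(repr(trimmed))   # 'report'
```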

I confess being a little surprised at how few examples which could use 
cutsuffix I found in my own code, where I had expected it to be common.

I find several bits like this:

     # parsing text which may have \r\n line endings
     if line.endswith('\r'):
       line = line[:-1]

     # parsing a UNIX network interface listing from ifconfig,
     # which varies platform to platform
     if ifname.endswith(':'):
       ifname = ifname[:-1]

Here I DO NOT want rstrip() because I want to strip only one character, 
rather than as many as there are. The pattern is an optional trailing 
marker in some input. But doing this for single character markers is much 
easier to get right than the broader case with longer suffixes, so I 
think this is not a very strong case.
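
A quick illustration of the difference, using an invented input with a doubled trailing marker:

```python
line = "data\r\r"

# rstrip() removes every trailing '\r'.
all_gone = line.rstrip("\r")

# The endswith/slice form removes at most one.
one_gone = line[:-1] if line.endswith("\r") else line

print(repr(all_gone))  # 'data'
print(repr(one_gone))  # 'data\r'
```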

Fiddling the domain suffix on an email address:

     if not addr.endswith(old_domain):
       raise ValueError('addr does not end in old_domain')
     addr2 = addr[:-len(old_domain)] + new_domain

which would be a good fit, _except_ for the sanity check. However, that 
sanity check is just one of a few preceding the change, so in fact this 
is a good fit.
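
One trap in the longhand form is worth spelling out: if old_domain is ever the empty string then addr[:-len(old_domain)] is addr[:0], which is the empty string, not addr. A sketch that avoids this by slicing from the front (replace_domain is a hypothetical name, and the addresses are invented):

```python
def replace_domain(addr, old_domain, new_domain):
    # Hypothetical helper, for illustration only.
    if not addr.endswith(old_domain):
        raise ValueError('addr does not end in old_domain')
    # Slice by an explicit end index: safe even when old_domain is ''.
    return addr[:len(addr) - len(old_domain)] + new_domain

# The trap in the [:-len(suffix)] idiom: -0 is 0, so this is "".
surprise = "user@old.example"[:-len("")]
```

This is exactly the sort of edge case a single debugged method would handle once, for everybody.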

I have a few classes which annotate their instances with some magic 
attributes. Here's a snippet from a class' __getattr__ for a db schema:

     if attr.endswith('_table'):
       # *_table ==> table "*"
       nickname = attr[:-6]
       if nickname in self.table_by_nickname:

There's a little suite of "match attribute suffix, trim and do something 
specific with what's left" if statements. However, they are almost all 
of the form above, so rewriting it like this:

     if attr.endswith('_table'):
       # *_table ==> table "*"
       nickname = attr.cutsuffix('_table')
       if nickname in self.table_by_nickname:

is a small improvement. Every magic number (the "6" above) is an 
opportunity for bugs.
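
For reference, a minimal sketch of what such helpers might look like as standalone functions. The names come from this thread; the semantics (return the string unchanged when the affix is absent or empty) are my assumption, since the thread has not settled them:

```python
def cutprefix(s, prefix):
    # Assumed semantics: remove prefix once if present, else return s.
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s

def cutsuffix(s, suffix):
    # Assumed semantics: remove suffix once if present, else return s.
    # The "suffix and" guard avoids the s[:-0] == '' trap.
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

nickname = cutsuffix('users_table', '_table')
print(nickname)  # users
```

No magic "6", and the suffix is written once at each call site.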
...
...
I am beginning to worry slightly that actually there are usually more
appropriate things to do than simply cutting off affixes, and that in
providing these particular batteries we might be encouraging poor practise.
It would be really helpful if someone could go through the various use
cases presented in this thread and classify them - filename
manipulation, natural language uses, and "other".
Surprisingly for me, the big subjective win is avoiding misuse of 
lstrip/rstrip by having obvious better named alternatives for affix 
trimming.

Short summary: in my own code I find opportunities for an affix trim 
method less common than I had expected. But I still like the "might find 
it useful" argument.

I think I find "might find it useful" more compelling than many do. Let 
me explain.

I think a _well_ _defined_ battery is worth including in the kit (str 
methods) because:

- the operation is simple and well defined: people won't be confused by 
  its purpose, and when they want it there is a reliable debugged method 
  sitting there ready for use

- variations on this get written _all the time_, and writing those 
  variations using the method is more readable and more reliable

- the existing .strip battery is misused for this purpose by accident

I have in the past found myself arguing for adding little tools like 
this in agile teams, and getting a lot of resistance. The resistance 
tended to take these forms:

- YAGNI. While the tiny battery _can_ be written longhand, every time 
  that happens makes for needlessly verbose code, is an opportunity for 
  stupid bugs, and makes code whose purpose must be _deduced_ rather 
  than doing what it says on the tin

- not in this ticket: this leads to a starvation issue - the battery 
  never goes in with any ticket, and a ticket just for the battery never 
  gets chosen for a sprint

- we've already got this other battery; subtext "not needed" or "we 
  don't want 2 ways to do this", my subtext "does it worse, or does 
  something which only _looks_ like this purpose". Classic example from 
  the codebase I was in at the time was SQL parameter insertion.  
  Eventually I said "... this" and wrote the battery anyway.

My position on cut*affix is that (a) it is easy to implement (b) it can 
thus be debugged once (c) it makes code clearer when used (d) it reduces 
the likelihood of .strip() misuse.

Cheers,
Cameron Simpson <cs@cskk.id.au>