On 02Apr2019 12:23, Paul Moore <p.f.moore@gmail.com> wrote:
On Tue, 2 Apr 2019 at 12:07, Rhodri James <rhodri@kynesim.co.uk> wrote:
So far we have two slightly dubious use-cases.
1. Stripping file extensions. Personally I find that treating filenames like filenames (i.e. using os.path or (nowadays) pathlib) results in me thinking more appropriately about what I'm doing.
I'd go further and say that filename manipulation is a great example of a place where generic string functions should definitely *not* be used.
Filename manipulation on a path _component_ is generally pretty reliable (yes, one can break things by, say, inserting os.sep). I do a fair bit of filename fiddling using string functions, and these fall into 3 categories off the top of my head:

- file extensions, and here I do use splitext()

- trimming extensions (only barely a second case), and it turns out the only case I could easily find using the endswith/[:-offset] incantation would probably go just as well with splitext()

- normalising pathnames; as an example, for the home media library I routinely downcase filenames, convert whitespace into a dash, separate fields with "--" (eg episode designator vs title) and convert _ into a colon (hello Mac Finder and UI file save dialogues, a holdover compatibility mode from OS9)

None of these seem to benefit directly from having a cutprefix/cutsuffix method.

But splitext aside, I'm generally fiddling a pathname component (and usually a basename), and in that domain the general string functions are very handy and well used. So I think "filename" (basename) fiddling with str methods is actually pretty reasonable. It is _pathname_ fiddling that is hazardous, because the path separators often need to be treated specially.
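To make the component-versus-pathname distinction concrete, here is a minimal sketch of that kind of basename normalisation. The function name normalise_basename and the exact rules are illustrative assumptions, not code from the library described above; the point is only that the str methods are applied to the basename alone, while os.path handles the separators and the extension.

```python
import os.path
import re


def normalise_basename(path):
    """Normalise only the final component of path (hypothetical rules)."""
    # Let os.path deal with the separators and the extension.
    dirpath, base = os.path.split(path)
    root, ext = os.path.splitext(base)
    # Plain str/regex fiddling is safe here: root contains no separators.
    root = root.lower()
    root = re.sub(r'\s+', '-', root)   # each run of whitespace -> one dash
    root = root.replace('_', ':')      # underscore -> colon
    return os.path.join(dirpath, root + ext.lower())


print(normalise_basename('My Show_S01E02.MKV'))   # my-show:s01e02.mkv
```

Note that the directory part is passed through untouched, so an underscore or space in a parent directory name is never rewritten.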
2. Stripping prefixes and suffixes to get to root words. Python has been used for natural language work for over a decade, and I don't think I've heard any great call from linguists for the functionality. English isn't a girl who puts out like that on a first date :-) There are too many common exception cases for such a straightforward approach not to cause confusion.
Agreed, using prefix/suffix stripping on natural language is at best a "quick hack".
Yeah. I was looking at the prefix list from a related article and seeing "intra" and thinking "intractable". Hacky indeed. _Unless_ the word has already been qualified as suitable for the action. And once it is, a cutprefix method would indeed be handy.
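A small sketch of the "intra" problem, assuming a hypothetical standalone cutprefix() function (the name follows this thread's working spelling; it is not an existing str method here):

```python
def cutprefix(s, prefix):
    """Return s without a leading prefix; return s unchanged if absent."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


for word in ("intranet", "intractable"):
    print(word, "->", cutprefix(word, "intra"))
# intranet -> net
# intractable -> ctable
```

The function does exactly what it says in both cases; it is the unqualified input that makes the second result nonsense.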
3. My most common use case (not very common at that) is for stripping annoying prompts off text-based APIs. I'm happy using .startswith() and string slicing for that, though your point about the repeated use of the string to be stripped off (or worse, hard-coding its length) is well made.
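For illustration, the two longhand spellings of prompt stripping might look like this. The prompt strings and function names are made up for the example:

```python
PROMPT = "> "   # hypothetical prompt emitted by some text-based API


def strip_prompt_hardcoded(line):
    """Longhand form with the literal repeated as a hard-coded length."""
    if line.startswith("> "):
        return line[2:]   # the 2 must be kept in step with "> " by hand
    return line


def strip_prompt(line, prompt=PROMPT):
    """Longhand form which repeats the prompt instead of its length."""
    if line.startswith(prompt):
        return line[len(prompt):]
    return line


print(strip_prompt("> OK"))                         # OK
print(strip_prompt("router# show", "router# "))     # show
```

Both work, but the first silently breaks if the prompt literal changes and the slice offset does not.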
In some ways the verbosity and bugproneness is my personal use case for cutprefix/cutsuffix (however spelt):

- repeating the string is wordy and requires human eyeballing whenever I read it (to check for correctness); the same applies whenever I write such a piece of code

- personally I'm quite prone to off-by-one errors when hand writing variations on this

- a well named method is more readable and expresses intent better (the same argument holds for a standalone function, though a method is a bit better)

- the anecdotally not uncommon misuse of .strip() where .cutsuffix() would be correct

I confess being a little surprised at how few examples which could use cutsuffix I found in my own code, where I had expected it to be common. I find several bits like this:

    # parsing text which may have \r\n line endings
    if line.endswith('\r'):
        line = line[:-1]

    # parsing a UNIX network interface listing from ifconfig,
    # which varies platform to platform
    if ifname.endswith(':'):
        ifname = ifname[:-1]

Here I DO NOT want rstrip() because I want to strip only one character, rather than as many as there are. So: the optional trailing marker in some input.

But doing this for single character markers is much easier to get right than the broader case with longer suffixes, so I think this is not a very strong case.

Fiddling the domain suffix on an email address:

    if not addr.endswith(old_domain):
        raise ValueError('addr does not end in old_domain')
    addr2 = addr[:-len(old_domain)] + new_domain

which would be a good fit, _except_ for the sanity check. However, that sanity check is just one of a few preceding the change, so in fact this is a good fit.

I have a few classes which annotate their instances with some magic attributes. Here's a snippet from a class' __getattr__ for a db schema:

    if attr.endswith('_table'):
        # *_table ==> table "*"
        nickname = attr[:-6]
        if nickname in self.table_by_nickname:

There's a little suite of "match attribute suffix, trim and do something specific with what's left" if statements. However, they are almost all of the form above, so rewriting it like this:

    if attr.endswith('_table'):
        # *_table ==> table "*"
        nickname = attr.cutsuffix('_table')
        if nickname in self.table_by_nickname:

is a small improvement. Every magic number (the "6" above) is an opportunity for bugs.
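As a rough sketch of the operation being discussed, here is a standalone cutsuffix() function. This is an assumed spelling for illustration only, not an existing str method, and swap_domain() and the example.com addresses are likewise made up:

```python
def cutsuffix(s, suffix):
    """Return s without a trailing suffix; return s unchanged if absent.

    Hypothetical standalone spelling of the method under discussion.
    """
    # Guard against the empty suffix: s[:-0] is s[:0], i.e. '', not s.
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s


def swap_domain(addr, old_domain, new_domain):
    """Replace a trailing old_domain, keeping the sanity check."""
    if not addr.endswith(old_domain):
        raise ValueError('addr does not end in old_domain')
    return cutsuffix(addr, old_domain) + new_domain


# Removes exactly one trailing marker, where rstrip() removes them all.
print(repr(cutsuffix('line\r\r', '\r')))   # 'line\r'
print(repr('line\r\r'.rstrip('\r')))       # 'line'

# No magic number to keep in step with the literal.
print(cutsuffix('users_table', '_table'))  # users

# rstrip() takes a set of characters, not a suffix.
print('report.txt'.rstrip('.txt'))         # repor
print(cutsuffix('report.txt', '.txt'))     # report

print(swap_domain('cs@example.com', '@example.com', '@example.org'))
# cs@example.org
```

The empty-suffix guard is the sort of detail that is easy to miss when writing the slice by hand, which is part of the argument for debugging it once.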
I am beginning to worry slightly that actually there are usually more appropriate things to do than simply cutting off affixes, and that in providing these particular batteries we might be encouraging poor practise.
It would be really helpful if someone could go through the various use cases presented in this thread and classify them - filename manipulation, natural language uses, and "other".
Surprisingly for me, the big subjective win is avoiding misuse of lstrip/rstrip by having obvious better named alternatives for affix trimming.

Short summary: in my own code I find opportunities for an affix trim method less common than I had expected. But I still like the "might find it useful" argument.

I think I find "might find it useful" more compelling than many do. Let me explain.

I think a _well_ _defined_ battery is worth including in the kit (str methods) because:

- the operation is simple and well defined: people won't be confused by its purpose, and when they want it there is a reliable debugged method sitting there ready for use

- variations on this get written _all the time_, and writing those variations using the method is more readable and more reliable

- the existing .strip battery is misused for this purpose by accident

I have in the past found myself arguing for adding little tools like this in agile teams, and getting a lot of resistance. The resistance tended to take these forms:

- YAGNI. While the tiny battery _can_ be written longhand, every time that happens makes for needlessly verbose code, is an opportunity for stupid bugs, and makes code whose purpose must be _deduced_ rather than doing what it says on the tin

- not in this ticket: this leads to a starvation issue - the battery never goes in with any ticket, and a ticket just for the battery never gets chosen for a sprint

- we've already got this other battery; subtext "not needed" or "we don't want 2 ways to do this", my subtext "does it worse, or does something which only _looks_ like this purpose". Classic example from the codebase I was in at the time was SQL parameter insertion.

Eventually I said "... this" and wrote the battery anyway.

My position on cut*affix is that (a) it is easy to implement (b) it can thus be debugged once (c) it makes code clearer when used (d) it reduces the likelihood of .strip() misuse.

Cheers,
Cameron Simpson <cs@cskk.id.au>