[Python-ideas] New explicit methods to trim strings
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Tue Apr 2 13:55:02 EDT 2019
Rhodri James writes:
> Steven d'Aprano writes:
> > (That's over a third of this admittedly incomplete list of prefixes.)
> > I can think of at least one English suffix pair that clash: -ify, -fy.
And worse: is "tries" the third person present tense of "try" or is it
the plural of "trie"? Pure lexical manipulation can't tell you.
> You're beginning to persuade me that cut/trim methods/functions aren't a
> good idea :-)
I don't think I would go there yet (well, I started there, but...).
> So far we have two slightly dubious use-cases.
> 1. Stripping file extensions. Personally I find that treating filenames
> like filenames (i.e. using os.path or (nowadays) pathlib) results in me
> thinking more appropriately about what I'm doing.
Very much agree.
> 2. Stripping prefixes and suffixes to get to root words.
    for suffix in english_suffixes:
        root = word.cutsuffix(suffix)
is surely more flexible and accurate than a hard-coded slice, and
significantly more readable than
    for suffix in english_suffixes:
        root = word[:-len(suffix)] if word.endswith(suffix) else word
I think enough so that I might use a local def for cutsuffix if the
method doesn't exist. So my feeling is that the use case for "or"-ing
multiple suffixes is a lot weaker than it is for .endswith, but
.cutsuffix itself is plausible. That said, I wouldn't add it if it
were up to me.
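For concreteness, the local def I have in mind would be something like the following. This is only a sketch: cutprefix and cutsuffix are the names floated in this thread, not existing str methods, and the guard against an empty affix is my own choice rather than anything agreed on.

```python
def cutprefix(s, prefix):
    """Return s without a leading prefix, or s unchanged if it is absent."""
    # The truthiness check is not strictly needed here; it mirrors cutsuffix.
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


def cutsuffix(s, suffix):
    """Return s without a trailing suffix, or s unchanged if it is absent."""
    # Guard against '': s.endswith('') is True and s[:-0] is the empty string.
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s
```

The point of the helper is that the affix is named once, so neither its text nor its length is repeated at the call site.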
Among other things, for this root-extracting application
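    def extract_root(word, prefix, suffix):
        # Check the prefix at the start of the word, the suffix at the end.
        word = word[len(prefix):] if word.startswith(prefix) else word
        # Guard the '' suffix: word[:-0] would otherwise return ''.
        word = word[:-len(suffix)] if suffix and word.endswith(suffix) else word
        # perhaps try further transforms like tri -> try here?
        return word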
and a double loop
    for prefix in english_prefixes:      # includes ''
        for suffix in english_suffixes:  # includes ''
            root = extract_root(word, prefix, suffix)
(probably recursive, as well) seems most elegant.
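To see what the double loop actually produces, here is a self-contained sketch that collects every candidate instead of keeping only the last one. The affix lists are tiny made-up samples, and candidate_roots is a hypothetical helper name, not something proposed in this thread.

```python
# Tiny illustrative affix lists (assumptions, not a real morphology table).
# Each includes '' so that "no prefix" and "no suffix" are also tried.
ENGLISH_PREFIXES = ["", "un", "re"]
ENGLISH_SUFFIXES = ["", "s", "es", "ing"]


def extract_root(word, prefix, suffix):
    # Strip at most one prefix and one suffix; empty affixes change nothing.
    if prefix and word.startswith(prefix):
        word = word[len(prefix):]
    if suffix and word.endswith(suffix):
        word = word[:-len(suffix)]
    return word


def candidate_roots(word):
    # Collect every distinct result of the double loop. Deciding which
    # candidate is a real root needs a dictionary, not string slicing.
    roots = set()
    for prefix in ENGLISH_PREFIXES:
        for suffix in ENGLISH_SUFFIXES:
            roots.add(extract_root(word, prefix, suffix))
    return roots
```

With these lists, candidate_roots("tries") gives "tries", "trie" and "tri", which is exactly the ambiguity noted earlier: nothing at this level says whether to prefer "trie" or to map "tri" to "try".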
> 3. My most common use case (not very common at that) is for stripping
> annoying prompts off text-based APIs. I'm happy using
> .startswith() and string slicing for that, though your point about
> the repeated use of the string to be stripped off (or worse,
> hard-coding its length) is well made.
I don't understand this use case, specifically the opposition to
hard-coding the length. Although hard-coding the length wouldn't
occur to me in many cases, since I'd use
    import re
    # remove my bash prompt
    prompt_re = re.compile(r'^[^\u0000-\u001f\u007f]+ \d\d:\d\d\$ ')
    lines = [prompt_re.sub('', line) for line in lines]
if I understand the task correctly. Similarly, there's a lot of
regexp-removable junk in MTA logs, timestamps and DNS lookups for
example, that can't be handled with cutprefix.
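To make the contrast with a fixed prefix concrete: the clock in my prompt changes from line to line, so there is no single string to hand to cutprefix. A small sketch with made-up sample lines (the user and host names are invented for illustration):

```python
import re

# Same pattern as above: non-control characters, a space, HH:MM, then "$ ".
prompt_re = re.compile(r'^[^\u0000-\u001f\u007f]+ \d\d:\d\d\$ ')

# Invented sample lines; the prompt text differs from line to line.
lines = [
    "me@host 13:55$ ls -l",
    "me@host 13:56$ echo done",
    "total 0",
]

# Lines without a prompt pass through unchanged.
stripped = [prompt_re.sub('', line) for line in lines]
```

Here the first two lines lose their prompts and the third is left alone, whatever the timestamp happens to be.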
> I am beginning to worry slightly that actually there are usually
> more appropriate things to do than simply cutting off affixes, and
> that in providing these particular batteries we might be
> encouraging poor practise.
I don't think that's a worry, at least if restricted to the
single-affix form, because simply cutting off affixes is surely part
of most such algorithms. The harder part is remembering that you
probably have to deal with multiplicities and further transformations,
but that can't be incentivized by refusing to implement .cutsuffix.
It's an independent consideration.