[Python-ideas] New explicit methods to trim strings

Stephen J. Turnbull turnbull.stephen.fw at u.tsukuba.ac.jp
Tue Apr 2 13:55:02 EDT 2019


Rhodri James writes:

Steven D'Aprano writes:
 > > (That's over a third of this admittedly incomplete list of prefixes.)
 > > 
 > > I can think of at least one English suffix pair that clash: -ify, -fy.

And worse: is "tries" the third person present tense of "try" or is it
the plural of "trie"?  Pure lexical manipulation can't tell you.

 > You're beginning to persuade me that cut/trim methods/functions aren't a 
 > good idea :-)

I don't think I would go there yet (well, I started there, but...).

 > So far we have two slightly dubious use-cases.
 > 
 > 1. Stripping file extensions.  Personally I find that treating filenames 
 > like filenames (i.e. using os.path or (nowadays) pathlib) results in me 
 > thinking more appropriately about what I'm doing.

Very much agree.

 > 2. Stripping prefixes and suffixes to get to root words.

    for suffix in english_suffixes:
        root = word.cutsuffix(suffix)
        if lookup_in_dictionary(root):
            do_something_appropriate_with_each_root_found()

is surely more flexible and accurate than a hard-coded slice, and
significantly more readable than

    for suffix in english_suffixes:
        root = word[:-len(suffix)] if word.endswith(suffix) else word
        if lookup_in_dictionary(root):
            do_something_appropriate_with_each_root_found()

I think enough so that I might use a local def for cutsuffix if the
method doesn't exist.  So my feeling is that the use case for "or"-ing
multiple suffixes is a lot weaker than it is for .endswith, but
.cutsuffix itself is plausible.  That said, I wouldn't add it if it
were up to me.
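
For concreteness, here is a minimal sketch of what such a local def
might look like.  The names cutsuffix and cutprefix follow this
thread's proposal and are not existing str methods; the guard for an
empty suffix is my assumption about the desired behaviour.

```python
def cutsuffix(word, suffix):
    """Return word without a trailing suffix, or word unchanged."""
    # Guard against the empty suffix: word[:-0] is word[:0], i.e. ''.
    if suffix and word.endswith(suffix):
        return word[:-len(suffix)]
    return word


def cutprefix(word, prefix):
    """Return word without a leading prefix, or word unchanged."""
    # No guard needed here: word[0:] is the whole word.
    if word.startswith(prefix):
        return word[len(prefix):]
    return word
```

Returning the word unchanged when the affix is absent is what lets the
loop above stay a one-liner per suffix.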

Among other things, for this root-extracting application

    def extract_root(word, prefix, suffix):
        word = word[len(prefix):] if word.startswith(prefix) else word
        # len(word)-len(suffix), not -len(suffix): suffix may be ''
        word = word[:len(word)-len(suffix)] if word.endswith(suffix) else word
        # perhaps try further transforms like tri -> try here?
        return word

and a double loop

    for prefix in english_prefixes:        # includes ''
        for suffix in english_suffixes:    # includes ''
            root = extract_root(word, prefix, suffix)
            if lookup_in_dictionary(root):
                yield root

(probably recursive, as well) seems most elegant.
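
Putting the two pieces together as a runnable (non-recursive) sketch:
the affix lists and the set standing in for a dictionary are toy data
invented for illustration, and candidate_roots is a hypothetical name.

```python
def extract_root(word, prefix, suffix):
    """Strip one prefix and one suffix from word, where present."""
    if word.startswith(prefix):
        word = word[len(prefix):]
    if word.endswith(suffix):
        # Slice from the front so that an empty suffix is a no-op.
        word = word[:len(word) - len(suffix)]
    return word


# Toy data, invented for illustration only.
ENGLISH_PREFIXES = ["", "un", "re"]       # includes ''
ENGLISH_SUFFIXES = ["", "s", "es", "ing"]  # includes ''
DICTIONARY = {"do", "tie", "trie", "try", "untie"}


def candidate_roots(word):
    """Yield each distinct dictionary word reachable by cutting affixes."""
    seen = set()
    for prefix in ENGLISH_PREFIXES:
        for suffix in ENGLISH_SUFFIXES:
            root = extract_root(word, prefix, suffix)
            if root in DICTIONARY and root not in seen:
                seen.add(root)
                yield root
```

With this toy data, list(candidate_roots("tries")) finds "trie" but not
"try", which is the point made earlier: cutting affixes alone cannot
recover "try" without a further tri -> try transform.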

 > 3. My most common use case (not very common at that) is for stripping 
 > annoying prompts off text-based APIs.  I'm happy using
 > .startswith() and string slicing for that, though your point about
 > the repeated use of the string to be stripped off (or worse,
 > hard-coding its length) is well made.

I don't understand this use case, specifically the opposition to
hard-coding the length.  Although hard-coding the length wouldn't
occur to me in many cases, since I'd use

    import re

    # remove my bash prompt
    prompt_re = re.compile(r'^[^\u0000-\u001f\u007f]+ \d\d:\d\d\$ ')
    lines = [prompt_re.sub('', line) for line in lines]

if I understand the task correctly.  Similarly, there's a lot of
regexp-removable junk in MTA logs, timestamps and DNS lookups for
example, that can't be handled with cutprefix.
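
To illustrate the kind of junk a fixed prefix can't match, here is a
sketch that strips a syslog-style timestamp.  The line format and the
sample lines are made up for illustration; real MTA logs vary.

```python
import re

# Assumed format: "Mon DD HH:MM:SS " at the start of each line, with
# the day of the month space-padded to two characters.
timestamp_re = re.compile(r'^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d ')

# Made-up sample lines, not taken from any real log.
lines = [
    "Apr  2 13:55:02 mail mta[1234]: connect from unknown",
    "Apr 12 09:01:44 mail mta[1234]: disconnect",
    "no timestamp here",
]

# Lines that do not match are passed through unchanged.
stripped = [timestamp_re.sub('', line) for line in lines]
```

The text removed has a fixed length here but different content on every
line, so there is no single prefix string to hand to cutprefix.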

 > I am beginning to worry slightly that actually there are usually
 > more appropriate things to do than simply cutting off affixes, and
 > that in providing these particular batteries we might be
 > encouraging poor practice.

I don't think that's a worry, at least if restricted to the
single-affix form, because simply cutting off affixes is surely part
of most such algorithms.  The harder part is remembering that you
probably have to deal with multiplicities and further transformations,
but that can't be incentivized by refusing to implement .cutsuffix.
It's an independent consideration.

Steve

