[Python-ideas] More user-friendly version for string.translate()

Wed Oct 26 18:48:27 EDT 2016

On 2016-10-26 23:17, Chris Barker wrote:
> I've lost track of what (if anything) is actually being proposed here...
> so I'm going to try a quick summary:
>
>
> 1) an easy way to spell "remove all the characters other than these"
>
> I think that's a good idea. What with unicode having an enormous number
> of code points, it really does make sense to have a way to specify only
> what you want, rather than what you don't want.
>
> Back in the good old days of 1-byte chars, it wasn't hard to build up a
> full 256 element translate table -- not so much anymore. And one of the
> whole points of str.translate() is good performance.
>
>  a) a new method:
>
>    str.remove_all_but(sequence_of_chars)
>   (naming TBD)
>
>  b) a new flag in translate (kind of like the decode keywords)
>
>   str.translate(table, missing='ignore'|'remove')
>
c) pass a function that returns the replacement:

     def replace(c):
         return c.upper() if c.isalpha() else ''

     str.translate(replace)

The replacement function could be called only on distinct codepoints.
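
As a rough illustration of option (c), here is a minimal sketch built on
today's str.translate(). The helper name translate_with is hypothetical
(it is not an existing or proposed API); it assumes the function returns
a string, with '' meaning "remove". The table is built from the distinct
characters only, so the function runs once per distinct codepoint.

```python
def translate_with(s, func):
    """Apply func to each distinct character of s, then translate in one pass.

    Hypothetical helper: func takes a one-character string and returns the
    replacement string ('' removes the character).
    """
    # set(s) holds each distinct character once, so func is called once
    # per distinct code point, not once per character of s.
    table = {ord(c): func(c) for c in set(s)}
    return s.translate(table)


def replace(c):
    return c.upper() if c.isalpha() else ''


print(translate_with("Hello, World! 123", replace))  # HELLOWORLD
```

Building the table costs one pass over the string before translating, so
this only pays off when the function is expensive compared with a table
lookup.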

>
> (b) has the advantage of adding translation and removal in one fell
> swoop -- but if you only want to remove, then you have to make a
> translation table of 1:1 mappings -- not hard, but a bit annoying:
>
> table = {c:c for c in sequence_of_chars}
>
> I'm on the fence about what I personally prefer.
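
For comparison, something close to missing='remove' can already be
approximated with a dict subclass, since str.translate() looks characters
up by subscripting the table and treats None as "delete". This is only a
sketch: the class name KeepOnly is made up for illustration, and note that
the current str.translate() needs ordinals (not characters) as keys.

```python
class KeepOnly(dict):
    """Translation table that keeps the given characters and drops the rest.

    Hypothetical helper, not part of any proposal in this thread.
    """

    def __init__(self, chars):
        # 1:1 mapping for the characters to keep, keyed by ordinal.
        super().__init__((ord(c), c) for c in chars)

    def __missing__(self, key):
        # Any code point not in the table maps to None, i.e. is removed.
        return None


print("a1b2c3".translate(KeepOnly("abc")))  # abc
```

Every missing character costs a Python-level __missing__ call, so this
probably gives up much of the speed that makes str.translate() attractive.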
>
> 2) (in another thread, but similar enough) being able to pass in more
> than one string to replace:
>
> str.replace( old=seq_of_strings, new=seq_of_strings )
>
> I know I've wanted this a lot, and certainly from a performance
> perspective, it could be a nice bonus.
>
> But it overlaps a lot with str.translate, at least for
> single-character replacements, so really, why? It would only make
> sense if it supported multi-char strings:
>
> str.replace(old=("aword", "another_word"),
>             new=("something", "something else"))
>
> However: a string IS a sequence of strings, so we'd have confusion about
> that:
>
> str.replace("this", "four")
>
> Does the user want the word "this" replaced with the word "four" -- or
> do they want each character replaced?
>
> Maybe we'd need a .replace_many() method? ugh!
>
> There are also other issues with what to do with repeated / overlapping
> characters:
>
> str.replace( ("aaa", "a", "b"), ("b", "bbb", "a") )
>
> and all sorts of other complications!
>
Possible choices are:

1) Use the given order.

2) Check from the longest to the shortest.

If you're going to pick choice 2, does it have to be 2 tuples/lists? Why 
not a dict instead?
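
A minimal sketch of choice 2 with a dict, assuming a single left-to-right
pass in which the longest matching key wins at each position. The name
replace_many is hypothetical, and using the re module is just one way to
get that behaviour, not a claim about how a built-in would do it.

```python
import re


def replace_many(s, mapping):
    """Replace every key of mapping found in s with its value, in one pass.

    Hypothetical helper; keys are assumed to be non-empty strings.
    """
    if not mapping:
        return s
    # Longest keys first, so "aaa" is tried before "a" at each position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Replaced text is never rescanned, so replacements cannot cascade.
    return pattern.sub(lambda m: mapping[m.group(0)], s)


print(replace_many("aaab", {"aaa": "b", "a": "bbb", "b": "a"}))  # ba
```

With a dict there is no "given order" to fall back on for equal-length
keys, but equal-length keys cannot both match at the same position, so
the result is still well defined.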

> THAT I think could be nailed down by defining the "order of operations".
> Does it loop through the entire string for each item? Or through each
> item for each point in the string? Note that if you loop through the
> entire string for each item, you might as well have written the loop
> yourself:
>
> for old, new in zip(old_list, new_list):
>     s = s.replace(old, new)
>
> and at least if the length of the string is long-ish, and the number of
> replacements short-ish -- performance would be fine.
>
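
To make the "order of operations" point concrete, here is the loop above
wrapped in a function (the name replace_sequential is made up for this
example) and run on the overlapping example from earlier. Each pass sees
the output of the previous one, so replacements cascade and the result
depends on the order of the pairs.

```python
def replace_sequential(s, old_list, new_list):
    """Apply each (old, new) pair in turn over the whole string."""
    for old, new in zip(old_list, new_list):
        s = s.replace(old, new)
    return s


# "aaab" -> "bb" ("aaa" -> "b"), no "a" left to replace, then "bb" -> "aa".
print(replace_sequential("aaab", ("aaa", "a", "b"), ("b", "bbb", "a")))  # aa

# Same pairs, different order: every "a" becomes "bbb" first, then the
# final pass turns all ten "b"s back into "a"s.
print(replace_sequential("aaab", ("a", "aaa", "b"), ("bbb", "b", "a")))  # aaaaaaaaaa
```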
>
> *** So the question is -- is there support for these enhancements? If
> so, then it would be worth hashing out the details.
>
> But the next question is -- does anyone care enough to manage that
> process -- it'll be a lot of work!
>
> NOTE: there has also been a fair bit of discussion in this thread about
> ordinals vs characters, and unicode itself -- I don't think any of that
> resulted in any possible proposals...
>
[snip]