[Python-ideas] More user-friendly version for string.translate()

Tue Oct 25 11:15:58 EDT 2016

On 25 October 2016 at 04:37, Steven D'Aprano <steve at pearwood.info> wrote:

>> I would be happy to see a somewhat more general and user friendly
>> version of string.translate function.
>> It could work this way:
>> string.newtranslate(file_with_table, Drop=True, Dec=True)

> Mikhail, I appreciate that you have many ideas and want to share them,
> but try to think about how those ideas would work. The Python standard
> library is full of really well-designed programming interfaces. You can
> learn a lot by thinking "what existing function is this like? how does
> that existing function work?".

Hi Steven,
Thank you for the reply.
I agree the idea with the file is not good; I already agreed with that,
and it was pointed out by others too. Of course it is up to me how I
store the table. I will try to be more precise with my ideas ;) The new
str.translate() interface is indeed much more versatile and provides good
ways to define the table.

>Or it can take a mapping (usually a dict) that maps either characters or
>ordinal numbers to a new string (not just a single character, but an
>arbitrary string) or ordinal numbers.
>
>    str.maketrans({'a': 'A', 98: 66, 0x63: 0x43})

>(or None, to delete them). Note the flexibility: you don't need to

Good. But of course if I do it with big tables, I would need to parse
them from some table file anyway. Typing all the values directly in code
is not comfortable. It should also be made clear how I get the "None"
value after parsing the table from a plain format like
97:[nothing here]
(another point for my research).
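
A minimal sketch of how such a table could be parsed, assuming a
hypothetical plain-text format with one "ordinal:ordinal" pair per line,
where an empty right-hand side means "delete". The function name
parse_table and the format itself are illustrative only, not an existing
API:

```python
def parse_table(text):
    """Parse lines like '98:66' into a table usable by str.translate().

    An empty right-hand side (e.g. '97:') maps the ordinal to None,
    which str.translate() treats as "delete this character".
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition(":")
        value = value.strip()
        table[int(key)] = int(value) if value else None
    return table


# 'a' (97) is deleted, 'b' (98) becomes 'B' (66), 'c' (99) becomes 'C' (67)
table = parse_table("97:\n98:66\n99:67")
result = "abcabc".translate(table)
```

Here result is "BCBC", and characters not listed in the table would be
left untouched.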

> Could it be better? Perhaps. I've suggested that maybe translate could
> automatically call maketrans if given more than one argument. Maybe
> there's an easier way to just delete unwanted characters. Perhaps there
> could be a way to say "any character not in the translation table should
> be dropped". These are interesting questions.

So my previous thought on it was that there could be a set of such functions:

str.translate_keep(table) - this is the current translate, namely it keeps
non-defined chars untouched
str.translate_drop(table) - the same, but drops non-defined chars

Probably also a pair of functions without translation:
str.remove(chars) - removes the given chars
str.keep(chars) - removes everything except the given chars

The motivation is that those can be optimised for speed, and I suppose they
can work faster than re.sub(). The question is how common these tasks are;
I don't have any statistics on this.
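
For illustration, the four proposed functions can be approximated today
on top of the existing str.translate(). This is only a sketch: the
function names come from the proposal above and are not real str
methods, the helper class _DropMissing is hypothetical, and the table is
assumed to have ordinal keys (as returned by str.maketrans). It says
nothing about speed.

```python
class _DropMissing(dict):
    """Translation table that deletes every character it does not define."""

    def __missing__(self, key):
        return None  # None tells str.translate() to drop the character


def translate_keep(s, table):
    # Current behaviour: characters not in the table stay untouched.
    return s.translate(table)


def translate_drop(s, table):
    # Same translation, but characters not in the table are dropped.
    return s.translate(_DropMissing(table))


def remove(s, chars):
    # The third argument of str.maketrans() lists characters to delete.
    return s.translate(str.maketrans("", "", chars))


def keep(s, chars):
    # Map each wanted character to itself; everything else is dropped.
    return s.translate(_DropMissing((ord(c), ord(c)) for c in chars))


table = str.maketrans("ab", "AB")
kept = translate_keep("abc", table)      # "ABc"
dropped = translate_drop("abc", table)   # "AB"
removed = remove("hello world", "lo")    # "he wrd"
only = keep("hello world", "lo")         # "llool"
```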

>There are no 16-bit strings.
>Unicode is a 21-bit encoding, usually encoded as either fixed-width
>sequence of 4-byte code units (UTF-32) or a variable-width sequence of
>2-byte (UTF-16) or 1-byte (UTF-8) code units. But it absolutely is not a
>"16-bit string".

So in the general case they should expand to 32-bit unsigned integers, if I
understand correctly?
IIRC, Windows uses UTF-16 for filenames.
Anyway, I will not pretend I can give any ideas regarding optimising things
there. It is just that I tend to treat those translate/filter functions as
purely numeric, so I should be able to use them on any data chunk without
thinking about whether it is text or not. This implies, of course, that I
must be sure the units are expanded to a fixed byte size.
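
A small sketch of the point about unit sizes, using only built-ins. The
sample string is my own choice: one ASCII letter, one character from the
Basic Multilingual Plane, and one character outside it.

```python
import sys

text = "a\u03c0\U0001F600"  # 'a', Greek small pi, and an emoji above U+FFFF

# ord() gives the code point as a plain integer, regardless of encoding.
ordinals = [ord(ch) for ch in text]

# UTF-32 is fixed-width: 4 bytes per code point.
utf32_len = len(text.encode("utf-32-le"))
# UTF-16 is variable-width: 2 bytes, or 4 for characters above U+FFFF.
utf16_len = len(text.encode("utf-16-le"))
# UTF-8 is variable-width: 1 to 4 bytes per code point.
utf8_len = len(text.encode("utf-8"))

# The largest code point needs 21 bits, so it fits in a 32-bit integer.
max_bits = sys.maxunicode.bit_length()
```

With this input, ordinals is [97, 960, 128512], and the encoded lengths
are 12, 8 and 7 bytes for UTF-32, UTF-16 and UTF-8 respectively.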

>> but as said I don't like very much the idea and would be OK for me to
>> use numeric values only.
> I think you are very possibly the only Python programmer in the world
> who thinks that writing decimal ordinal values is more user-friendly
> than writing the actual character itself. I know I would much rather
> see $, π or ╔ than 36, 960 or 9556.

Yeah, I am strange. This, however, guarantees that in any environment you
can see and input them, and save the work in ASCII.

Mikhail