[Python-ideas] More user-friendly version for string.translate()
Stephen J. Turnbull
turnbull.stephen.fw at u.tsukuba.ac.jp
Tue Oct 25 13:10:40 EDT 2016
Mikhail V writes:
> Good. But of course if I do it with big tables, I would need to parse
> them from some table file anyway.
That is the kind of thing we can dismiss (for now) as a "SMOP" =
"simple matter of programming". You know how to do it, we know how to
do it, if it needs optimization, we can do it later. The part that
requires discussion is the API design.
> So my previous thought on it was that there could be a set of such functions:
>
> str.translate_keep(table) - this is current translate, namely keeps
> non-defined chars untouched
> str.translate_drop(table) - all the same, but dropping non-defined chars
>
> Probably also a pair of functions without translation:
> str.remove(chars) - removes given chars
> str.keep(chars) - removes all, except chars
>
> The motivation is that those can be optimised for speed, and I suppose
> they can work faster than re.sub().
Others are more expert than I, but as I understand it, Python's
function calls are expensive enough that dispatching to internal
routines based on types of arguments adds negligible overhead.
Optimization also can wait.
That said, multiple methods is a valid option for the API. Eg, Guido
generally prefers that distinctions that can't be made on type of
arguments (such as translate_keep vs translate_drop) be done by giving
different names rather than a flag argument. Do you *like* this API,
or was this motivated primarily by the possibilities you see for
optimization?
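For concreteness, here's a rough pure-Python sketch of the two variants
built on the current str.translate() (using your proposed names,
translate_keep and translate_drop, which don't exist today):

    # Rough sketch only: the proposed methods expressed as plain
    # functions on top of the existing str.translate().  Tables map
    # code points (ints) to replacement strings, as str.translate()
    # already expects.

    class _DropMissing(dict):
        # Any code point not in the table maps to None, which
        # str.translate() interprets as "delete this character".
        def __missing__(self, codepoint):
            return None

    def translate_keep(s, table):
        # Current behavior: characters absent from the table pass through.
        return s.translate(table)

    def translate_drop(s, table):
        # Proposed variant: characters absent from the table are dropped.
        return s.translate(_DropMissing(table))

    table = {ord('a'): 'A', ord('b'): 'B'}
    translate_keep("abc", table)   # -> 'ABc'
    translate_drop("abc", table)   # -> 'AB'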
> The question is how common are these tasks, I don't have any
> statistics regarding this.
Frequency is useful information, but if you don't have it, don't worry
about it.
> So in the general case they should expand to 32-bit unsigned integers, if I
> understand correctly?
No. The internal string representation is described here:
https://www.python.org/dev/peps/pep-0393/. As in the Unicode standard
itself, you should think of characters as integers. Yes, with PEP 393
you can deduce the representation of a string from its contents, but
you can't guess for individual characters in a longer string -- the
whole string has the width needed for its widest character.
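For example, str.maketrans() already normalizes character keys to code
points, so a translation table is just a mapping keyed by ints:

    # Translation tables are keyed by code point (an int), regardless
    # of how wide the string's internal (PEP 393) representation is.
    table = str.maketrans({'a': 'x', '\U0001F600': '?'})
    table                              # -> {97: 'x', 128512: '?'}
    "a\U0001F600b".translate(table)    # -> 'x?b'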
> so I should be able to use those on any data chunk without thinking about
> whether it is text or not; this of course implies that I must be sure the
> units are expanded to a fixed byte size.
The width is constant for any given string. However, I don't see at
this point that you'll need more than the functions available in
Python already, plus one or more wrappers to marshal the information
your API accepts to the data that str.translate wants. Of course
later it may be worthwhile to rewrite the wrapper in C and merge it
into the existing str.translate(), or the multiple methods you suggest
above.
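For instance, the remove()/keep() pair you propose could start out as thin
wrappers of that kind (again, your names, not existing methods; keep() is
written here as a plain filter rather than a translate call):

    # Sketch: marshal a plain set of characters into what
    # str.translate() wants, or filter directly.

    def remove(s, chars):
        # Delete every character that appears in chars.
        return s.translate({ord(c): None for c in chars})

    def keep(s, chars):
        # Delete every character that does NOT appear in chars.
        keep_set = set(chars)
        return ''.join(c for c in s if c in keep_set)

    remove("hello world", "lo")   # -> 'he wrd'
    keep("hello world", "lo")     # -> 'llool'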
> >> but as said I don't like very much the idea and would be OK for me to
> >> use numeric values only.
> Yeah I am strange. This however gives you a guarantee for any environment
> that you can see and input them and save the work in ASCII.
This is not going to be a problem if you're running Python and can
enter the program and digits. In any case, the API is going to have
to be convenient for all the people who expect that they will never
again be reduced to a hex keypad and 7-segment display.