Mikhail V writes:
Good. But of course, if I do it with big tables, I would in any case need to parse them from some table file.
That is the kind of thing we can dismiss (for now) as a "SMOP" = "simple matter of programming". You know how to do it, we know how to do it, if it needs optimization, we can do it later. The part that requires discussion is the API design.
So my previous thought was that there could be a set of functions like these:
str.translate_keep(table) - the current translate: keeps non-defined characters untouched
str.translate_drop(table) - the same, but drops non-defined characters

Probably also a pair of functions without translation:

str.remove(chars) - removes the given characters
str.keep(chars) - removes everything except the given characters
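For the sake of concreteness, here is a minimal sketch of the four proposed operations as plain functions on top of the existing str machinery (the names mirror the proposal; they are not real str methods):

```python
def translate_keep(s, table):
    # current str.translate behavior: unmapped characters pass through
    return s.translate(table)

class _DropMissing(dict):
    # str.translate deletes a character when the table returns None,
    # so a dict whose __missing__ returns None drops unmapped characters
    def __missing__(self, key):
        return None

def translate_drop(s, table):
    return s.translate(_DropMissing(table))

def remove(s, chars):
    # map each character to None, i.e. delete it
    return s.translate({ord(c): None for c in chars})

def keep(s, chars):
    # drop everything not in the given character set
    keep_set = set(chars)
    return ''.join(c for c in s if c in keep_set)
```

For example, translate_keep("abc", {ord('a'): 'A'}) gives "Abc", while translate_drop with the same table gives just "A".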
The motivation is that these can be optimized for speed, and I suppose they could run faster than re.sub().
Others are more expert than I, but as I understand it, Python's function calls are expensive enough that dispatching to internal routines based on the types of the arguments adds negligible overhead. Optimization can also wait.
That said, multiple methods is a valid option for the API. E.g., Guido generally prefers that distinctions that can't be made on the type of the arguments (such as translate_keep vs. translate_drop) be expressed by giving different names rather than by a flag argument. Do you *like* this API, or was it motivated primarily by the possibilities you see for optimization?
The question is how common these tasks are; I don't have any statistics on this.
Frequency is useful information, but if you don't have it, don't worry about it.
So in the general case they should expand to 32-bit unsigned integers, if I understand correctly?
No. The internal string representation is described here: https://www.python.org/dev/peps/pep-0393/. As in the Unicode standard itself, you should think of characters as integers. Yes, with PEP 393 you can deduce the representation of a string from its contents, but you can't guess for individual characters in a longer string -- the whole string has the width needed for its widest character.
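To illustrate the point that characters are best thought of as integers: a str.translate table is keyed by code points (the values ord() returns), and the same table works regardless of whether the string's internal PEP 393 representation uses 1, 2, or 4 bytes per character. A small self-contained example:

```python
# one string mixing characters that need 1, 2, and 4 bytes internally
s = "a\u0416\U0001F600"

# characters are just code points
assert [ord(c) for c in s] == [0x61, 0x0416, 0x1F600]

# a translate table keyed by code points applies uniformly;
# the internal representation width is invisible at this level
assert s.translate({0x61: 0x62}) == "b\u0416\U0001F600"
```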
So I should be able to use those on any data chunk without worrying whether it is text or not; this of course implies that I must be sure the units are expanded to a fixed byte size.
The width is constant for any given string. However, I don't see at this point that you'll need more than the functions available in Python already, plus one or more wrappers to marshal the information your API accepts to the data that str.translate wants. Of course later it may be worthwhile to rewrite the wrapper in C and merge it into the existing str.translate(), or the multiple methods you suggest above.
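One such wrapper could be as thin as this sketch, which marshals a "characters to delete" spec into the dict str.translate expects, with str.maketrans doing the heavy lifting (the wrapper name is hypothetical):

```python
def remove_chars(s, chars):
    # maketrans's third argument maps each character to None,
    # which str.translate interprets as "delete this character"
    return s.translate(str.maketrans('', '', chars))

# deletes every 'l' and 'o'
assert remove_chars("hello world", "lo") == "he wrd"
```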
But as I said, I don't much like the idea, and it would be OK for me to use numeric values only.
Yeah, I am strange. However, this guarantees that in any environment you can see and input them, and save the work in ASCII.
This is not going to be a problem if you're running Python and can enter the program and digits. In any case, the API is going to have to be convenient for all the people who expect that they will never again be reduced to a hex keypad and 7-segment display.