More user-friendly version for string.translate()

Hello all,

I would be happy to see a somewhat more general and user-friendly version of the string.translate() function. It could work this way:

    string.newtranslate(file_with_table, Drop=True, Dec=True)

The parameters:

1. "file_with_table": a text file with a table in the following format:

    #[In] [Out]
    97 {65}
    98 {66}
    99 {67}
    100 {}
    ...
    110 {110}

Notes: all values are decimal or hex (use the Dec parameter to switch the parsing format). As it turned out from my last discussion, the majority prefers hex notation, so I am not in the mainstream with my decimal notation here, but both should be supported. An empty [Out] value {} means that the character will be deleted.

2. "Drop=True": sets the default behaviour for those values which are NOT in the table. For Drop=True, all values not defined in the table get [out] = {} and are deleted. For Drop=False, all values not defined in the table get [out] = [in], so they remain as-is.

3. "Dec=True": parsing format, decimal or hex. I use decimal everywhere.

Further thoughts: for 8-bit strings this should be simple to implement, I think. For 16-bit there is of course the issue of memory usage for lookup tables, but the gurus could probably optimise it. E.g. at the parsing stage it is not necessary to build the lookup table for the whole 16-bit range; one could take only values up to the largest ordinal present in the table file.

About the format of the table file: I suppose many users would also want to define characters directly. I am not sure if that is really needed, but if so, additional brackets or an escape character could be used, for example:

    a {A}
    \98 {\66}
    \99 {\67}

but as said, I don't much like that idea and numeric values only would be OK for me.

So that is approximately how I see it. Feel free to share thoughts or criticise.

Mikhail

My thought on this: if you need translate(), you can probably write the code to parse a text file, and then you can use whatever format you want. This seems a very special case to build into the stdlib. -CHB On Mon, Oct 24, 2016 at 10:39 AM, Mikhail V <mikhailwas@gmail.com> wrote:
-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On Mon, Oct 24, 2016 at 10:50 AM, Ryan Birmingham <rainventions@gmail.com> wrote:
I also believe that using a text file would not be the best solution; using a dictionary,
Actually, now that you mention it -- .translate() already takes a dict, so if you want to put your translation table in a text file, you can use a dict literal to do it:

    # contents of file:
    { 32: 95,

then use it:

    s.translate(ast.literal_eval(open("trans_table.txt").read()))

Now all you need is a tiny little utility function:

    def translate_from_file(s, filename):
        return s.translate(ast.literal_eval(open(filename).read()))

:-)

-Chris
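A runnable end-to-end sketch of that recipe (the file is created with tempfile here rather than hand-edited, and the table contents are invented for the demo; ast.literal_eval is the safe way to read a dict literal, unlike eval):

```python
import ast
import os
import tempfile

# Hypothetical demo table: space -> underscore, 'a' -> 'A'
table_text = "{32: 95, 97: 65}"

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write(table_text)

def translate_from_file(s, filename):
    # Load a dict literal from the file and feed it to str.translate.
    with open(filename) as f:
        return s.translate(ast.literal_eval(f.read()))

print(translate_from_file("a cat", path))  # -> "A_cAt"
os.remove(path)
```

Note that str.translate accepts ordinal-to-ordinal mappings directly, so no maketrans call is needed once the dict is loaded.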

On 24 October 2016 at 20:02, Chris Barker <chris.barker@noaa.gov> wrote:
Yes, making a special file format is not a good option, I agree. Also, of course, it does not make sense to read it every time if translate is called in a loop with the same table. So it was merely a sketch of the behaviour. But how would you, with the current translate function, drop all characters that are not in the table? I can pass [deletechars] to the function, but this seems not very convenient to me -- very often I want to drop them *all*, excluding some particular values. This is needed, for example, for filtering out all non-standard characters from paths, etc. So in other words, there should be an option to control this behaviour. Probably I am missing something here, but I didn't find such a solution for translate(), and that is actually the main point of the proposal. It is all the same as translate(), but with this extension it can cover many more usage cases. Mikhail

On Mon, Oct 24, 2016 at 1:30 PM, Mikhail V <mikhailwas@gmail.com> wrote:
But how would you with current translate function drop all characters that are not in the table?
That is another question altogether, and one for a different list, actually. I don't know a way to do "remove every character except these", but I expect there is a way to do that efficiently with Python strings. You could probably (ab)use the codecs module, though. If there really is no way to do it, then you might have a feature worth pursuing, but be prepared with use-cases! The only use-case I've had for that sort of thing is when I want only ASCII -- but I can use the ascii codec for that :-) This for example
is needed for filtering out all non-standard characters from paths, etc.
You'd usually want to replace those with something, rather than remove them entirely, yes? -CHB

On 24 October 2016 at 21:54, Chris Barker <chris.barker@noaa.gov> wrote:
I don't know a way to do "remove every character except these", but someone I expect there is a way to do that efficiently with Python strings.
It's easy enough with the re module:
    >>> re.sub('[^0-9]', '', 'ab0c2m3g5')
    '0235'
Possibly because there are a lot of good Python builtins that let you avoid the re module when it's *not* needed, it's easy to forget it in the cases where it does pretty much exactly what you want, or can be persuaded to do so with much less difficulty than rolling your own solution (I know I'm guilty of that...). Paul
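For an arbitrary "keep only these" set, the same re.sub() trick can be wrapped in a small helper (hypothetical name keep_only; re.escape guards characters like ']' or '-' that are special inside a character class):

```python
import re

def keep_only(s, allowed):
    # Remove every character NOT in `allowed`, via a negated
    # character class built from the allowed set.
    pattern = "[^" + re.escape("".join(allowed)) + "]"
    return re.sub(pattern, "", s)

print(keep_only("ab0c2m3g5", "0123456789"))  # -> "0235"
print(keep_only("a-b]c", "]-"))              # -> "-]"
```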

On 24 October 2016 at 23:10, Paul Moore <p.f.moore@gmail.com> wrote:
Thanks, this would solve the task of course. However, in my last example (filenames) this would require: writing a function to construct the expression for "all except given" characters from my table (this could be easy, I believe, but still another task); then 1. apply translate() with my table to the string; 2. apply re.sub() to the string. I usually reach for RE when I want to find/replace words or patterns, but not to translate/filter characters directly. So since there is already an "inclusive" translate(), having an "exclusive" one is probably not a bad idea. I believe the implementation would be very similar: instead of appending the next character which is not in the table, it simply does nothing. Mikhail

There is a LOT of overhead to figuring out how to use the re module. I've always thought it has its place, but it sure seems like overkill for something this seemingly simple. If (a big if) removing "all but these" were a common use case, it would be nice to have a way to do it with string methods. This is a classic case of: put it on PyPI, and see how much interest it garners. -CHB

On 24 October 2016 at 22:54, Chris Barker <chris.barker@noaa.gov> wrote:
Just a pair of usage cases which I was facing in my practice:

1. Imagine I perform some admin tasks in a company with very different users who tend to name files as they wish. So only God knows what can be in the filenames. And I know, for example, that there can be Cyrillic besides ASCII there. So I just define a table like:

    {
    1072: 97,
    1073: 98,
    1074: 99,
    ...        # [which localizes Cyrillic into ASCII]
    97: 97,
    98: 98,
    99: 99,
    ...        # [those chars that are OK, leave them]
    }

Then I use os.walk() and os.rename() and voila! -- the file system regains its virginity in one simple script.

2. Say I have a multi-lingual file or whatever, and I want to filter out some unwanted characters; I can do it similarly.

Mikhail

Just a pair of usage cases which I was facing in my practice:
This sounds like a perfect use case for str.translate() as it is.
Filtering out is different -- but I would think that you would want replace, rather than remove. If you wanted names to all comply with a given encoding (ASCII or Latin-1, or...), then encoding/decoding (with errors='replace') would do nicely. -CHB
Mikhail

On 24 October 2016 at 18:39, Mikhail V <mikhailwas@gmail.com> wrote:
Using a text file seems very odd. But regardless, this could *easily* be published on PyPI, and then if it gained enough users be proposed for the stdlib. I don't think there's anything like sufficient value to warrant "fast-tracking" something like this direct to the stdlib. And real-world use via PyPI would very quickly establish whether the unusual "pass a file with a translation table in it" design was acceptable to users. Paul

On Mon, Oct 24, 2016 at 07:39:16PM +0200, Mikhail V wrote:
That's an interesting concept for "user friendly". Apart from functions that are actually designed to read files of a particular format, can you think of any built-in functions that take a file as argument? This is how you would use this "user friendly version of translate":

    path = '/tmp/table'  # hope no other program is using it...
    with open(path, 'w') as f:
        f.write('97 {65}\n')
        f.write('98 {66}\n')
        f.write('99 {67}\n')
    with open(path, 'r') as f:
        new_string = old_string.newtranslate(f, False, True)

Compared to the existing solution:

    new_string = old_string.translate(str.maketrans('abc', 'ABC'))

Mikhail, I appreciate that you have many ideas and want to share them, but try to think about how those ideas would work. The Python standard library is full of really well-designed programming interfaces. You can learn a lot by thinking "what existing function is this like? how does that existing function work?".

str.translate and str.maketrans already exist. Look at how maketrans builds a translation table. It can take two equal-length strings, and maps characters in one to the equivalent character in the other:

    str.maketrans('abc', 'ABC')

Or it can take a mapping (usually a dict) that maps either characters or ordinal numbers to a new string (not just a single character, but an arbitrary string) or ordinal numbers:

    str.maketrans({'a': 'A', 98: 66, 0x63: 0x43})

(or None, to delete them). Note the flexibility: you don't need to specify ahead of time whether you are giving the ordinal value as a decimal, hex, octal or binary value. Any expression that evaluates to a string or an int within the legal range is valid. That's a good programming interface.

Could it be better? Perhaps. I've suggested that maybe translate could automatically call maketrans if given more than one argument. Maybe there's an easier way to just delete unwanted characters. Perhaps there could be a way to say "any character not in the translation table should be dropped".
These are interesting questions.
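A quick runnable check of the two maketrans() call styles described above -- both build the same kind of table, and the mapping form can also delete by mapping to None:

```python
# Two equivalent ways to build a table for str.translate:
t1 = str.maketrans('abc', 'ABC')                    # two equal-length strings
t2 = str.maketrans({'a': 'A', 98: 66, 0x63: 0x43})  # mixed chars / ordinals

s = 'abcabc'
print(s.translate(t1))  # -> 'ABCABC'
print(s.translate(t2))  # -> 'ABCABC'

# Mapping a key to None deletes that character:
print('abc-abc'.translate(str.maketrans({'-': None})))  # -> 'abcabc'
```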
Further thoughts: for 8-bit strings this should be simple to implement I think.
I doubt that these new features will be added to bytes as well as strings. For 8-bit byte strings it is easy enough to generate your own translation and deletion tables -- there are only 256 values to consider.
There are no 16-bit strings. Unicode is a 21-bit encoding, usually encoded as either a fixed-width sequence of 4-byte code units (UTF-32) or a variable-width sequence of 2-byte (UTF-16) or 1-byte (UTF-8) code units. But it absolutely is not a "16-bit string". [...]
but as said I don't like very much the idea and would be OK for me to use numeric values only.
I think you are very possibly the only Python programmer in the world who thinks that writing decimal ordinal values is more user-friendly than writing the actual character itself. I know I would much rather see $, π or ╔ than 36, 960 or 9556. -- Steve

On 25 October 2016 at 04:37, Steven D'Aprano <steve@pearwood.info> wrote:
Hi Steven, thank you for the reply. I agree the idea with the file is not good; I already conceded that, and it was pointed out by others too. Of course it is up to me how I store the table. I will try to be more precise with my ideas ;) The new str.translate() interface is indeed much more versatile and provides good ways to define the table.
(or None, to delete them). Note the flexibility: you don't need to
Good. But of course if I work with big tables, I would anyway need to parse them from some table file. Typing all values directly in code is not a comfortable way. This again raises the question of how I end up with the "None" value after parsing the table from a plain format like 97: [nothing here] (another point for my research).
So my previous thought on it was that there could be a set of such functions:

    str.translate_keep(table)  - the current translate(), i.e. keeps non-defined chars untouched
    str.translate_drop(table)  - all the same, but dropping non-defined chars

Probably also a pair of functions without translation:

    str.remove(chars)  - removes given chars
    str.keep(chars)    - removes all, except chars

The motivation is that those can be optimised for speed, and I suppose they could work faster than re.sub(). The question is how common these tasks are; I don't have any statistics regarding this.
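For what it's worth, the four proposed operations can be prototyped today as plain functions. The names follow the proposal, but the bodies are only sketches: translate_drop assumes a table mapping ordinals to replacement strings, which is one of the forms str.translate already accepts.

```python
def translate_keep(s, table):
    # Current str.translate behaviour: untabled characters pass through.
    return s.translate(table)

def translate_drop(s, table):
    # Proposed variant: characters not in the table are dropped.
    return "".join(table.get(ord(c), "") for c in s)

def remove(s, chars):
    # Drop the given characters, keep everything else.
    drop = set(chars)
    return "".join(c for c in s if c not in drop)

def keep(s, chars):
    # Drop everything except the given characters.
    allowed = set(chars)
    return "".join(c for c in s if c in allowed)

table = {97: "A", 98: "B"}
print(translate_keep("abc123", table))  # -> "ABc123"
print(translate_drop("abc123", table))  # -> "AB"
print(remove("abc123", "abc"))          # -> "123"
print(keep("abc123", "abc"))            # -> "abc"
```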
So in the general case they should expand to 32-bit unsigned integers, if I understand correctly? IIRC, Windows uses UTF-16 for filenames. Anyway, I will not pretend I can give any ideas regarding optimisation there. It is just that I tend to treat those translate/filter functions as purely numeric, so I should be able to use them on any data chunk without thinking about whether it is text or not; this of course implies that I must be sure the units are expanded to a fixed byte size.
Yeah, I am strange. This however guarantees, in any environment, that you can see and input them and save the work in ASCII. Mikhail

Mikhail V writes:
Good. But of course if I do it with big tables, I would anyway need to parse them from some table file.
That is the kind of thing we can dismiss (for now) as a "SMOP" = "simple matter of programming". You know how to do it, we know how to do it, if it needs optimization, we can do it later. The part that requires discussion is the API design.
Others are more expert than I, but as I understand it, Python's function calls are expensive enough that dispatching to internal routines based on types of arguments adds negligible overhead. Optimization also can wait. That said, multiple methods is a valid option for the API. Eg, Guido generally prefers that distinctions that can't be made on type of arguments (such as translate_keep vs translate_drop) be done by giving different names rather than a flag argument. Do you *like* this API, or was this motivated primarily by the possibilities you see for optimization?
The question is how common are these tasks, I don't have any statistics regarding this.
Frequency is useful information, but if you don't have it, don't worry about it.
So in general case they should expand to 32 bit unsigned integers if I understand correctly?
No. The internal string representation is described here: https://www.python.org/dev/peps/pep-0393/. As in the Unicode standard itself, you should think of characters as integers. Yes, with PEP 393 you can deduce the representation of a string from its contents, but you can't guess for individual characters in a longer string -- the whole string has the width needed for its widest character.
The width is constant for any given string. However, I don't see at this point that you'll need more than the functions available in Python already, plus one or more wrappers to marshal the information your API accepts to the data that str.translate wants. Of course later it may be worthwhile to rewrite the wrapper in C and merge it into the existing str.translate(), or the multiple methods you suggest above.
This is not going to be a problem if you're running Python and can enter the program and digits. In any case, the API is going to have to be convenient for all the people who expect that they will never again be reduced to a hex keypad and 7-segment display.

On 25 October 2016 at 19:10, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Certainly I like the look of distinct functions more. It allows me to parse the code visually, so e.g. for str.remove() I would not need to look in the docs to understand what the function does. It has its downside of course, since new definitions can accidentally be similar to current ones; the more names, the higher the probability that no good names are left. Speed is not so important for the majority of cases, at least for my current tasks. However, if I need to process very large texts (and it seems I will), speed will become more important.
Just in some cases I need to convert them to numpy arrays back and forth, so this unicode vanity worries me a bit. But I cannot clearly explain why exactly I need this.
Here I will dare to make a lyrical digression again. It could have made the impression that I am stuck in the nineties or something. But that is not the case. In the nineties I used the PC mostly to play Duke Nukem (yeah, big times!). And all the more, I hadn't any idea then what efficiency of information representation and readability were. Now I kind of realize it. So I am just not one who believes in these maximalist "we need over 9000 glyphs" talks. And, a somewhat prophetic view on this: with the coming of the cyber era this will all be flushed away so fast that all these diligences around Unicode could look funny actually. And a hex keypad will not sound "retro" but "brand new". In other words: I feel really strongly that nothing besides standard characters must appear in code sources. If one wants to process Unicode, then parse it as resources. So please, at least out of respect for the rationally minded, don't make code look like a Christmas tree. BTW, I use VIM to code actually, so I will not see them in my code anyway. Mikhail

Mikhail V writes:
OK, as I said, you're in accord with Guido on that. His rationale is somewhat different, but that's OK.
Just in some cases I need to convert them to numpy arrays back and forth, so this unicode vanity worries me a bit.
I think you're borrowing trouble you actually don't have. Either way, the rest of the world *needs* Unicode to do their work, and it's not going to go away. On the positive side, turning a string into a list of codepoints is trivial: [ord(c) for c in string]
So I am just not the one who believes in these maximalistical "we need over 9000 glyphs" talks.
But you don't need to believe in it. What you do need to believe is that the rest of us believe that we need the union of our character sets as a single, universal character set. As it happens, although there are differences of opinion over how to handle Unicode in Python, there is consensus that Python does have to handle Unicode flexibly, effectively and efficiently. Believe me, it *is* a consensus. If you insist on bucking it, you'll have to do it pretty much alone, perhaps even maintaining your own fork of Python.

On 26 October 2016 at 20:58, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
All OK now?
Not really. I tried with a simple example:

    intab = "ae"
    outtab = "XM"
    table = string.maketrans(intab, outtab)
    collections.defaultdict(lambda: None, **table)

and this gives me:

    TypeError: type object argument after ** must be a mapping, not str

But probably I misunderstood the idea. Anyway, this code does not make much sense to me; I would never in my life understand what is meant here. And in my not so big, but not so small, Python experience I have *never* had an occasion to use collections or lambda.
I was merely talking about syntax and source-file standards, not about Unicode strings. No doubt one needs some way to store different glyph sets. So I was saying that if one defines a syntax and has good intentions for readability in mind, there is not much rationale for adapting the syntax to the current "hybrid" system: the 7-bit and/or multibyte paradigm. Again, this is a discussion going too far ahead, but one should probably not look too far ahead on those. The situation is not so good in the sense that most standard software is attached to this strange paradigm (even software which does not have anything to do with multi-lingual typography). So IMO something has gone wrong with those standard characters.
As for me, I would take the path of developing my own IDE which will enable typographic-quality rendering and of course all useful glyphs, such as curly quotes, bullets, etc., all of which is fundamental to any possible improvement of the cognitive qualities of code. And I'll stay within 8-bit boundaries, that's for sure. So if Python takes the path of "unicode" code input (e.g. for some punctuation characters), this would only add a minor issue for generating valid Python source files in that case. Mikhail

I've lost track of what (if anything) is actually being proposed here... so I'm going to try a quick summary:

1) An easy way to spell "remove all the characters other than these".

I think that's a good idea. With Unicode having an enormous number of code points, it really does make sense to have a way to specify only what you want, rather than what you don't want. Back in the good old days of 1-byte chars, it wasn't hard to build up a full 256-element translate table -- not so much anymore. And one of the whole points of str.translate() is good performance. Options:

a) a new method: str.remove_all_but(sequence_of_chars) (naming TBD)

b) a new flag in translate (kind of like the decode keywords): str.translate(table, missing='ignore'|'remove')

(b) has the advantage of adding translation and removal in one fell swoop -- but if you only want to remove, then you have to make a translation table of 1:1 mappings. Not hard, but annoying:

    table = {c: c for c in sequence_of_chars}

I'm on the fence about what I personally prefer.

2) (In another thread, but similar enough) being able to pass in more than one string to replace:

    str.replace(old=seq_of_strings, new=seq_of_strings)

I know I've wanted this a lot, and certainly from a performance perspective it could be a nice bonus. But it overlaps a lot with str.translate -- at least for single-character replacements -- so really, why? It would only make sense if it supported multi-char strings:

    str.replace(old=("aword", "another_word"), new=("something", "something else"))

However, a string IS a sequence of strings, so we'd have confusion about that:

    str.replace("this", "four")

Does the user want the word "this" replaced with the word "four" -- or do they want each character replaced? Maybe we'd need a .replace_many() method? Ugh!

There are also other issues with what to do with repeated / overlapping characters:

    str.replace(("aaa", "a", "b"), ("b", "bbb", "a"))

and all sorts of other complications!
THAT, I think, could be nailed down by defining the "order of operations". Does it loop through the entire string for each item, or through each item for each point in the string? Note that if you loop through the entire string for each item, you might as well have written the loop yourself:

    for old, new in zip(old_list, new_list):
        s = s.replace(old, new)

and at least if the length of the string is long-ish, and the number of replacements short-ish, performance would be fine.

***

So the question is: is there support for these enhancements? If so, it would be worth hashing out the details. But the next question is: does anyone care enough to manage that process? It'll be a lot of work!

NOTE: there has also been a fair bit of discussion in this thread about ordinals vs characters, and Unicode itself -- I don't think any of that resulted in any possible proposals...

-CHB On Wed, Oct 26, 2016 at 2:48 PM, Mikhail V <mikhailwas@gmail.com> wrote:
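A minimal sketch of that loop as a function (hypothetical name replace_many), illustrating why the order of operations matters when the pairs overlap:

```python
def replace_many(s, olds, news):
    # Naive sequential multi-replace: each pair is applied over the
    # whole string in order, so earlier replacements are visible to
    # later ones -- the "order of operations" question above.
    for old, new in zip(olds, news):
        s = s.replace(old, new)
    return s

print(replace_many("a word here", ["word", "here"], ["thing", "now"]))
# -> "a thing now"

# The pitfall with overlapping pairs: "a" -> "b" runs first, so the
# second pair then rewrites its output as well.
print(replace_many("ab", ["a", "b"], ["b", "a"]))  # -> "aa", not "ba"
```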

On 2016-10-26 23:17, Chris Barker wrote:
c) pass a function that returns the replacement:

    def replace(c):
        return c.upper() if c.isalpha() else ''

    str.translate(replace)

The replacement function could be called only on distinct codepoints.
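This callable-based variant can be emulated today with a small wrapper (hypothetical helper, not an existing API) that calls the function once per distinct codepoint and hands an ordinary table to str.translate():

```python
def translate_with(s, func):
    # Build a translate table by calling `func` once per distinct
    # character; str.translate then does the per-character work.
    table = {ord(c): func(c) for c in set(s)}
    return s.translate(table)

def upper_or_drop(c):
    # Same shape as the quoted replace(): uppercase letters,
    # delete everything else (empty replacement string).
    return c.upper() if c.isalpha() else ''

print(translate_with("ab1c2", upper_or_drop))  # -> "ABC"
```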
Possible choices are: 1) Use the given order. 2) Check from the longest to the shortest. If you're going to pick choice 2, does it have to be 2 tuples/lists? Why not a dict instead?
[snip]

On Wed, Oct 26, 2016 at 3:48 PM, MRAB <python@mrabarnett.plus.com> wrote:
Then we have a string.translate() that accepts a table of string replacements, rather than individual character replacements -- maybe a good idea! -CHB

On 27 October 2016 at 00:17, Chris Barker <chris.barker@noaa.gov> wrote:
Exactly that is the proposal. And for the same exact reason that you point out, I also can't say which would be better. It would indeed be quite strange, syntactically, if I just want to remove "all except" and must call translate(). So ideally both should exist, I think. Mikhail

On Wed, Oct 26, 2016 at 5:32 PM, Mikhail V <mikhailwas@gmail.com> wrote:
That kind of violates OWTDI, though. Probably one's enough. And in fact, with the use-cases I can think of, and the one you mentioned, they are really two steps: there are the characters you want to translate, and the ones you want to keep -- but the ones you want to keep are a superset of the ones you want to translate. So if we added the "remove" option to .translate(), then you would need to add all the "keep" characters to your translate table. I'm thinking they really are different operations; give them a different method. -CHB

On 10/26/2016 6:17 PM, Chris Barker wrote:
In other words, 'only keep these'. We already have easy ways to create filtered strings.
I expect the first to be a bit faster. Either can be wrapped in a keep() function. If one has a translation dictionary d, use it twice in the genexp.
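The "use the dict twice in the genexp" idiom, spelled out as a function (hypothetical name; assumes d maps characters to replacement strings) -- it translates tabled characters and drops untabled ones in a single pass:

```python
def translate_and_keep(s, d):
    # `d` appears twice: once as the filter (keep only tabled chars)
    # and once as the mapping (translate what survives).
    return "".join(d[c] for c in s if c in d)

d = {"a": "A", "b": "B", "c": "C"}
print(translate_and_keep("kjsabcxyzabc", d))  # -> "ABCABC"
```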
-- Terry Jan Reedy

On Fri, Oct 28, 2016 at 7:28 AM, Terry Reedy <tjreedy@udel.edu> wrote:
s = 'kjskljkxcvnalsfjaweirKJZknzsnlkjsvnskjszsdscccjasfdjf' s2 = ''.join(c for c in s if c in set('abc'))
Pretty slick -- but any hope of it being as fast as a C-implemented method? For example, with a 1000-char string:

    In [59]: %timeit string.translate(table)
    100000 loops, best of 3: 3.62 µs per loop

    In [60]: %timeit ''.join(c for c in string if c in set(letters))
    1000 loops, best of 3: 1.14 ms per loop

So the translate() method is about 300 times faster in this case. (And it used a defaultdict with a None factory, which is probably a bit slower than a pure C implementation might be.)

I've always figured that Python's rich string methods provide two things:

1) a single method call to do common things
2) nice, fast, pure-C performance

So I think a "keep these" method would help with both of these goals. -CHB

Chris Barker writes:
pretty slick -- but any hope of it being as fast as a C implemented method?
I would expect not in CPython, but if "fast" matters, why are you using CPython rather than PyPy or Cython? If it matters *that* much, you can afford to write your own C implementation. But I doubt that fast matters "that much" often enough to be worth maintaining yet another string method in Python. Byte-shoveling servers might want it for bytes, though.
Sure, but the translate method already gives you that, and a lot more. Note that when you're talking about working with Unicode characters, no natural language activity I can imagine (not even translating Buddhist texts, which involves a couple of Indian scripts as well as Han ideographs) uses more than a fraction of defined characters. So really translate with defaultdict is a specialized loop that marries an algorithmic body (which could do things like look up the original script or other character properties to decide on the replacement for the generic case) with a (usually "small") table of exceptions. That seems like inspired design to me.

On Tue, Nov 1, 2016 at 12:15 AM, Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
oh come on!
If it matters *that* much, you can afford to write your own C implementation.
This is about a possible addition to the stdlib -- me writing my own C implementation has nothing to do with it.
This could be said about every string method in Python -- I understand that every addition is more code to maintain. But somehow we are adding all kinds of stuff, like yet another string formatting method, and talking about null-coalescing operators and who knows what -- those are all a MUCH larger burden, not just for maintaining the interpreter, but for everyone using Python, who has more to remember and understand. On the other hand, powerful and performant string methods are a major plus for Python -- a good reason to use it over Perl :-) So a new one that provides, as I wrote before:
1) single method call to do a common thing
2) nice fast, pure C performance
would fit right into Python, and indeed would be a similar implementation to existing methods -- so the maintenance burden would be a small addition (i.e. if the internal representation for strings changed, all those methods would need revisiting and similar changes). So the only key question is: is this a common enough use case?
Yes, but only with the fairly esoteric use of defaultdict. Which brings me back to the above: 1) a single method call to do a common thing. The nice thing about a single method call is discoverability -- no newbie is going to figure out the .translate + defaultdict approach.
which is why you may want to remove all the others :-) So really translate with defaultdict is a specialized loop that
Indeed -- .translate() itself is remarkably flexible -- you could even pass in a custom class that does all sorts of logic. And adding the defaultdict is an easy way to add a useful feature. But again: advanced usage, and not very discoverable. Maybe that means we need some more docs and/or recipes instead. Anyway, I joined this thread to clarify what might be on the table -- but while I think it's a good idea, I don't have the bandwidth to move it through the process -- so unless someone steps up who does, we're done. -CHB

On 27 October 2016 at 00:17, Chris Barker <chris.barker@noaa.gov> wrote:
Actually, even with ASCII (read: for Python 2.7) I would be happy to have such a function: say I just want to keep only digits, so I write:

    digits = "0123456789"
    newstring = somestring.keep(digits)

Although I can do it another way, this would be a much simpler and clearer way to do it. And I suppose it is quite a common task, not only for me. Currently 99% of my programs are in Python 2.7, and I started to use Python 3 only for tasks where I want to process Unicode strings (ironically, only to get rid of Unicode). Mikhail

On Wed, Nov 2, 2016 at 12:02 PM, Mikhail V <mikhailwas@gmail.com> wrote:
Well, with ASCII it's not too hard to make a translation table:

    digits = "0123456789"
    table = [(o if chr(o) in digits else None) for o in range(256)]
    s = "some stuff and some 456 23 numbers 888"
    s.translate(table)
    '45623888'

but then there is the defaultdict way:

    s.translate(defaultdict(lambda: None, {ord(c): c for c in digits}.items()))
    '45623888'

Wasn't that easy? Granted, if you need to do this, you'd wrap it in a function like Chris A. suggested. But this really isn't easy or discoverable -- it took me a fair bit of fiddling to get right, and I knew I was looking for a defaultdict implementation. Also:

    In [43]: table
    Out[43]: defaultdict(<function __main__.<lambda>>,
             {48: '0', 49: '1', 50: '2', 51: '3', 52: '4',
              53: '5', 54: '6', 55: '7', 56: '8', 57: '9'})

    In [44]: s.translate(table)
    Out[44]: '45623888'

    In [45]: table
    Out[45]: defaultdict(<function __main__.<lambda>>,
             {32: None, 48: '0', 49: '1', 50: '2', 51: '3', 52: '4',
              53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 97: None,
              98: None, 100: None, 101: None, 102: None, 109: None,
              110: None, 111: None, 114: None, 115: None, 116: None,
              117: None})

defaultdict puts an entry in for every ordinal checked -- this could get big. Granted, probably not a big deal with modern computer memory, but still... It might even be worth making a NoneDict for this:

    class NoneDict(dict):
        """
        Dictionary implementation that always returns None when a key
        is not in the dict, rather than raising a KeyError.
        """
        def __getitem__(self, key):
            try:
                val = dict.__getitem__(self, key)
            except KeyError:
                val = None
            return val

(See enclosed -- it works fine with translate.) (OK, that was fun, but no, not really that useful.) Despite I can do it other way, this would be much simpler and clearer
way to do it. And I suppose it is quite common task not only for me.
That's the key question -- is this a common task? If so, then while there are ways to do it, they're not easy nor discoverable. And while some of the guiding principles of this list are: "not every two line function needs to be in the standard lib" and "put it up on PyPI, and see if a lot of people find it useful" -- it's actually kind of silly to put a single function up as a PyPI package -- and I doubt many people will find it if you did. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
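For completeness, the defaultdict approach above wrapped up as a function (a sketch only; the name keep_only is made up here, not a proposed API):

```python
from collections import defaultdict

def keep_only(s, chars):
    # any ordinal not in the table maps to None, and
    # str.translate deletes characters that map to None
    table = defaultdict(lambda: None, {ord(c): c for c in chars})
    return s.translate(table)

s = "some stuff and some 456 23 numbers 888"
print(keep_only(s, "0123456789"))  # 45623888
```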

On Thu, Oct 27, 2016 at 8:48 AM, Mikhail V <mikhailwas@gmail.com> wrote:
You're 99% of the way to understanding it. Try the exercise again in Python 3. You don't have string.maketrans (which creates a 256-byte translation mapping) - instead, you use a dictionary. ChrisA

return string.translate(collections.defaultdict(lambda: None, table))
Nice! I forgot about defaultdict -- so this just needs a recipe somewhere -- maybe even in the docs for str.translate. BTW, great use case for defaultdict -- I had been wondering what the point was, given that a regular dict has .setdefault -CHB

On Tue, Oct 25, 2016 at 05:15:58PM +0200, Mikhail V wrote: [...]
Why not? What is the difference between typing

123: 456
124: 457
125: 458
# two hundred more lines

in a "table.txt" file, and typing:

{
123: 456,
124: 457,
125: 458,
# two hundred more lines
}

in a "table.py" file? The difference is insignificant. And the Python version can be cleaned up:

for i in range(123, 333):
    table[i] = 456 - 123 + i

Not all data should be written as code, especially if you expect unskilled users to edit it, but generating data directly in code is a very powerful technique, and the strict syntax of the programming language helps prevent some errors. [...]
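The generate-the-table-in-code point runs as written; here it is as a complete snippet (the range and offsets are Steven's illustrative values, not a real mapping):

```python
# build a 210-entry translation table in three lines instead of
# hand-typing that many "in: out" pairs in a text file
table = {}
for i in range(123, 333):
    table[i] = 456 - 123 + i  # 123 -> 456, 124 -> 457, ...

print(table[125])  # 458
```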
Motivation is that those can be optimised for speed
That's not a motivation. Why are you talking about "optimizing for speed" functions that we have not yet established are needed? That reminds me of a story I once heard of somebody who was driving across the desert in the US once. One of his passengers noticed the highway signs and said "Wait, aren't we going the wrong way?" The driver replied "Who cares, we're making fantastic time!" Optimizing a function you don't need is not an optimization. It is a waste of time. -- Steve

On Wed, Oct 26, 2016 at 04:29:13AM +0200, Mikhail V wrote:
I need translate() which drops non-defined chars. Please :) No optimisation, no new syntax. deal?
I still wonder whether this might be worth introducing as a new string method, or an option to translate. But the earliest that will happen is Python 3.7, so in the meantime, something like this should be enough:

# untested
keep = "abcdßαβπд∞"
text = "..."
# Find all the characters in text that are not in keep:
delchars = set(text) - set(keep)
delchars = ''.join(delchars)
text = text.translate(str.maketrans("", "", delchars))

-- Steve
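The snippet above is marked untested, but the idea does work; here it is with an ASCII-only keep-set so the result is easy to check:

```python
keep = "abc0123456789"
text = "abc!? def 42"
# characters present in text but not in keep get deleted;
# the third maketrans argument is the deletion set
delchars = ''.join(set(text) - set(keep))
cleaned = text.translate(str.maketrans("", "", delchars))
print(cleaned)  # abc42
```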

Mikhail V writes:
I need translate() which drops non-defined chars. Please :)
import collections

def translate_or_drop(string, table):
    """
    string: a string to process
    table: a dict as accepted by str.translate
    """
    # pass the table positionally: ** unpacking would require string keys
    return string.translate(collections.defaultdict(lambda: None, table))

All OK now?
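A usage sketch of the function above (with the table passed positionally, since ** unpacking requires string keys while translate tables use integer ordinals):

```python
import collections

def translate_or_drop(string, table):
    # defaultdict returns None for any ordinal missing from table,
    # and str.translate deletes characters that map to None
    return string.translate(collections.defaultdict(lambda: None, table))

digits = {ord(c): c for c in "0123456789"}
print(translate_or_drop("ab0c2m3g5", digits))  # 0235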


On Mon, Oct 24, 2016 at 10:50 AM, Ryan Birmingham <rainventions@gmail.com> wrote:
I also believe that using a text file would not be the best solution; using a dictionary,
actually, now that you mention it -- .translate() already takes a dict, so if you want to put your translation table in a text file, you can use a dict literal to do it: # contents of file:
{ 32: 95,
then use it:

s.translate(ast.literal_eval(open("trans_table.txt").read()))

now all you need is a tiny little utility function:

def translate_from_file(s, filename):
    return s.translate(ast.literal_eval(open(filename).read()))

:-) -Chris

On 24 October 2016 at 20:02, Chris Barker <chris.barker@noaa.gov> wrote:
Yes, making a special file format is not a good option, I agree. Also of course it does not make sense to read it every time if translate is called in a loop with the same table. So it was merely a sketch of behaviour. But how would you, with the current translate function, drop all characters that are not in the table? So I can pass [deletechars] to the function, but this seems not very convenient to me -- very often I want to drop them *all*, excluding some particular values. This for example is needed for filtering out all non-standard characters from paths, etc. So in other words, there should be an option to control this behavior. Probably I am missing something here, but I didn't find such a solution for translate() and that is the main point of the proposal actually. It is all the same as translate() but with this extension it can cover many more usage cases. Mikhail

On Mon, Oct 24, 2016 at 1:30 PM, Mikhail V <mikhailwas@gmail.com> wrote:
But how would you with current translate function drop all characters that are not in the table?
that is another question altogether, and one for a different list, actually. I don't know a way to do "remove every character except these", but I expect there is a way to do that efficiently with Python strings. you could probably (ab)use the codecs module, though. If there really is no way to do it, then you might have a feature worth pursuing, but be prepared with use-cases! The only use-case I've had for that sort of thing is when I want only ASCII -- but I can use the ascii codec for that :-) This for example
is needed for filtering out all non-standard characters from paths, etc.
You'd usually want to replace those with something, rather than remove them entirely, yes? -CHB

On 24 October 2016 at 21:54, Chris Barker <chris.barker@noaa.gov> wrote:
I don't know a way to do "remove every character except these", but someone I expect there is a way to do that efficiently with Python strings.
It's easy enough with the re module:
re.sub('[^0-9]', '', 'ab0c2m3g5') '0235'
Possibly because there's a lot of good Python builtins that allow you to avoid the re module when *not* needed, it's easy to forget it in the cases where it does pretty much exactly what you want, or can be persuaded to do so with much less difficulty than rolling your own solution (I know I'm guilty of that...). Paul
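Paul's [^0-9] pattern generalizes to any keep-set; re.escape guards against characters that are special inside a character class (keep_chars is an illustrative name, not a proposed API):

```python
import re

def keep_chars(s, keep):
    # negated character class: remove everything NOT in keep
    return re.sub('[^%s]' % re.escape(keep), '', s)

print(keep_chars('ab0c2m3g5', '0123456789'))  # 0235
```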

On 24 October 2016 at 23:10, Paul Moore <p.f.moore@gmail.com> wrote:
Thanks, this would solve the task of course. However for example in the case in my last example (filenames) this would require: - Write a function to construct the expression for "all except given" characters from my table. This could be easy I believe, but still another task. Then: 1. Apply translate() with my table to the string. 2. Apply re.sub() to the string. I usually start using RE when I want to find/replace words or patterns, but not translate/filter the characters directly. So since there is already an "inclusive" translate() then probably having an "exclusive" one is not a bad idea. I believe it is something very similar in implementation, so instead of appending next character which is not in the table, it simply does nothing. Mikhail

There is a LOT of overhead to figuring out how to use the re module. I've always thought it had its place, but it sure seems like overkill for something this seemingly simple. If (a big if) removing "all but these" was a common use case, it would be nice to have a way to do it with string methods. This is a classic case of: Put it on PyPI, and see how much interest it garners. -CHB

On 24 October 2016 at 22:54, Chris Barker <chris.barker@noaa.gov> wrote:
Just a pair of usage cases which I was facing in my practice: 1. Imagine I perform some admin tasks in a company with very different users who also tend to name the files as they wish. So only God knows what can be there in filenames. And I know for example that there can be Cyrillic besides ASCII there. So I just define a table like:

{
1072: 97
1073: 98
1074: 99
... [which localizes Cyrillic into ASCII] ...
97: 97
98: 98
99: 99
... [those chars that are OK, leave them]
}

Then I use os.walk() and os.rename() and voila! the file system regains its virginity in one simple script. 2. Say I have a multi-lingual file or whatever, I want to filter out some unwanted characters so I can do it similarly. Mikhail
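A minimal sketch of the renaming script described above. The three-entry table and the bottom-up walk are assumptions for illustration; a real run would use a full Cyrillic-to-ASCII table:

```python
import os

# illustrative table: Cyrillic а, б, в -> Latin a, b, c;
# characters not in the table pass through unchanged
TABLE = {0x0430: 'a', 0x0431: 'b', 0x0432: 'c'}

def sanitize(name):
    return name.translate(TABLE)

def sanitize_tree(root):
    # walk bottom-up so renaming a directory doesn't break pending paths
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            new = sanitize(name)
            if new != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, new))

print(sanitize('\u0431\u0430r.txt'))  # bar.txt
```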

Just a pair of usage cases which I was facing in my practice:
This sounds like a perfect use case for str.translate() as it is.
Filtering out is different-- but I would think that you would want replace, rather than remove. If you wanted names to all comply with a given encoding (ascii or Latin-1, or...), then encoding/decoding (with error set to replace) would do nicely. -CHB
Mikhail
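The encode/decode round-trip Chris mentions, sketched: errors='replace' substitutes '?' for anything outside the target encoding, so nothing is silently dropped.

```python
name = 'fil\u00e9_\u0431'  # "filé_б": one Latin-1 char, one Cyrillic char
ascii_safe = name.encode('ascii', errors='replace').decode('ascii')
print(ascii_safe)  # fil?_?
```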

On 24 October 2016 at 18:39, Mikhail V <mikhailwas@gmail.com> wrote:
Using a text file seems very odd. But regardless, this could *easily* be published on PyPI, and then if it gained enough users be proposed for the stdlib. I don't think there's anything like sufficient value to warrant "fast-tracking" something like this direct to the stdlib. And real-world use via PyPI would very quickly establish whether the unusual "pass a file with a translation table in it" design was acceptable to users. Paul

On Mon, Oct 24, 2016 at 07:39:16PM +0200, Mikhail V wrote:
That's an interesting concept for "user friendly". Apart from functions that are actually designed to read files of a particular format, can you think of any built-in functions that take a file as argument? This is how you would use this "user friendly version of translate":

path = '/tmp/table'  # hope no other program is using it...
with open(path, 'w') as f:
    f.write('97 {65}\n')
    f.write('98 {66}\n')
    f.write('99 {67}\n')

with open(path, 'r') as f:
    new_string = old_string.newtranslate(f, False, True)

Compared to the existing solution:

new_string = old_string.translate(str.maketrans('abc', 'ABC'))

Mikhail, I appreciate that you have many ideas and want to share them, but try to think about how those ideas would work. The Python standard library is full of really well-designed programming interfaces. You can learn a lot by thinking "what existing function is this like? how does that existing function work?". str.translate and str.maketrans already exist. Look at how maketrans builds a translation table: it can take two equal length strings, and map characters in one to the equivalent character in the other:

str.maketrans('abc', 'ABC')

Or it can take a mapping (usually a dict) that maps either characters or ordinal numbers to a new string (not just a single character, but an arbitrary string) or ordinal numbers:

str.maketrans({'a': 'A', 98: 66, 0x63: 0x43})

(or None, to delete them). Note the flexibility: you don't need to specify ahead of time whether you are specifying the ordinal value as a decimal, hex, octal or binary value. Any expression that evaluates to a string or an int within the legal range is valid. That's a good programming interface. Could it be better? Perhaps. I've suggested that maybe translate could automatically call maketrans if given more than one argument. Maybe there's an easier way to just delete unwanted characters. Perhaps there could be a way to say "any character not in the translation table should be dropped".
These are interesting questions.
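The two maketrans call styles described above, side by side (the second also shows the None-deletes case):

```python
# two equal-length strings: character-to-character mapping
t1 = str.maketrans('abc', 'ABC')
# a mapping: keys are characters or ordinals, values are
# strings, ordinals, or None (None deletes the character)
t2 = str.maketrans({'a': 'A', 98: 66, 0x63: None})

print('abc'.translate(t1))  # ABC
print('abc'.translate(t2))  # AB
```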
Further thoughts: for 8-bit strings this should be simple to implement I think.
I doubt that these new features will be added to bytes as well as strings. For 8-bits byte strings, it is easy enough to generate your own translation and deletion tables -- there are only 256 values to consider.
There are no 16-bit strings. Unicode is a 21-bit encoding, usually encoded as either fixed-width sequence of 4-byte code units (UTF-32) or a variable-width sequence of 2-byte (UTF-16) or 1-byte (UTF-8) code units. But it absolutely is not a "16-bit string". [...]
but as said I don't like very much the idea and would be OK for me to use numeric values only.
I think you are very possibly the only Python programmer in the world who thinks that writing decimal ordinal values is more user-friendly than writing the actual character itself. I know I would much rather see $, π or ╔ than 36, 960 or 9556. -- Steve

On 25 October 2016 at 04:37, Steven D'Aprano <steve@pearwood.info> wrote:
Hi Steven, Thank you for the reply. I agree the idea with the file is not good, I already agreed with that and it was pointed out by others too. Of course it is up to me how I store the table. I will try to be more precise with my ideas ;) The new str.translate() interface is indeed much more versatile and provides good ways to define the table.
(or None, to delete them). Note the flexibility: you don't need to
Good. But of course if I do it with big tables, I would anyway need to parse them from some table file. Typing all values directly in code is not a comfortable way. This again should make it clear how I obtain the "None" value after parsing the table from a plain format like 97:[nothing here] (another point for my research).
So my previous thought on it was, that there could be a set of such functions:

str.translate_keep(table) - this is current translate, namely keeps non-defined chars untouched
str.translate_drop(table) - all the same, but dropping non-defined chars

Probably also a pair of functions without translation:

str.remove(chars) - removes given chars
str.keep(chars) - removes all, except chars

Motivation is that those can be optimised for speed and I suppose those can work faster than re.sub(). The question is how common are these tasks, I don't have any statistics regarding this.
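What the four proposed names could look like as plain functions layered on today's str.translate (the names come from the proposal above; they are not stdlib methods):

```python
from collections import defaultdict

def translate_keep(s, table):
    # current behaviour: characters missing from table pass through
    return s.translate(table)

def translate_drop(s, table):
    # characters missing from table are deleted
    return s.translate(defaultdict(lambda: None, table))

def remove(s, chars):
    return s.translate({ord(c): None for c in chars})

def keep(s, chars):
    return translate_drop(s, {ord(c): c for c in chars})

s = 'ab0c2m3g5'
print(keep(s, '0123456789'))  # 0235
print(remove(s, 'abc'))       # 02m3g5
```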
So in general case they should expand to 32 bit unsigned integers if I understand correctly? IIRC, Windows uses UTF16 for filenames. Anyway I will not pretend I can give any ideas regarding optimising thing there. It is just that I tend to treat those translate/filter functions as purely numeric, so I should be able to use those on any data chunk without thinking, if it is a text or not, this implies of course I must be sure that units are expanded to fixed bytesize.
Yeah, I am strange. This however gives you a guarantee for any environment that you can see and input them and save the work in ASCII. Mikhail

Mikhail V writes:
Good. But of course if I do it with big tables, I would anyway need to parse them from some table file.
That is the kind of thing we can dismiss (for now) as a "SMOP" = "simple matter of programming". You know how to do it, we know how to do it, if it needs optimization, we can do it later. The part that requires discussion is the API design.
Others are more expert than I, but as I understand it, Python's function calls are expensive enough that dispatching to internal routines based on types of arguments adds negligible overhead. Optimization also can wait. That said, multiple methods is a valid option for the API. E.g., Guido generally prefers that distinctions that can't be made on the type of arguments (such as translate_keep vs translate_drop) be done by giving different names rather than a flag argument. Do you *like* this API, or was this motivated primarily by the possibilities you see for optimization?
The question is how common are these tasks, I don't have any statistics regarding this.
Frequency is useful information, but if you don't have it, don't worry about it.
So in general case they should expand to 32 bit unsigned integers if I understand correctly?
No. The internal string representation is described here: https://www.python.org/dev/peps/pep-0393/. As in the Unicode standard itself, you should think of characters as integers. Yes, with PEP 393 you can deduce the representation of a string from its contents, but you can't guess for individual characters in a longer string -- the whole string has the width needed for its widest character.
The width is constant for any given string. However, I don't see at this point that you'll need more than the functions available in Python already, plus one or more wrappers to marshal the information your API accepts to the data that str.translate wants. Of course later it may be worthwhile to rewrite the wrapper in C and merge it into the existing str.translate(), or the multiple methods you suggest above.
This is not going to be a problem if you're running Python and can enter the program and digits. In any case, the API is going to have to be convenient for all the people who expect that they will never again be reduced to a hex keypad and 7-segment display.

On 25 October 2016 at 19:10, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Certainly I like the look of distinct functions more. It allows me to visually parse the code effectively, so e.g. for str.remove() I would not need to look in docs to understand what the function does. It has its downside of course, since new definitions can accidentally be similar to current ones, so more names, more the probability that no good names are left. Speed is not so important for majority of cases, at least for my current tasks. However if I'll need to process very large texts (seems like I will), speed will be more important.
Just in some cases I need to convert them to numpy arrays back and forth, so this unicode vanity worries me a bit. But I cannot clearly explain why exactly I need this.
Here I will dare to make a lyrical digression again. It could have made an impression that I am stuck in the nineties or something. But that is not the case. In the nineties I used the PC mostly to play Duke Nukem (yeah, big times!). And all the more I hadn't any idea what is efficiency of information representation and readability. Now I kind of realize it. So I am just not the one who believes in these maximalistic "we need over 9000 glyphs" talks. And, somewhat prophetic view on this: with the coming of the cyber era this all will be flushed so fast, that all these diligences around unicode could look funny actually. And a hex keypad will not sound "retro" but "brand new". In other words: I feel really strongly that nothing besides standard characters must appear in code sources. If one wants to process unicode, then parse them as resources. So please, at least out of respect to the rationally minded, don't make code look like a christmas-tree. BTW, I use VIM to code actually so anyway I will not see them in my code. Mikhail

Mikhail V writes:
OK, as I said, you're in accord with Guido on that. His rationale is somewhat different, but that's OK.
Just in some cases I need to convert them to numpy arrays back and forth, so this unicode vanity worries me a bit.
I think you're borrowing trouble you actually don't have. Either way, the rest of the world *needs* Unicode to do their work, and it's not going to go away. On the positive side, turning a string into a list of codepoints is trivial: [ord(c) for c in string]
So I am just not the one who believes in these maximalistical "we need over 9000 glyphs" talks.
But you don't need to believe in it. What you do need to believe is that the rest of us believe that we need the union of our character sets as a single, universal character set. As it happens, although there are differences of opinion over how to handle Unicode in Python, there is consensus that Python does have to handle Unicode flexibly, effectively and efficiently. Believe me, it *is* a consensus. If you insist on bucking it, you'll have to do it pretty much alone, perhaps even maintaining your own fork of Python.

On 26 October 2016 at 20:58, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
All OK now?
Not really. I tried with a simple example

intab = "ae"
outtab = "XM"
table = string.maketrans(intab, outtab)
collections.defaultdict(lambda: None, **table)

and this gives me TypeError: type object argument after ** must be a mapping, not str But probably I misunderstood the idea. Anyway this code does not make much sense to me, I would never in life understand what is meant here. And in my not so big, but not so small, Python experience I *never* had an occasion using collections or lambda.
I was merely talking about syntax and source-file standards, not about unicode strings. No doubt one needs some way to store different glyph sets. So I was talking about that: if one defines a syntax and has good intentions for readability in mind, there is not much rationale to adapt the syntax to the current "hybrid" system: 7-bit and/or multibyte paradigm. Again this is a too-far-going discussion, but one should probably not look much ahead on those. The situation is not so good in this sense that most standard software is attached to this strange paradigm (even that which does not have anything to do with multi-lingual typography). So IMO something has gone wrong with those standard characters.
As for me, I would take the path of developing my own IDE which will enable typographic-quality rendering and of course all useful glyphs, such as curly quotes, bullets, etc, which is all fundamental to any possible improvement of the cognitive qualities of code. And I'll stay in 8-bit boundaries, that's for sure. So if Python takes the path of "unicode" code input (e.g. for some punctuation characters) this would only add a minor issue for generating valid Python source files in this case. Mikhail

I've lost track of what (if anything) is actually being proposed here... so I'm going to try a quick summary:

1) an easy way to spell "remove all the characters other than these"

I think that's a good idea. What with unicode having an enormous number of code points, it really does make sense to have a way to specify only what you want, rather than what you don't want. Back in the good old days of 1-byte chars, it wasn't hard to build up a full 256 element translate table -- not so much anymore. And one of the whole points of str.translate() is good performance.

a) a new method: str.remove_all_but(sequence_of_chars) (naming TBD)

b) a new flag in translate (kind of like the decode keywords): str.translate(table, missing='ignore'|'remove')

(b) has the advantage of adding translation and removal in one fell swoop -- but if you only want to remove, then you have to make a translation table of 1:1 mappings -- not hard, but annoying:

table = {ord(c): c for c in sequence_of_chars}

I'm on the fence about what I personally prefer.

2) (in another thread, but similar enough) being able to pass in more than one string to replace:

str.replace(old=seq_of_strings, new=seq_of_strings)

I know I've wanted this a lot, and certainly from a performance perspective, it could be a nice bonus. But: it overlaps a lot with str.translate -- at least for single character replacements -- so really why? So it would really only make sense if it supported multi-char strings:

str.replace(old=("aword", "another_word"), new=("something", "something else"))

However: a string IS a sequence of strings, so we'd have confusion about that:

str.replace("this", "four")

Does the user want the word "this" replaced with the word "four" -- or do they want each character replaced? Maybe we'd need a .replace_many() method? ugh! There are also other issues with what to do with repeated / overlapping characters:

str.replace(("aaa", "a", "b"), ("b", "bbb", "a"))

and all sorts of other complications!
THAT I think could be nailed down by defining the "order of operations". Does it loop through the entire string for each item? or through each item for each point in the string? Note that if you loop through the entire string for each item, you might as well have written the loop yourself:

for old, new in zip(old_list, new_list):
    s = s.replace(old, new)

and at least if the length of the string is long-ish, and the number of replacements short-ish -- performance would be fine.

*** So the question is -- is there support for these enhancements? If so, then it would be worth hashing out the details. But the next question is -- does anyone care enough to manage that process -- it'll be a lot of work! NOTE: there has also been a fair bit of discussion in this thread about ordinals vs characters, and unicode itself -- I don't think any of that resulted in any possible proposals... -CHB On Wed, Oct 26, 2016 at 2:48 PM, Mikhail V <mikhailwas@gmail.com> wrote:

On 2016-10-26 23:17, Chris Barker wrote:
c) pass a function that returns the replacement:

def replace(c):
    return c.upper() if c.isalpha() else ''

str.translate(replace)

The replacement function could be called only on distinct codepoints.
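MRAB's "pass a function" option can already be approximated today with a dict subclass, since str.translate looks entries up with __getitem__ and dict falls back to __missing__ (FuncTable is a made-up name for this sketch):

```python
class FuncTable(dict):
    """Translate table that computes each replacement on demand."""
    def __init__(self, func):
        super().__init__()
        self.func = func

    def __missing__(self, key):
        # compute once per distinct codepoint, then cache it
        result = self[key] = self.func(chr(key))
        return result

table = FuncTable(lambda c: c.upper() if c.isalpha() else '')
print('ab0c2m3g5'.translate(table))  # ABCMG
```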
Possible choices are: 1) Use the given order. 2) Check from the longest to the shortest. If you're going to pick choice 2, does it have to be 2 tuples/lists? Why not a dict instead?
[snip]
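A sketch of the dict-based, longest-first multi-replace MRAB suggests (replace_many is an illustrative name, not a proposed API):

```python
import re

def replace_many(s, mapping):
    # try the longest keys first, so 'aaa' wins over 'a'
    keys = sorted(mapping, key=len, reverse=True)
    pattern = '|'.join(re.escape(k) for k in keys)
    return re.sub(pattern, lambda m: mapping[m.group(0)], s)

print(replace_many('aaab', {'aaa': 'b', 'a': 'x', 'b': 'a'}))  # ba
```

Because every match is consumed in a single left-to-right pass, replacements never cascade into each other, which sidesteps the 'aaa'/'a'/'b' complications above.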

On Wed, Oct 26, 2016 at 3:48 PM, MRAB <python@mrabarnett.plus.com> wrote:
then we have a string.translate() that accepts a table of string replacements, rather than individual character replacements -- maybe a good idea! -CHB

On 27 October 2016 at 00:17, Chris Barker <chris.barker@noaa.gov> wrote:
Exactly that is the proposal. And for same exact reason that you point out, I also can't give a comment what would be better. It would be indeed quite strange from syntactical POV if I just want to remove "all except" and must call translate(). So ideally both should exist I think. Mikhail

On Wed, Oct 26, 2016 at 5:32 PM, Mikhail V <mikhailwas@gmail.com> wrote:
That kind of violates OWTDI though. Probably one's enough. And in fact, with the use-cases I can think of, and the one you mentioned, they are really two steps: there are the characters you want to translate, and the ones you want to keep, but the ones you want to keep are a superset of the ones you want to translate. So if we added the "remove" option to .translate(), then you would need to add all the "keep" characters to your translate table. I'm thinking they really are different operations; give them a different method. -CHB

On 10/26/2016 6:17 PM, Chris Barker wrote:
In other words, 'only keep these'. We already have easy ways to create filtered strings.
I expect the first to be a bit faster. Either can be wrapped in a keep() function. If one has a translation dictionary d, use that twice in the genexp.
-- Terry Jan Reedy
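Terry's "use the dict twice in the genexp" remark might look like this (the table here is illustrative): the dict supplies both the membership test and the replacement, giving translate-and-drop in one expression.

```python
d = {ord(c): c.upper() for c in 'abc'}  # translate-and-keep table
s = 'ab0c2m3g5'
s2 = ''.join(d[ord(c)] for c in s if ord(c) in d)
print(s2)  # ABC
```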

On Fri, Oct 28, 2016 at 7:28 AM, Terry Reedy <tjreedy@udel.edu> wrote:
s = 'kjskljkxcvnalsfjaweirKJZknzsnlkjsvnskjszsdscccjasfdjf' s2 = ''.join(c for c in s if c in set('abc'))
pretty slick -- but any hope of it being as fast as a C implemented method? for example, with a 1000 char string:

In [59]: % timeit string.translate(table)
100000 loops, best of 3: 3.62 µs per loop

In [60]: % timeit ''.join(c for c in string if c in set(letters))
1000 loops, best of 3: 1.14 ms per loop

so the translate() method is about 300 times faster in this case. (and it used a defaultdict with a None factory, which is probably a bit slower than a pure C implementation might be.) I've always figured that Python's rich string methods provided two things: 1) a single method call to do common things 2) nice, fast, pure C performance. So I think a "keep these" method would help with both of these goals. -CHB

Chris Barker writes:
pretty slick -- but any hope of it being as fast as a C implemented method?
I would expect not in CPython, but if "fast" matters, why are you using CPython rather than PyPy or Cython? If it matters *that* much, you can afford to write your own C implementation. But I doubt that fast matters "that much" often enough to be worth maintaining yet another string method in Python. Byte-shoveling servers might want it for bytes, though.
Sure, but the translate method already gives you that, and a lot more. Note that when you're talking about working with Unicode characters, no natural language activity I can imagine (not even translating Buddhist texts, which involves a couple of Indian scripts as well as Han ideographs) uses more than a fraction of defined characters. So really translate with defaultdict is a specialized loop that marries an algorithmic body (which could do things like look up the original script or other character properties to decide on the replacement for the generic case) with a (usually "small") table of exceptions. That seems like inspired design to me.

On Tue, Nov 1, 2016 at 12:15 AM, Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
oh come on!
If it matters *that* much, you can afford to write your own C implementation.
This is about a possible addition to the stdlib -- me writing my own C implementation has nothing to do with it.
This could be said about every string method in Python -- I understand that every addition is more code to maintain. But somehow we are adding all kinds of stuff like yet another string formatting method, talking about null coalescing operators and who knows what -- those are all a MUCH larger burden -- not just for maintaining the interpreter, but for everyone using python having more to remember and understand. On the other hand, powerful and performant string methods are a major plus for Python -- a good reason to use it over Perl :-) So a new one that provides, as I wrote before:
1) single method call to do a common thing
2) nice fast, pure C performance
would fit right into to Python, and indeed, would be a similar implementation to existing methods -- so the maintenance burden would be a small addition (i.e if the internal representation for strings changed, all those methods would need re-visiting and similar changes) So the only key question is -- is the a common enough use case?
yes but only with the fairly esoteric use of defaultdict. which brings me back to the above: 1) single method call to do a common thing the nice thing about a single method call is discoverability -- no newbie is going to figure out the .translate + defaultdict approach.
which is why you may want to remove all the others :-) So really translate with defaultdict is a specialized loop that
indeed -- .translate() itself is remarkably flexible -- you could even pass in a custom class that does all sorts of logic. and adding the defaultdict is an easy way to add a useful feature. But again, advanced usage and not very discoverable. Maybe that means we need some more docs and/or perhaps recipes instead. Anyway, I joined this thread to clarify what might be on the table -- but while I think it's a good idea, I don't have the bandwidth to move it through the process -- so unless someone steps up that does, we're done. -CHB

On 27 October 2016 at 00:17, Chris Barker <chris.barker@noaa.gov> wrote:
Actually even with ASCII (read: for Python 2.7) I would also be happy to have such a function: say I just want to keep only digits, so I write:

    digits = "0123456789"
    newstring = somestring.keep(digits)

Though I can do it other ways, this would be a much simpler and clearer way to do it. And I suppose it is quite a common task, not only for me. Currently 99% of my programs are in Python 2.7. And I started to use Python 3 only for tasks when I want to process unicode strings (ironically, only to get rid of unicode).

Mikhail
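The proposed keep() method does not exist, but as a sketch it could be prototyped today as a plain helper function (the name and sample data here are illustrative):

```python
def keep(s, allowed):
    """Return s with every character not in allowed dropped.
    (Hypothetical helper, not a real str method.)"""
    allowed = set(allowed)  # O(1) membership tests
    return ''.join(c for c in s if c in allowed)

digits = "0123456789"
print(keep("some stuff and some 456 23 numbers 888", digits))  # -> '45623888'
```

This is the "two line function" version; the thread's question is whether it is common enough to deserve a method with C-level performance.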

On Wed, Nov 2, 2016 at 12:02 PM, Mikhail V <mikhailwas@gmail.com> wrote:
well, with ascii, it's not too hard to make a translation table:

    digits = "0123456789"
    table = [(o if chr(o) in digits else None) for o in range(256)]
    s = "some stuff and some 456 23 numbers 888"
    s.translate(table)
    '45623888'

but then there is the defaultdict way:

    s.translate(defaultdict(lambda: None, {ord(c): c for c in digits}.items()))
    '45623888'

wasn't that easy? Granted, if you need to do this, you'd wrap it in a function like Chris A. suggested. But this really isn't easy or discoverable -- it took me a fair bit of fiddling to get right, and I knew I was looking for a defaultdict implementation.

Also:

    In [43]: table
    Out[43]: defaultdict(<function __main__.<lambda>>,
             {48: '0', 49: '1', 50: '2', 51: '3', 52: '4',
              53: '5', 54: '6', 55: '7', 56: '8', 57: '9'})

    In [44]: s.translate(table)
    Out[44]: '45623888'

    In [45]: table
    Out[45]: defaultdict(<function __main__.<lambda>>,
             {32: None, 48: '0', 49: '1', 50: '2', 51: '3', 52: '4',
              53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 97: None,
              98: None, 100: None, 101: None, 102: None, 109: None,
              110: None, 111: None, 114: None, 115: None, 116: None,
              117: None})

defaultdict puts an entry in for every ordinal checked -- this could get big -- granted, probably not a big deal with modern computer memory, but still... it might even be worth making a NoneDict for this:

    class NoneDict(dict):
        """
        Dictionary implementation that always returns None when a key
        is not in the dict, rather than raising a KeyError
        """
        def __getitem__(self, key):
            try:
                val = dict.__getitem__(self, key)
            except KeyError:
                val = None
            return val

(see enclosed -- it works fine with translate)

(OK, that was fun, but no, not really that useful)

> Though I can do it other ways, this would be a much simpler and clearer
> way to do it. And I suppose it is quite a common task, not only for me.
That's the key question -- is this a common task? If so, then while there are ways to do it, they're not easy nor discoverable. And while some of the guiding principles of this list are "not every two line function needs to be in the standard lib" and "put it up on PyPI, and see if a lot of people find it useful" -- it's actually kind of silly to put a single function up as a PyPI package, and I doubt many people would find it if you did.

-CHB

On Thu, Oct 27, 2016 at 8:48 AM, Mikhail V <mikhailwas@gmail.com> wrote:
You're 99% of the way to understanding it. Try the exercise again in Python 3. You don't have string.maketrans (which creates a 256-byte translation mapping) - instead, you use a dictionary. ChrisA
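A minimal sketch of the Python 3 approach ChrisA is pointing at -- str.maketrans builds the ordinal-keyed dict for you, and its three-argument form covers mapping and deletion in one call:

```python
# str.maketrans(x, y, z): chars in x map to the corresponding
# char in y; every char in z is deleted.
table = str.maketrans("abc", "xyz", "!")
print("aback!".translate(table))  # -> 'xyxzk'

# The table is just a dict keyed by ordinals:
print(table)  # {97: 120, 98: 121, 99: 122, 33: None}
```

What it does not do -- and what this thread is about -- is drop characters that are absent from the table; those pass through unchanged.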

return string.translate(collections.defaultdict(lambda: None, table))
Nice! I forgot about defaultdict -- so this just needs a recipe somewhere -- maybe even in the docs for str.translate.

BTW, great use case for defaultdict -- I had been wondering what the point was, given that a regular dict has .setdefault.

-CHB

On Tue, Oct 25, 2016 at 05:15:58PM +0200, Mikhail V wrote: [...]
Why not? What is the difference between typing

    123: 456
    124: 457
    125: 458
    # two hundred more lines

in a "table.txt" file, and typing:

    {
    123: 456,
    124: 457,
    125: 458,
    # two hundred more lines
    }

in a "table.py" file? The difference is insignificant. And the Python version can be cleaned up:

    for i in range(123, 333):
        table[i] = 456 - 123 + i

Not all data should be written as code, especially if you expect unskilled users to edit it, but generating data directly in code is a very powerful technique, and the strict syntax of the programming language helps prevent some errors. [...]
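Steven's generate-the-table-in-code point can be seen end to end in a short runnable sketch (the ordinal ranges are his example values):

```python
# Generate a translation table in code instead of maintaining a
# 200-line data file: ordinals 123..332 each shift up by 333.
table = {i: 456 - 123 + i for i in range(123, 333)}

# Spot-check against the hand-typed entries above:
assert table[123] == 456
assert table[124] == 457
assert table[125] == 458
```

The dict comprehension replaces the whole file, and a typo in it is a SyntaxError rather than a silently corrupt table.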
Motivation is that those can be optimised for speed
That's not a motivation. Why are you talking about "optimizing for speed" functions that we have not yet established are needed? That reminds me of a story I once heard of somebody who was driving across the desert in the US once. One of his passengers noticed the highway signs and said "Wait, aren't we going the wrong way?" The driver replied "Who cares, we're making fantastic time!" Optimizing a function you don't need is not an optimization. It is a waste of time. -- Steve

On Wed, Oct 26, 2016 at 04:29:13AM +0200, Mikhail V wrote:
I need translate() which drops non-defined chars. Please :) No optimisation, no new syntax. deal?
I still wonder whether this might be worth introducing as a new string method, or an option to translate. But the earliest that will happen is Python 3.7, so in the meantime, something like this should be enough:

    # untested
    keep = "abcdßαβπд∞"
    text = "..."

    # Find all the characters in text that are not in keep:
    delchars = set(text) - set(keep)
    delchars = ''.join(delchars)

    text = text.translate(str.maketrans("", "", delchars))

-- Steve
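Steve marks his snippet untested; run as written it does work. Here it is verified on a small input (the sample text is mine, not from the thread):

```python
keep = "abcdßαβπд∞"
text = "abc xyz ßπ 123"

# Characters present in text but not in keep get deleted via the
# three-argument form of str.maketrans.
delchars = ''.join(set(text) - set(keep))
text = text.translate(str.maketrans("", "", delchars))
print(text)  # -> 'abcßπ'
```

Note the table is built from the *text*, not from the full character space -- so its size is bounded by the number of distinct characters actually seen, which sidesteps the 16-bit lookup-table worry from earlier in the thread.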

Mikhail V writes:
I need translate() which drops non-defined chars. Please :)
import collections

    def translate_or_drop(string, table):
        """
        string: a string to process
        table: a dict as accepted by str.translate
        """
        return string.translate(collections.defaultdict(lambda: None, table))

All OK now?
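For concreteness, the recipe exercised on sample data (the digits table and input string are mine). One detail worth flagging: the table must be passed to defaultdict positionally -- **-unpacking would require string keys, while translate tables are keyed by ordinals:

```python
import collections

def translate_or_drop(string, table):
    """
    string: a string to process
    table: a dict as accepted by str.translate
    """
    # Any ordinal missing from table maps to None, i.e. is dropped.
    return string.translate(collections.defaultdict(lambda: None, table))

digits = {ord(c): c for c in "0123456789"}
print(translate_or_drop("some 456 and 23 stuff 888", digits))  # -> '45623888'
```

This is Mikhail's requested "translate() which drops non-defined chars" in four lines -- the question the thread leaves open is whether it deserves to be a method.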
participants (10)
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- Mikhail V
- MRAB
- Paul Moore
- Ryan Birmingham
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy