Re: New explicit methods to trim strings

Barry
I don't understand the last sentence. I had in mind a case where you might want to remove repetitions of an affix without knowing how many there were (possibly none).
Yep, I could have worded that better. I was wondering what example would need the count, and whether, when you have the count, it's easy to solve another way. I'm less sure, having thought more about it. As you noted later, str.replace() has a count. So why not for prefix/suffix stripping? Barry
Rob Cliffe

On Sun, Mar 22, 2020 at 2:08 AM Barry Scott <barry@barrys-emacs.org> wrote:
I imagine that count=1 is the most common use case for replace() anyway, so it seems it would be useful to have a way to select either "one" or "all". Once we have that, why not "count", with -1 meaning "all", just like it does for .replace()? After all, why introduce yet another API? That being said, I'd be just as happy with only one.
On a related note, I just noticed:
    In [8]: s.replace('a', 'x', count=2)
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-8-ef1478a5d3cb> in <module>
    ----> 1 s.replace('a', 'x', count=2)
    TypeError: replace() takes no keyword arguments
Having count be a keyword parameter seems like the natural API to me. Is it just legacy that it's not? Is there a good reason not to make it a keyword parameter? (It is optional.) Frankly, that's always been confusing, particularly as until 3.8 you couldn't write a Python function with a parameter that has a default but is not usable as a keyword at all.
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
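
For reference, a minimal sketch of how the count argument of str.replace() behaves when passed positionally, which is the form that works regardless of whether the keyword spelling is accepted. The sample string and variable names are illustrative only.

```python
s = "banana"

# count is the third positional argument of str.replace().
first_only = s.replace("a", "x", 1)    # replace at most one occurrence
first_two = s.replace("a", "x", 2)     # replace at most two occurrences
everything = s.replace("a", "x", -1)   # a negative count means "no limit"
default = s.replace("a", "x")          # omitting count behaves like -1

print(first_only, first_two, everything, default)
```

Replacement proceeds from the left, so a count of 1 always affects the first occurrence.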

On Mon, Mar 23, 2020 at 3:53 AM Christopher Barker <pythonchb@gmail.com> wrote:
Do you mean "other than not specifying the count", or do you actually mean that it's more common than replacing all? Because in my experience, replacing all is *by far* the most common case - but yes, replacing just one would be the next most common. ChrisA

On Sun, Mar 22, 2020 at 10:08 AM Chris Angelico <rosuav@gmail.com> wrote:
yes, that -- most common use case for using count at all. I'm suggesting that if -1 and 1 were the only options, very few people would notice :-)
Because in my experience, replacing all is *by far* the most common case
Agreed -- that's why it's a good default. I don't know that I ever even noticed that count was there before now :-) -CHB

On Sun, Mar 22, 2020 at 9:54 AM Christopher Barker <pythonchb@gmail.com> wrote:
Please, please. removeprefix/removesuffix do not need a count. The use case is quite different from that of replace. And they should only remove (at most) one prefix or suffix. --Guido van Rossum (python.org/~guido)

On Sun, 22 Mar 2020 at 17:58, Guido van Rossum <guido@python.org> wrote:
+1 from me. These should be simple functions to remove a prefix/suffix (note "a prefix" = "one prefix"). Let's not over-engineer them. I've needed to remove one prefix/suffix. I've never needed to remove more than one. Paul
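
The "at most one prefix or suffix" behaviour argued for above can be sketched with plain slicing. The helper names remove_prefix and remove_suffix are hypothetical stand-ins for the proposed methods, written as functions so the sketch does not depend on any particular Python release.

```python
def remove_prefix(s: str, prefix: str) -> str:
    """Return s without one leading copy of prefix, if it is present."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


def remove_suffix(s: str, suffix: str) -> str:
    """Return s without one trailing copy of suffix, if it is present."""
    # The emptiness check matters: s[:-0] would be the empty string.
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s


print(remove_prefix("ababc", "ab"))             # only one "ab" is removed
print(remove_suffix("thesis.doc.doc", ".doc"))  # only one ".doc" is removed
```

Unlike strip(), which treats its argument as a set of characters, these helpers match the affix as a whole string and never remove more than one copy.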

Stephen J. Turnbull wrote:
The only cases I can remember are files named things like "thesis.doc.doc" in GUI environments. ;-)
For edge cases like that, something like `"thesis.doc.doc".removesuffix(".doc").removesuffix(".doc")` should suffice, no? It may not be the cleanest-looking solution, but IMO, it's better than over-complicating the method for something that would be used rarely at best.
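
When the number of repetitions is not known in advance, the chained calls above generalise to a short loop. This is only a sketch; remove_suffix_all is a hypothetical helper name, not part of the proposal.

```python
def remove_suffix_all(s: str, suffix: str) -> str:
    """Strip every trailing repetition of suffix from s."""
    if not suffix:
        # An empty suffix always "matches", so bail out to avoid looping forever.
        return s
    while s.endswith(suffix):
        s = s[:-len(suffix)]
    return s


print(remove_suffix_all("thesis.doc.doc", ".doc"))
print(remove_suffix_all("thesis", ".doc"))
```

Keeping the loop in user code leaves the method itself limited to removing a single affix.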

I personally think that there is a better case for an ignore_case flag – the number of times that I have been caught out with [‘a.doc’, ‘b.Doc’, ‘c.DOC’] especially on MS platforms. Steve Barnes

On Mon, Mar 23, 2020 at 5:06 PM Steve Barnes <GadgetSteve@live.co.uk> wrote:
I personally think that there is a better case for an ignore_case flag – the number of times that I have been caught out with [‘a.doc’, ‘b.Doc’, ‘c.DOC’] especially on MS platforms.
Case insensitivity is a mess. I think it'd be a lot cleaner to keep this as-is, and for the case insensitive use-case, people can still do it manually. (Basically case-fold for the comparison, but then trim from the original as is.) ChrisA
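
A rough sketch of the manual approach described above: case-fold only for the comparison, then slice the original string. It assumes that the matched tail has the same length as the suffix, which holds for ASCII file extensions but not for every Unicode string. The name remove_suffix_ignore_case is hypothetical.

```python
def remove_suffix_ignore_case(s: str, suffix: str) -> str:
    """Remove one trailing suffix, comparing case-insensitively.

    Naive: assumes the tail of s that matches has len(suffix) characters.
    """
    if suffix and s[-len(suffix):].casefold() == suffix.casefold():
        # Trim from the original, so the rest of the string keeps its case.
        return s[:-len(suffix)]
    return s


names = ["a.doc", "b.Doc", "c.DOC", "Notes.txt"]
print([remove_suffix_ignore_case(name, ".doc") for name in names])
```

The result preserves the original casing of whatever remains, because only the comparison uses the folded text.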

On Mon, Mar 23, 2020 at 7:06 PM Alex Hall <alex.mojaki@gmail.com> wrote:
I think I'm missing something, why is case insensitivity a mess?
Because there are many characters that case fold in strange ways. "ıIiİ".casefold() == 'ıiii̇', which means that lowercase dotless ı doesn't casefold to the same thing as uppercase dotless I. Some characters case fold to strings of different lengths, such as "ß", which casefolds to "ss". I haven't even tried what happens with combining characters vs combined characters. And Unicode case folding is already a simplified version of reality; what actual humans expect can be even more complicated, such as (I think) German case folding rules being different for names and for book titles, and the way that umlauted letters are case folded. On the other hand, this might actually mean it's *better* to have a dedicated case-insensitive-cut-prefix operation. It would be difficult to define it in easy terms, but basically it should be such that the returned string (if not identical to the original) is the longest suffix of the original string such that, if the returned string were appended to the prefix and the result case folded, it would be the same as the original string case folded. But there could be other definitions, just as complicated, and not necessarily more correct. In any case, this can (and in my opinion should) be deferred for later. Start with the simple one that doesn't care about all these complexities, and then expand from there as the need is found. ChrisA
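
The folding quirks and the tentative definition above can be tried out directly. The first part only prints what str.casefold() does; the second is a brute-force sketch of the "longest suffix" definition under a hypothetical name, cut_prefix_caseless, and is meant as an illustration of that one definition rather than a recommended design.

```python
# Case folding is not one-to-one, and it can change the length of a string.
print(len("\u00df"), len("\u00df".casefold()))   # sharp s
print(len("\u0130"), len("\u0130".casefold()))   # capital I with dot above
print("\u0131".casefold() == "I".casefold())     # dotless i vs. capital I


def cut_prefix_caseless(s: str, prefix: str) -> str:
    """Return the longest suffix `rest` of s such that
    (prefix + rest).casefold() == s.casefold(); return s if none exists."""
    target = s.casefold()
    # i == 0 tries the whole string first, so the first hit is the longest.
    for i in range(len(s) + 1):
        rest = s[i:]
        if (prefix + rest).casefold() == target:
            return rest
    return s


print(cut_prefix_caseless("FOObar", "foo"))
print(cut_prefix_caseless("Stra\u00dfe", "STRASS"))
```

The loop is quadratic in the length of the string, which is acceptable for a demonstration but is one more reason such an operation would need careful design.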

On 3/23/20 4:33 AM, Chris Angelico wrote:
The issue is that cases in Unicode are difficult, and can be locale dependent (Unicode calls this Tailoring). In the above example with the i's, casefold would have needed to be told that we were dealing with the Turkish language (or maybe some other language with the same issue), but currently the Python casefold function doesn't support the needed Tailoring (and I don't know if there is an exhaustive listing somewhere of the needed tailoring). Fully handling Unicode so as to meet all national expectations is VERY difficult. It doesn't surprise me that the Python Standard Library doesn't attempt to get it totally right, but settles for just dealing with the 'default' processing. The biggest part of this mess is that Unicode had to accept some compromises in defining Unicode (because the languages themselves present problems and inconsistencies), and when you hit a spot where the compromise goes against what you are trying to do at the moment, it gets difficult. -- Richard Damon

On Mon, Mar 23, 2020 at 10:40 PM Richard Damon <Richard@damon-family.org> wrote:
Right, which is why for a proposal like this, it's best to start with the simple and straightforward option of case sensitivity and precise matching. Removing a prefix of "a\u0301" will not remove a leading "\xe1" and vice versa (just as those two strings don't compare equal). ChrisA
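
The precise-matching behaviour described above can be illustrated with a slicing-based helper. remove_prefix is a hypothetical stand-in for the proposed method, and the sample word is an arbitrary choice.

```python
def remove_prefix(s: str, prefix: str) -> str:
    """Remove one leading copy of prefix, comparing code point by code point."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


composed = "\xe1"        # one code point: a with acute
decomposed = "a\u0301"   # two code points: "a" plus a combining acute accent
word = composed + "rbol"

print(composed == decomposed)
print(remove_prefix(word, decomposed) == word)   # nothing is removed
print(remove_prefix(word, composed))
```

Both spellings render the same on screen, yet only the one that matches code point for code point is removed.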

On Mar 23, 2020, at 04:51, Chris Angelico <rosuav@gmail.com> wrote:
Agreed, but I think it’s not just “to start with”, but forever, or at least as long as Python strings are sequences of Unicode code points. If "Café".startswith("Cafe\u0301") is false, "Café".stripprefix("Cafe\u0301") had better not strip anything. And as long as "é" in "Cafe\u0301" and any(ch=="é" for ch in "Cafe\u0301") are false, startswith is correct. By comparison, in Swift, "Café".hasPrefix("Cafe\u{0301}") is true, because "Cafe\u{0301}" is a sequence of four Unicode scalars, the fourth of which is 'é', as opposed to Python where it’s a sequence of five Unicode code points. And of course Swift also has a slew of methods to do things like localized vs. default case-insensitive equality, substring, etc. testing, none of which Python has, or should have, as long as its strings are made of code points rather than scalars (or EGCs or whatever).
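
To see the code point view of these strings concretely, the sketch below lists the code points of both spellings using the standard unicodedata module. The variable names are illustrative.

```python
import unicodedata

composed = "Caf\xe9"        # precomposed e with acute
decomposed = "Cafe\u0301"   # "e" followed by a combining acute accent

for label, text in [("composed", composed), ("decomposed", decomposed)]:
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)

# Membership and prefix tests also work on code points, not on rendered glyphs.
print("\xe9" in decomposed)
print(composed.startswith(decomposed))
```

Iterating over a str yields one item per code point, which is why the two spellings have different lengths.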

Folks, this is now a draft PEP, and is being (has been?) discussed on python-dev -- time to go there if you want more input. https://mail.python.org/archives/list/python-dev@python.org/thread/WFEWPAOVX... -CHB

Christopher Barker writes:
This is now a draft PEP, and is being (has been?) discussed on python-dev -- time to go there if you want more input.
They don't, and it's an "idea" separate from the PEP. Python is not going to change the semantics of str from array of code points to array of characters. That's a Python 4000 change, and I doubt anyone who experienced Python 3000 will be on board.
Maybe this would be something for the locale or unicodedata module?
-1. It should be done (by somebody != me ;-) but not in stdlib. Steve

On Mon, Mar 23, 2020 at 6:01 PM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Sure, then please start a new thread, or at least change the title. -CHB

On Mar 23, 2020, at 12:40, Chris Angelico <rosuav@gmail.com> wrote:
Maybe. But a complete suite of functions for treating strings as made of Unicode scalars or EGCs or whatever seems like a lot of design work, and I don’t know if there’s enough demand for anyone to be willing to do it. Swift is a different story for a lot of reasons (brand new language, iterator model that makes “non-randomly-accessible sequence” a sensible thing, corporate team, a much worse status quo ante where strings were sequences of UTF-16 code units, a need to interface natively with Cocoa and its NFKD decomposed strings, …). For Python, it seems like if nobody’s put anything (other than thin wrappers around ICU) on PyPI, probably nobody needs support in the stdlib.

On 3/23/20 3:31 PM, Andrew Barnert via Python-ideas wrote:
I wasn't familiar with the term Scalar as used in Unicode, so I looked it up, and I think you are incorrect here. From the Glossary: Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF (hex) and E000 to 10FFFF (hex) inclusive. Thus Scalars ARE just code points (but exclude the surrogates). What you may be thinking of is the Grapheme. It may be that Swift does some automatic conversion to a canonical form, to make the strings match. In fact, just because the text displays as Café doesn't help you know how many code points there are, as the glyph/grapheme é can be expressed as either a single code point \u00E9 (NFC), or the sequence e \u0301 (NFD), and Python can express it as either. A basic rule with Unicode strings is that if you are going to be doing these sorts of comparisons, you should make sure you have both strings in the same normal form. -- Richard Damon
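
The "same normal form" rule can be sketched with unicodedata.normalize from the standard library. The helper remove_prefix_normalized is hypothetical, and returning the normalized string (rather than the original) is a simplifying assumption of this sketch.

```python
import unicodedata


def remove_prefix_normalized(s: str, prefix: str, form: str = "NFC") -> str:
    """Normalize both strings to the same form, then remove one prefix.

    Note: the result is the normalized string, not a slice of the original.
    """
    s_norm = unicodedata.normalize(form, s)
    prefix_norm = unicodedata.normalize(form, prefix)
    if prefix_norm and s_norm.startswith(prefix_norm):
        return s_norm[len(prefix_norm):]
    return s_norm


nfc = unicodedata.normalize("NFC", "Cafe\u0301")   # composes to a single code point
nfd = unicodedata.normalize("NFD", "Caf\xe9")      # decomposes to e + combining accent
print(len(nfc), len(nfd))

# The two spellings now match, whichever one the caller happens to have.
print(repr(remove_prefix_normalized("Cafe\u0301 au lait", "Caf\xe9")))
```

The choice of form affects the outcome: under NFD a prefix of "Cafe" would match "Café" and leave a stray combining accent behind, whereas under NFC it would not match at all.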

participants (11)
- Alex Hall
- Andrew Barnert
- Barry Scott
- Chris Angelico
- Christopher Barker
- Guido van Rossum
- Kyle Stanley
- Paul Moore
- Richard Damon
- Stephen J. Turnbull
- Steve Barnes