Re: New explicit methods to trim strings

On 19 Mar 2020, at 22:12, Rob Cliffe rob.cliffe@btinternet.com wrote:
On 18/03/2020 20:16, Barry Scott wrote:
On 18 Mar 2020, at 18:03, Rob Cliffe via Python-ideas python-ideas@python.org wrote:
Consider that the start or end of a string may contain repetitions of an affix.
Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter?
The only ways to use this function without counting are to remove one prefix or to remove all of them. As Alex said, one prefix is the common case. For the remove-all case there are existing ways to do it.
If you are counting the number of prefix occurrences that exist, you can simply slice the answer without the strip-prefix function.
Barry
I don't understand the last sentence. I had in mind a case where you might want to remove repetitions of an affix without knowing how many there were (possibly none).
Yep, I could have worded that better. I was wondering what example would need the count, and whether, once you have the count, it's easy to solve another way. I'm less sure, having thought more about it.
As you noted later, str.replace() has a count. So why not for prefix/suffix stripping?
Barry
Rob Cliffe
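
A minimal sketch of the "one", "all" and "count" variants discussed above, assuming a hypothetical helper named strip_repeated_prefix (it is not a proposed str method, and the count=-1 convention is simply borrowed from str.replace()):

```python
def strip_repeated_prefix(text, prefix, count=-1):
    """Remove up to `count` leading copies of `prefix`; -1 means all of them."""
    if not prefix:
        return text  # an empty prefix would otherwise loop forever
    removed = 0
    while text.startswith(prefix) and (count < 0 or removed < count):
        text = text[len(prefix):]
        removed += 1
    return text


all_removed = strip_repeated_prefix("-+-+-+Spam", "-+")     # every copy
one_removed = strip_repeated_prefix("-+-+-+Spam", "-+", 1)  # just the first
untouched = strip_repeated_prefix("Spam", "-+")             # none present
print(all_removed, one_removed, untouched)
```

The loop does not need to know in advance how many repetitions there are, which is the case Rob describes.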

On Sun, Mar 22, 2020 at 2:08 AM Barry Scott barry@barrys-emacs.org wrote:
Should '-+-+-+Spam'.stripprefix('-+') remove just the first
occurrence? All of them? Does it need a 'count' parameter?
The only ways to use this function without counting are to remove one prefix
or to remove all of them.
I imagine that the count=1 is the most common use case for replace() anyway,
So it seems it would be useful to have a way to select either "one" or "all". Once we have that, why not "count", and -1 means "all", just like it does for .replace() -- after all, why introduce yet another API?
That being said, I'd be just as happy with only one.
On a related note, I just noticed:
In [8]: s.replace('a', 'x', count=2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-ef1478a5d3cb> in <module>
----> 1 s.replace('a', 'x', count=2)

TypeError: replace() takes no keyword arguments
Having count be a keyword parameter seems like the natural API to me. Is it just legacy that it's not? Is there a good reason not to make it a keyword parameter? (It is optional.)
Frankly, that's always been confusing -- particularly as until 3.8 you couldn't make a function with a parameter that has a default but can't be passed as a keyword at all.
-CHB
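
For reference, a small sketch of how count behaves when passed positionally, which works regardless of whether a given Python version accepts count= as a keyword (the session above shows a version that does not). The sample string is an assumption chosen for illustration:

```python
s = "banana"

one = s.replace("a", "x", 1)     # replace only the first occurrence
two = s.replace("a", "x", 2)     # replace the first two occurrences
every = s.replace("a", "x", -1)  # -1 means "all", same as omitting count

print(one, two, every)
```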

On Mon, Mar 23, 2020 at 3:53 AM Christopher Barker pythonchb@gmail.com wrote:
On Sun, Mar 22, 2020 at 2:08 AM Barry Scott barry@barrys-emacs.org wrote:
Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter?
The only ways to use this function without counting are to remove one prefix or to remove all of them.
I imagine that the count=1 is the most common use case for replace() anyway,
Do you mean "other than not specifying the count", or do you actually mean that it's more common than replacing all? Because in my experience, replacing all is *by far* the most common case - but yes, replacing just one would be the next most common.
ChrisA

On Sun, Mar 22, 2020 at 10:08 AM Chris Angelico rosuav@gmail.com wrote:
I imagine that the count=1 is the most common use case for replace()
anyway,
Do you mean "other than not specifying the count",
yes, that -- most common use case for using count at all.
I'm suggesting that if -1 and 1 were the only options, very few people would notice :-)
Because in my experience, replacing all is *by far* the most common case
Agreed -- that's why it's a good default. I don't know that I ever even noticed that count was there before now :-)
-CHB

On Sun, Mar 22, 2020 at 9:54 AM Christopher Barker pythonchb@gmail.com wrote:
On Sun, Mar 22, 2020 at 2:08 AM Barry Scott barry@barrys-emacs.org wrote:
Should '-+-+-+Spam'.stripprefix('-+') remove just the first
occurrence? All of them? Does it need a 'count' parameter?
The only ways to use this function without counting are to remove one prefix
or to remove all of them.
Please, please. removeprefix/removesuffix do not need a count. The use case is quite different from that of replace. And they should only remove (at most) one prefix or suffix.

On Sun, 22 Mar 2020 at 17:58, Guido van Rossum guido@python.org wrote:
On Sun, Mar 22, 2020 at 9:54 AM Christopher Barker pythonchb@gmail.com wrote:
On Sun, Mar 22, 2020 at 2:08 AM Barry Scott barry@barrys-emacs.org wrote:
Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter?
The only ways to use this function without counting are to remove one prefix or to remove all of them.
Please, please. removeprefix/removesuffix do not need a count. The use case is quite different from that of replace. And they should only remove (at most) one prefix or suffix.
+1 from me. These should be simple functions to remove a prefix/suffix (note "a prefix" = "one prefix"). Let's not over-engineer them.
I've needed to remove one prefix/suffix. I've never needed to remove more than one.
Paul

Stephen J. Turnbull wrote:
The only cases I can remember are files named things like "thesis.doc.doc" in GUI environments. ;-)
For edge cases like that, something like `"thesis.doc.doc".removesuffix(".doc").removesuffix(".doc")` should suffice, no? It may not be the cleanest looking solution, but IMO, it's better than over-complicating the method for something that would be used rarely at best.
On Mon, Mar 23, 2020 at 1:34 AM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Paul Moore writes:
I've needed to remove one prefix/suffix. I've never needed to remove more than one.
The only cases I can remember are files named things like "thesis.doc.doc" in GUI environments. ;-) _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/NT3ORR... Code of Conduct: http://python.org/psf/codeofconduct/

I personally think that there is a better case for an ignore_case flag – the number of times that I have been caught out with [‘a.doc’, ‘b.Doc’, ‘c.DOC’] especially on MS platforms.
Steve Barnes
From: Kyle Stanley aeros167@gmail.com
Sent: 23 March 2020 05:49
To: Stephen J. Turnbull turnbull.stephen.fw@u.tsukuba.ac.jp
Cc: python-ideas python-ideas@python.org
Subject: [Python-ideas] Re: New explicit methods to trim strings
Stephen J. Turnbull wrote:
The only cases I can remember are files named things like "thesis.doc.doc" in GUI environments. ;-)
For edge cases like that, something like `"thesis.doc.doc".removesuffix(".doc").removesuffix(".doc")` should suffice, no? It may not be the cleanest looking solution, but IMO, it's better than over-complicating the method for something that would be used rarely at best.
On Mon, Mar 23, 2020 at 1:34 AM Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Paul Moore writes:
I've needed to remove one prefix/suffix. I've never needed to remove more than one.
The only cases I can remember are files named things like "thesis.doc.doc" in GUI environments. ;-)

On Mon, Mar 23, 2020 at 5:06 PM Steve Barnes GadgetSteve@live.co.uk wrote:
I personally think that there is a better case for an ignore_case flag – the number of times that I have been caught out with [‘a.doc’, ‘b.Doc’, ‘c.DOC’] especially on MS platforms.
Case insensitivity is a mess. I think it'd be a lot cleaner to keep this as-is, and for the case insensitive use-case, people can still do it manually. (Basically case-fold for the comparison, but then trim from the original as is.)
ChrisA
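
A minimal sketch of the manual approach described above (case-fold for the comparison, trim from the original), assuming a hypothetical helper named remove_suffix_ignore_case. It compares only the last len(suffix) code points, so it deliberately misses matches where folding changes the length (for example "ß" against "ss") rather than risk cutting at the wrong position:

```python
def remove_suffix_ignore_case(text, suffix):
    """Drop `suffix` from the end of `text`, ignoring case in the comparison."""
    if not suffix:
        return text
    tail = text[-len(suffix):]  # same number of code points as the suffix
    if tail.casefold() == suffix.casefold():
        return text[:len(text) - len(tail)]  # slice the original, unfolded text
    return text


names = ["a.doc", "b.Doc", "c.DOC", "notes.txt"]
stems = [remove_suffix_ignore_case(name, ".doc") for name in names]
print(stems)
```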

I think I'm missing something, why is case insensitivity a mess?
On Mon, Mar 23, 2020 at 9:32 AM Chris Angelico rosuav@gmail.com wrote:
On Mon, Mar 23, 2020 at 5:06 PM Steve Barnes GadgetSteve@live.co.uk wrote:
I personally think that there is a better case for an ignore_case flag –
the number of times that I have been caught out with [‘a.doc’, ‘b.Doc’, ‘c.DOC’] especially on MS platforms.
Case insensitivity is a mess. I think it'd be a lot cleaner to keep this as-is, and for the case insensitive use-case, people can still do it manually. (Basically case-fold for the comparison, but then trim from the original as is.)
ChrisA

On Mon, Mar 23, 2020 at 7:06 PM Alex Hall alex.mojaki@gmail.com wrote:
I think I'm missing something, why is case insensitivity a mess?
Because there are many characters that case fold in strange ways. "ıIiİ".casefold() == 'ıiii̇', which means that lowercase dotless ı doesn't casefold to the same thing that uppercase dotless I does. Some characters case fold to strings of different lengths, such as "ß", which casefolds to "ss". I haven't even tried what happens with combining characters vs combined characters. And Unicode case folding is already a simplified version of reality; what actual humans expect can be even more complicated, such as (I think) German case folding rules being different for names and for book titles, and the way that umlauted letters are case folded.
On the other hand, this might actually mean it's *better* to have a dedicated case-insensitive-cut-prefix operation. It would be difficult to define it in easy terms, but basically it should be such that the returned string (if not identical to the original) is the longest suffix of the original string such that, if the returned string were appended to the prefix and the result case folded, it would be the same as the original string case folded. But there could be other definitions, just as complicated, and not necessarily more correct.
In any case, this can (and in my opinion should) be deferred for later. Start with the simple one that doesn't care about all these complexities, and then expand from there as the need is found.
ChrisA
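
A rough sketch of the definition given above, assuming a hypothetical function name cut_prefix_ignore_case. It tries every suffix of the original from longest to shortest and returns the first one that, appended to the prefix and case folded, equals the case-folded original. It is quadratic and only meant to make the definition concrete, not to suggest an implementation:

```python
def cut_prefix_ignore_case(text, prefix):
    """Return the longest suffix `rest` of `text` such that
    (prefix + rest).casefold() == text.casefold(); else return `text`."""
    folded = text.casefold()
    for start in range(len(text) + 1):  # smallest start gives the longest suffix
        rest = text[start:]
        if (prefix + rest).casefold() == folded:
            return rest
    return text  # the prefix does not match under case folding


simple = cut_prefix_ignore_case("Hello", "HE")
sharp_s = cut_prefix_ignore_case("STRASSE 5", "stra\u00dfe")  # "ß" folds to "ss"
no_match = cut_prefix_ignore_case("Hello", "xy")
print(repr(simple), repr(sharp_s), repr(no_match))
```

The second call shows a six-code-point prefix removing seven code points from the original, which is the kind of length mismatch that makes a simple slice by len(prefix) unsafe.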

On 3/23/20 4:33 AM, Chris Angelico wrote:
On Mon, Mar 23, 2020 at 7:06 PM Alex Hall alex.mojaki@gmail.com wrote:
I think I'm missing something, why is case insensitivity a mess?
Because there are many characters that case fold in strange ways. "ıIiİ".casefold() == 'ıiii̇' which means that lowercase dotless ı doesn't casefold to the same thing that uppercase dotless I. Some characters case fold to strings of different lengths, such as "ß" which casefolds to "ss". I haven't even tried what happens with combining characters vs combined characters. And Unicode case folding is already a simplified version of reality; what actual humans expect can be even more complicated, such as (I think) German case folding rules being different for names and for book titles, and the way that umlauted letters are case folded.
On the other hand, this might actually mean it's *better* to have a dedicated case-insensitive-cut-prefix operation. It would be difficult to define it in easy terms, but basically it should be such that the returned string (if not identical to the original) is the longest suffix to the original string such that, if the returned string were appended to the prefix and the result case folded, it would be the same as the original string case folded. But there could be other definitions, just as complicated, and not necessarily more correct.
In any case, this can (and in my opinion should) be deferred for later. Start with the simple one that doesn't care about all these complexities, and then expand from there as the need is found.
The issue is that cases in Unicode are difficult, and can be locale dependent (Unicode calls this Tailoring).
In the above example with the i-s, casefold would have needed to be told that we were dealing with the Turkish language (or maybe some other language with the same issue), but currently the Python casefold function doesn't support the needed tailoring (and I don't know if there is an exhaustive listing somewhere of the needed tailorings).
Fully handling Unicode so as to meet all national expectations is VERY difficult. It doesn't surprise me that the Python standard library doesn't attempt to get it totally right, but settles for just dealing with the 'default' processing. The biggest part of this mess is that Unicode had to accept some compromises in its definition (because the languages themselves present problems and inconsistencies), and when you hit a spot where the compromise goes against what you are trying to do at the moment, it gets difficult.
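
To illustrate the 'default' processing, a short sketch of what Python's built-in case mappings do with the four i-s; the variable names are assumptions made for the example. None of these calls takes a locale, so the Turkish pairing of I with ı and of İ with i is not applied:

```python
DOTLESS_SMALL_I = "\u0131"   # ı
DOTTED_CAPITAL_I = "\u0130"  # İ

lower_of_capital_i = "I".lower()             # "i", where Turkish expects ı
lower_of_dotted = DOTTED_CAPITAL_I.lower()   # "i" followed by combining dot above
upper_of_dotless = DOTLESS_SMALL_I.upper()   # "I"
upper_of_small_i = "i".upper()               # "I", where Turkish expects İ

print(ascii(lower_of_capital_i), ascii(lower_of_dotted),
      ascii(upper_of_dotless), ascii(upper_of_small_i))
```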

On Mon, Mar 23, 2020 at 10:40 PM Richard Damon Richard@damon-family.org wrote:
On 3/23/20 4:33 AM, Chris Angelico wrote:
On Mon, Mar 23, 2020 at 7:06 PM Alex Hall alex.mojaki@gmail.com wrote:
I think I'm missing something, why is case insensitivity a mess?
Because there are many characters that case fold in strange ways.
The issue is that cases in Unicode are difficult, and can be locale dependent (Unicode calls this Tailoring).
Right, which is why for a proposal like this, it's best to start with the simple and straight-forward option of case sensitivity and precise matching. Removing a prefix of "a\u0301" will not remove a leading "\xe1" and vice versa (just as those two strings don't compare equal).
ChrisA

On Mar 23, 2020, at 04:51, Chris Angelico rosuav@gmail.com wrote:
Right, which is why for a proposal like this, it's best to start with the simple and straight-forward option of case sensitivity and precise matching. Removing a prefix of "a\u0301" will not remove a leading "\xe1" and vice versa (just as those two strings don't compare equal).
Agreed, but I think it’s not just “to start with”, but forever, or at least as long as Python strings are sequences of Unicode code points. If "Café".startswith("Cafe\u0301") is false, "Café".stripprefix("Cafe\u0301") had better not strip anything. And as long as "é" in "Cafe\u0301" and any(ch=="é" for ch in "Cafe\u0301") are false, startswith is correct.
By comparison, in Swift, "Café".hasPrefix("Cafe\u{0301}") is true, because "Cafe\u{0301}" is a sequence of four Unicode scalars, the fourth of which is 'é', as opposed to Python where it’s a sequence of five Unicode code points. And of course Swift also has a slew of methods to do things like localized vs. default case-insensitive equality, substring, etc. testing, none of which Python has, or should have, as long as its strings are made of code points rather than scalars (or EGCs or whatever).
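
A small sketch of the Python side of this comparison, using explicit escapes so the composed and decomposed spellings are unambiguous (the variable names are assumptions for the example):

```python
composed = "Caf\u00e9"     # 4 code points, ends in U+00E9
decomposed = "Cafe\u0301"  # 5 code points, "e" followed by U+0301

same_text = composed == decomposed
has_prefix = composed.startswith(decomposed)
contains_e_acute = "\u00e9" in decomposed
any_e_acute = any(ch == "\u00e9" for ch in decomposed)

print(len(composed), len(decomposed))
print(same_text, has_prefix, contains_e_acute, any_e_acute)
```

All four comparisons are false, so a prefix-removal method built on startswith would leave the string untouched here.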

On Tue, Mar 24, 2020 at 6:31 AM Andrew Barnert abarnert@yahoo.com wrote:
On Mar 23, 2020, at 04:51, Chris Angelico rosuav@gmail.com wrote:
Right, which is why for a proposal like this, it's best to start with the simple and straight-forward option of case sensitivity and precise matching. Removing a prefix of "a\u0301" will not remove a leading "\xe1" and vice versa (just as those two strings don't compare equal).
Agreed, but I think it’s not just “to start with”, but forever, or at least as long as Python strings are sequences of Unicode code points. If "Café".startswith("Cafe\u0301") is false, "Café".stripprefix("Cafe\u0301") had better not strip anything. And as long as "é" in "Cafe\u0301" and any(ch=="é" for ch in "Cafe\u0301") are false, startswith is correct.
By comparison, in Swift, "Café".hasPrefix("Cafe\u{0301}") is true, because "Cafe\u{0301}" is a sequence of four Unicode scalars, the fourth of which is 'é', as opposed to Python where it’s a sequence of five Unicode code points. And of course Swift also has a slew of methods to do things like localized vs. default case-insensitive equality, substring, etc. testing, none of which Python has, or should have, as long as its strings are made of code points rather than scalars (or EGCs or whatever).
Maybe this would be something for the locale or unicodedata module?
ChrisA

Folks,
This is now a draft PEP, and being (has been?) discussed on python-dev -- time to go there if you want more input.
https://mail.python.org/archives/list/python-dev@python.org/thread/WFEWPAOVX...
-CHB
On Mon, Mar 23, 2020 at 12:37 PM Chris Angelico rosuav@gmail.com wrote:
On Tue, Mar 24, 2020 at 6:31 AM Andrew Barnert abarnert@yahoo.com wrote:
On Mar 23, 2020, at 04:51, Chris Angelico rosuav@gmail.com wrote:
Right, which is why for a proposal like this, it's best to start with the simple and straight-forward option of case sensitivity and precise matching. Removing a prefix of "a\u0301" will not remove a leading "\xe1" and vice versa (just as those two strings don't compare equal).
Agreed, but I think it’s not just “to start with”, but forever, or at
least as long as Python strings are sequences of Unicode code points. If "Café".startswith("Cafe\u0301") is false, "Café".stripprefix("Cafe\u0301") had better not strip anything. And as long as "é" in "Cafe\u0301" and any(ch=="é" for ch in "Cafe\u0301") are false, startswith is correct.
By comparison, in Swift, "Café".hasPrefix("Cafe\u{0301}") is true,
because "Cafe\u{0301}" is a sequence of four Unicode scalars, the fourth of which is 'é', as opposed to Python where it’s a sequence of five Unicode code points. And of course Swift also has a slew of methods to do things like localized vs. default case-insensitive equality, substring, etc. testing, none of which Python has, or should have, as long as its strings are made of code points rather than scalars (or EGCs or whatever).
Maybe this would be something for the locale or unicodedata module?
ChrisA

Christopher Barker writes:
This is now a draft PEP, and being (has been?) discussed on python-dev -- time to go there if you want more input.
They don't, and it's an "idea" separate from the PEP. Python is not going to change the semantics of str from array of code points to array of characters. That's a Python 4000 change, and I doubt anyone who experienced Python 3000 will be on board.
Maybe this would be something for the locale or unicodedata module?
-1. It should be done (by somebody != me ;-) but not in stdlib.
Steve

On Mon, Mar 23, 2020 at 6:01 PM Stephen J. Turnbull < turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
This is now a draft PEP, and being (has been?) discussed on python-dev
--
time to go there if you want more input.
They don't, and it's an "idea" separate from the PEP.
Sure, then please start a new thread, or at least change the title.
-CHB

On Mar 23, 2020, at 12:40, Chris Angelico rosuav@gmail.com wrote:
On Tue, Mar 24, 2020 at 6:31 AM Andrew Barnert abarnert@yahoo.com wrote:
On Mar 23, 2020, at 04:51, Chris Angelico rosuav@gmail.com wrote:
Right, which is why for a proposal like this, it's best to start with the simple and straight-forward option of case sensitivity and precise matching. Removing a prefix of "a\u0301" will not remove a leading "\xe1" and vice versa (just as those two strings don't compare equal).
Agreed, but I think it’s not just “to start with”, but forever, or at least as long as Python strings are sequences of Unicode code points. If "Café".startswith("Cafe\u0301") is false, "Café".stripprefix("Cafe\u0301") had better not strip anything. And as long as "é" in "Cafe\u0301" and any(ch=="é" for ch in "Cafe\u0301") are false, startswith is correct.
By comparison, in Swift, "Café".hasPrefix("Cafe\u{0301}") is true, because "Cafe\u{0301}" is a sequence of four Unicode scalars, the fourth of which is 'é', as opposed to Python where it’s a sequence of five Unicode code points. And of course Swift also has a slew of methods to do things like localized vs. default case-insensitive equality, substring, etc. testing, none of which Python has, or should have, as long as its strings are made of code points rather than scalars (or EGCs or whatever).
Maybe this would be something for the locale or unicodedata module?
Maybe. But a complete suite of functions for treating strings as made of Unicode scalars or EGCs or whatever seems like a lot of design work, and I don’t know if there’s enough demand for anyone to be willing to do it. Swift is a different story for a lot of reasons (brand new language, iterator model that makes “non-randomly-accessible sequence” a sensible thing, corporate team, a much worse status quo ante where strings were sequences of UTF-16 code units, a need to interface natively with Cocoa and its NFKD decomposed strings, …). For Python, it seems like if nobody’s put anything (other than thin wrappers around ICU) on PyPI, probably nobody needs support in the stdlib.

On 3/23/20 3:31 PM, Andrew Barnert via Python-ideas wrote:
On Mar 23, 2020, at 04:51, Chris Angelico rosuav@gmail.com wrote:
Right, which is why for a proposal like this, it's best to start with the simple and straight-forward option of case sensitivity and precise matching. Removing a prefix of "a\u0301" will not remove a leading "\xe1" and vice versa (just as those two strings don't compare equal).
Agreed, but I think it’s not just “to start with”, but forever, or at least as long as Python strings are sequences of Unicode code points. If "Café".startswith("Cafe\u0301") is false, "Café".stripprefix("Cafe\u0301") had better not strip anything. And as long as "é" in "Cafe\u0301" and any(ch=="é" for ch in "Cafe\u0301") are false, startswith is correct.
By comparison, in Swift, "Café".hasPrefix("Cafe\u{0301}") is true, because "Cafe\u{0301}" is a sequence of four Unicode scalars, the fourth of which is 'é', as opposed to Python where it’s a sequence of five Unicode code points. And of course Swift also has a slew of methods to do things like localized vs. default case-insensitive equality, substring, etc. testing, none of which Python has, or should have, as long as its strings are made of code points rather than scalars (or EGCs or whatever).
I wasn't familiar with the term Scalar as used in Unicode so I looked it up, and I think you are incorrect here. From the Glossary:
Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
Thus scalars ARE just code points (excluding the surrogate code points). What you may be thinking of is the grapheme. It may be that Swift does some automatic conversion to a canonical form, to make the strings match. In fact, the text displaying as Café doesn't tell you how many code points there are, as the glyph/grapheme é can be expressed either as the single code point \u00E9 (NFC) or as the sequence e \u0301 (NFD), and Python can express it as either.
A basic rule with Unicode strings is that if you are going to be doing these sorts of comparisons, you should make sure you have both strings in the same normal form.
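
A minimal sketch of that rule applied to prefix removal, assuming a hypothetical helper named remove_prefix_normalized. It uses unicodedata.normalize from the standard library to bring both strings to the same normal form before comparing, and it returns the normalized text rather than the original spelling:

```python
import unicodedata


def remove_prefix_normalized(text, prefix, form="NFC"):
    """Normalize both strings to `form`, then remove `prefix` if present."""
    text_n = unicodedata.normalize(form, text)
    prefix_n = unicodedata.normalize(form, prefix)
    if prefix_n and text_n.startswith(prefix_n):
        return text_n[len(prefix_n):]
    return text_n


# Decomposed text (e + U+0301) against a composed prefix (U+00E9)
rest = remove_prefix_normalized("Cafe\u0301 au lait", "Caf\u00e9")
# A plain "Cafe" prefix does not match once the accent is composed
kept = remove_prefix_normalized("Cafe\u0301", "Cafe")
print(repr(rest), ascii(kept))
```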
participants (11)
- Alex Hall
- Andrew Barnert
- Barry Scott
- Chris Angelico
- Christopher Barker
- Guido van Rossum
- Kyle Stanley
- Paul Moore
- Richard Damon
- Stephen J. Turnbull
- Steve Barnes