New explicit methods to trim strings
Following the discussion here (https://bugs.python.org/issue36410), I propose to add 3 new string methods: str.trim, str.ltrim, str.rtrim. Another option would be to change the API of the str.split method to work correctly with sequences.

In [1]: def ltrim(s, seq):
   ...:     return s[len(seq):] if s.startswith(seq) else s
   ...:
In [2]: def rtrim(s, seq):
   ...:     return s[:-len(seq)] if s.endswith(seq) else s
   ...:
In [3]: def trim(s, seq):
   ...:     return ltrim(rtrim(s, seq), seq)
   ...:
In [4]: s = 'mailto:maria@gmail.com'
In [5]: ltrim(s, 'mailto:')
Out[5]: 'maria@gmail.com'
In [6]: rtrim(s, 'com')
Out[6]: 'mailto:maria@gmail.'
In [7]: trim(s, 'm')
Out[7]: 'ailto:maria@gmail.co'
On Sun, Mar 24, 2019 at 7:43 PM Alex Grigoryev <evrial@gmail.com> wrote:
Following the discussion here I propose to add 3 new string methods: str.trim, str.ltrim, str.rtrim Another option would be to change API for str.split method to work correctly with sequences.
In [1]: def ltrim(s, seq): ...: return s[len(seq):] if s.startswith(seq) else s ...: [corresponding functions snipped]
You may need to clarify here one of two options: either ltrim accepts *only and precisely* a string, not an arbitrary sequence (as your parameter naming suggests); or it accepts a sequence, but with different semantics from your example. str.startswith also accepts a tuple of strings, and if the string starts with *any* of them, it returns True:

>>> "abcd".startswith(("ab", "qw", "12"))
True
Your simple one-liner would take the length of the tuple (3) and remove that many characters.

From the BPO discussion, I suspect you actually just want to use a single string here, but I could be wrong, especially with the suggestion to make str.split work with sequences; do you mean that you want to be able to split on any string in the sequence, or split arbitrary sequences, or something else?

(Another option here - since an email address won't usually contain a colon - would be to use s.replace("mailto:", "") to remove the prefix. Technically it IS valid and possible, but it's not something I see in the wild, so you're unlikely to break anyone's address by removing "mailto:" out of the middle of it.)

For complicated string matching and replacement work, you may need to reach for the 're' module. Yes, I'm aware that then you'll have two problems, but it's in the stdlib for a reason.

ChrisA
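To make the tuple pitfall concrete, here is a minimal sketch. ltrim is the one-liner from the proposal; ltrim_any is a hypothetical variant (not something proposed in the thread) that shows one possible meaning of "accepts a sequence", namely removing the first prefix that matches.

```python
def ltrim(s, seq):
    # The one-liner from the proposal, unchanged.
    return s[len(seq):] if s.startswith(seq) else s


def ltrim_any(s, prefixes):
    # Hypothetical variant: accept one string or a tuple of strings and
    # remove the first prefix that matches.
    if isinstance(prefixes, str):
        prefixes = (prefixes,)
    for prefix in prefixes:
        if s.startswith(prefix):
            return s[len(prefix):]
    return s


prefixes = ("ab", "qw", "12")
print(ltrim("abcd", prefixes))      # 'd': three characters removed, because len(prefixes) == 3
print(ltrim_any("abcd", prefixes))  # 'cd': only the matching prefix "ab" is removed
```

Whether a real method should accept tuples at all is exactly the question raised above; the sketch only shows that the two readings give different answers.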
I don't see what trim() is good for, but I know I've written ltrim() easily hundreds of times.

I propose naming them strip_prefix() and strip_suffix() and just skipping the one that does both sides, since it makes no sense to me.

Trim is generally a bad name because what is called strip() in Python is called trim() in other languages. This would be needlessly confusing.
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
Yeah, good idea about the names, because PHP's ltrim does the same as lstrip in Python. Normally I'd expect strip to behave as I proposed, not to take a string as a mask of characters, which is a rarer use case and confusing in some scenarios.
On Mar 24 2019, at 11:34 am, Anders Hovmöller <boxed@killingar.net> wrote:
I don't see what trim() is good for but I know I've written ltrim() hundreds of times easy.
I propose naming them strip_prefix() and strip_suffix() and just skip the one that does both sides since it makes no sense to me.
Trim is generally a bad name because what is called strip() in python is called trim() in other languages. This would be needlessly confusing.
On Sun, Mar 24, 2019 at 2:47 AM Alex Grigoryev <evrial@gmail.com> wrote:
Yeah good idea with names because php ltrim does the same as lstrip in python. Normally I'd expect strip to behave as I proposed, not like input a string as mask of characters, which is more rare use case and confusing in some scenarios.
I agree -- I actually wrote buggy code in a PyPI published package that incorrectly used strip() in this way, i.e. as I expected it to work. My bad for not reading the docs carefully and writing crappy tests (yes, there were tests -- shows you how meaningless 100% coverage is).

So +1 for some version of "remove exactly this substring from the left or right of a string".

I agree that the "either end" option is unlikely to be useful, and in the rare case you want it, you can call both.

I'll let others bikeshed on the name.

And this really is simple enough that I don't want to reach for regex's for it. That is, I'd write it by hand rather than mess with that.

-CHB

--
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
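The trap described above is easy to reproduce. The sketch below shows how str.lstrip and str.rstrip treat their argument as a set of characters, and contrasts that with two hypothetical helpers, strip_prefix and strip_suffix (named after the suggestion in this thread; they are not existing str methods here).

```python
s = "mailto:maria@gmail.com"

# str.lstrip treats its argument as a set of characters, not a prefix,
# so it keeps eating while the next character is one of "mailto:".
print(s.lstrip("mailto:"))        # 'ria@gmail.com'

# Same trap on the right-hand side: '.', 'p' and 'y' are all stripped.
print("happy.py".rstrip(".py"))   # 'ha'


def strip_prefix(s, prefix):
    # Hypothetical helper: remove exactly this substring from the left.
    return s[len(prefix):] if s.startswith(prefix) else s


def strip_suffix(s, suffix):
    # Hypothetical helper: remove exactly this substring from the right.
    return s[:len(s) - len(suffix)] if s.endswith(suffix) else s


print(strip_prefix(s, "mailto:"))        # 'maria@gmail.com'
print(strip_suffix("happy.py", ".py"))   # 'happy'
```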
And this really is simple enough that I don't want to reach for regex's for it. That is, I'd write it by hand rather than mess with that.
Well, with re.escape it's not messy at all:

    import re

    def trim_mailto(s):
        regex = re.compile("^" + re.escape("mailto:"))
        return regex.sub('', s)

Which literally means "if you have mailto: at the beginning, replace it with the empty string".

You could write an ltrim function in one line:

    def ltrim(s, x):
        return re.sub("^" + re.escape(x), '', s)

re.escape takes care of escaping special characters, so the regex re.escape(x) matches exactly the string x.
I think

    re.sub("^" + re.escape(x), '', s)

is a lot more messy and hard to read than

    s[len(prefix):] if s.startswith(prefix) else s

It's also roughly an order of magnitude slower.

/ Anders
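The speed claim can be checked with a rough timeit comparison. This is only a sketch: the function names ltrim_slice and ltrim_regex are made up for the comparison, and the absolute numbers (and the exact ratio) will vary with the machine and the Python version.

```python
import re
import timeit


def ltrim_slice(s, prefix):
    # Slicing version, as written by hand in this thread.
    return s[len(prefix):] if s.startswith(prefix) else s


def ltrim_regex(s, prefix):
    # Regex version; re.escape makes the prefix match literally.
    return re.sub("^" + re.escape(prefix), "", s)


s = "mailto:maria@gmail.com"

# Both versions agree on the result...
assert ltrim_slice(s, "mailto:") == ltrim_regex(s, "mailto:") == "maria@gmail.com"

# ...but not on cost. Timings are machine-dependent, so none are quoted here.
for func in (ltrim_slice, ltrim_regex):
    seconds = timeit.timeit(lambda: func(s, "mailto:"), number=100_000)
    print(f"{func.__name__}: {seconds:.3f}s for 100,000 calls")
```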
I agree, but I said I wouldn't choose to "mess" with regex, not that the resulting code would be messy.

There is a substantial cognitive load in working with regex -- they are another language. If you aren't familiar with them (I'm not) it would take some time, and googling, to find that solution. If you are familiar with them, you still need to import another module and end up with an arguably less readable and slower solution.

Python was designed from the beginning not to rely on regex for simple string processing, opting for a fairly full-featured set of string methods. These two simple methods fit well into that approach.

-CHB

--
Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 2019-03-24 08:42, Alex Grigoryev wrote:
Following the discussion here <https://bugs.python.org/issue36410> I propose to add 3 new string methods: str.trim, str.ltrim, str.rtrim Another option would be to change API for str.split method to work correctly with sequences.
In [1]: def ltrim(s, seq):
...: return s[len(seq):] if s.startswith(seq) else s
...:
This has a subtle bug:
In [2]: def rtrim(s, seq):
...: return s[:-len(seq)] if s.endswith(seq) else s
...:
If len(seq) == 0, then rtrim will return ''.

It needs to be:

    def rtrim(s, seq):
        return s[ : len(s) - len(seq)] if s.endswith(seq) else s
On 24Mar2019 18:39, MRAB <python@mrabarnett.plus.com> wrote:
On 2019-03-24 08:42, Alex Grigoryev wrote:
Following the discussion here <https://bugs.python.org/issue36410>
This has a subtle bug:
In [2]: def rtrim(s, seq):
...: return s[:-len(seq)] if s.endswith(seq) else s
...:
If len(seq) == 0, then rtrim will return ''.
It needs to be:
def rtrim(s, seq): return s[ : len(s) - len(seq)] if s.endswith(seq) else s
Or:

    return s[:-len(seq)] if seq and s.endswith(seq) else s

which I think is more readable.

For the record, like others, I suspect I've written ltrim/rtrim code many times. I'm +0.9 on the idea: it feels like a very common operation and as shown above rtrim at least is fairly easily miscoded. (I think most of my own situations were with strings I know are not empty, often literals, but that doesn't really detract.)

Like others I'm against the name "trim" itself because of PHP's homonym which means what "strip" means in Python (and therefore doesn't mean what "trim" is proposed to mean here). "clip"?

I'm +0.9 rather than +1 entirely because the operation feels so... trivial, which usually trips the "not everything needs a method" argument. But it is also very common.

Cheers,
Cameron Simpson <cs@cskk.id.au>
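The empty-suffix bug and the two fixes suggested above can be compared side by side. The function names rtrim_buggy, rtrim_length and rtrim_guard are labels chosen for this sketch only.

```python
def rtrim_buggy(s, seq):
    # Original one-liner: with seq == '' the slice is s[:-0], which is
    # s[:0], i.e. the empty string.
    return s[:-len(seq)] if s.endswith(seq) else s


def rtrim_length(s, seq):
    # First fix: compute the end index from the front of the string.
    return s[:len(s) - len(seq)] if s.endswith(seq) else s


def rtrim_guard(s, seq):
    # Second fix: skip the slice entirely when seq is empty.
    return s[:-len(seq)] if seq and s.endswith(seq) else s


s = "mailto:maria@gmail.com"
print(repr(rtrim_buggy(s, "")))    # ''
print(repr(rtrim_length(s, "")))   # 'mailto:maria@gmail.com'
print(repr(rtrim_guard(s, "")))    # 'mailto:maria@gmail.com'
print(repr(rtrim_guard(s, "com"))) # 'mailto:maria@gmail.'
```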
On 3/24/19 6:45 PM, Cameron Simpson wrote:
Like others I'm against the name "trim" itself because of PHP's homonym which means what "strip" means in Python (and therefore doesn't mean what "trim" is proposed to mean here). "clip"?
I'm +0.9 rather than +1 entirely because the operation feels so... trivial, which usually trips the "not everything needs a method" argument. But it is also very common.
strip, trim, chop, chomp, clip, left, right, and various permutations with leading "l"s and "r"s. Is the "other" argument a character, a string, or a list of characters, or a list of strings, or a regex?

Argh. Maybe I use too many languages and I don't do enough string processing, but I always have to look this stuff up every time I use it, or I just write my own.

No, I don't have a solution, but matching or mismatching any particular language only makes sense if you happen to be familiar with that language's string functions. And then someone will fall into a trap because "their" language handles newlines and returns, or spaces and tabs, or some other detail completely differently.

That said, I'm all for more library functions, especially in cases like this that are easy to get wrong.
Instead of naming these operations, we could use '+' and '-', with semantics:

# Set the values of the variables.
>>> a = 'hello '
>>> b = 'world'
>>> c = 'hello world'

# Some values between the variables.
>>> a + b == c
True
>>> a == c - b
True
>>> b == -a + c
True

# Just like numbers except.
>>> a + b == b + a
False

This approach has both attractions and problems. And also decisions. The main issue, I think, comes to this. Suppose we have

a, A = ('a', -'a')
b, B = ('b', -'b')
a + A == A + a == ''
b + B == B + b == ''
A + '' == '' + A == A
B + '' == '' + B == B

together with unrestricted addition of a, A, b, B. Then we have what mathematicians call the free group on 2 letters, which is an enormous object. If you want the math, look at https://en.wikipedia.org/wiki/Free_group#Examples

We've made a big mistake, I think, if we allow Python programmers to accidentally encounter this free group. One way to look at this is that we want to cut the free group down to a useful size. One way is

>>> 'hello' - 'world' == 'hello' # I like to call this truncation.
True

Another way is

>>> 'hello' - 'world' # I like to call this subtraction.
ValueError: string s1 does not end with s2, so can't be subtracted

I hope this little discussion helps with naming things. I think this is enough for now.

--
Jonathan
On Mon, Mar 25, 2019 at 9:24 PM Jonathan Fine <jfine2358@gmail.com> wrote:
Instead of naming these operations, we could use '+' and '-', with semantics:
# Set the values of the variables.
>>> a = 'hello '
>>> b = 'world'
>>> c = 'hello world'

# Some values between the variables.
>>> a + b == c
True
>>> a == c - b
True
>>> b == -a + c
True
The semantics are rather underdefined here. What *exactly* does string subtraction do? Is a-b equivalent to a.replace(b, "") or something else?

Also.... you imply that it's possible to negate a string and then add it, but... what does a negative string look like? *confused*

ChrisA
Chris Angelico asked: what does a negative string look like?

This is a very good question. It looks a bit like a negative number.

>>> 2 + 2
4
>>> len('aa' + 'bb')
4
>>> len(-'bb')
-2 # Odd, I must confess.
>>> 5 + (-1)
4
>>> len('hello')
5
>>> len(-'o')
-1
>>> 'hello' + (-'o')
'hell'
>>> len('hello' + (-'o'))
4

Grade school: How can I possibly have -3 apples in my bag?
University: How can I possibly be overdrawn in my bank account?

Negative strings are similar to negative numbers except:
For numbers a + b == b + a
For strings a + b != b + a

It is the non-commuting that makes negative strings difficult. This is a bit like computer programming. It's not enough to have the correct lines of code (or notes). They also have to be put in the right order.

I hope this helps. I do this sort of math all the time, and enjoy it. Your experience may be different.

--
Jonathan
More on negative strings. They are easier if they only use one character.

Red Queen: What's one and one and one and one and one and one and one and one and one and one and one and one and one?
Alice: I don't know. I lost count.
Red Queen: She can't do arithmetic.

3 --> 'aaa'
2 --> 'aa'
1 --> 'a'
0 --> ''
-1 --> -'a'
-2 --> -'aa'
-3 --> -'aaa'

Negative strings are easier if we can rearrange the order of the letters. Like anagrams.

>>> ''.join(sorted('forty five'))
' effiortvy'
>>> ''.join(sorted('over fifty'))
' effiortvy'

Instead of counting (positively and negatively) just the letter 'a', we do the whole alphabet.

But when order matters, we get an enormous free group, which Python programmers might encounter by accident.

I hope this helps.

--
Jonathan
On 3/25/19 7:01 AM, Jonathan Fine wrote:
Chris Angelico asked: what does a negative string look like?
This is a very good question. It looks a bit like a negative number.
>>> 2 + 2 4 >>> len('aa' + 'bb') 4 >>> len(-'bb') -2 # Odd, I must confess. >>> 5 + (-1) 4 >>> len('hello') 5 >>> len(-'o') -1 >>> 'hello' + (-'o') 'hell' >>> len('hello' + (-'o')) 4
Grade school: How can I possible have -3 apples in my bag. University: How can I possibly be overdrawn in my bank account.
Negative strings are similar to negative numbers except: For numbers a + b == b + a For strings a + b != b + a
It is the non-commuting that make negative strings difficult. This is a bit like computer programming. It's not enough to have the correct lines of code (or notes). They also have to be put in the right order.
In the abstract, I believe I understand what Jonathan is saying, and in the concrete, I understand Chris's objection. Ridding a string of some of the graphemes from one end, or the other, or both, or elsewhere, is one or more different operations on the same underlying data type. We just went through this with dictionaries.

So what is "hello" - "world"?

"hello" because it doesn't end in "world"?
"hello" because it doesn't begin with "world"?
"he" because that's "hello" with all the graphemes also in "world" removed?
"he" because that's "hello" with all the graphemes also in "world" removed from the end?
"hello" because that's "hello" with all the graphemes also in "world" removed from the beginning?

And once we pick one of those results, what operator(s) produce the others and don't lead to Perl or APL?

And no matter how much Python I learn, I still can't divide by zero or by an empty string. ;-)
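Each of the five readings listed above corresponds to a different expression in today's Python. The sketch below spells them out with existing str operations; the variable names are chosen for illustration only.

```python
s, t = "hello", "world"

# 1. Remove t only if s ends with it.
as_suffix = s[:len(s) - len(t)] if s.endswith(t) else s
# 2. Remove t only if s begins with it.
as_prefix = s[len(t):] if s.startswith(t) else s
# 3. Remove every character of s that also occurs in t.
as_charset = "".join(ch for ch in s if ch not in t)
# 4. Remove characters that occur in t from the end only.
from_end = s.rstrip(t)
# 5. Remove characters that occur in t from the beginning only.
from_start = s.lstrip(t)

print(as_suffix, as_prefix, as_charset, from_end, from_start)
# hello hello he he hello
```

A single '-' operator would have to pick one of these, which is the ambiguity the questions above point at.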
Dan Sommers wrote:
So what it is "hello" - "world"?
If we were to implement the entire group, it would be an element that can't be written in any simpler form.

We could do that by representing a string as a sequence of signed substrings, and performing cancellations wherever possible during concatenation. But that would be a huge amount of machinery just to provide a cute notation for removing a prefix or suffix, with little in the way of other obvious applications.

--
Greg
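For the curious, the "signed substrings with cancellation" idea can be sketched in a few dozen lines. SignedString is a hypothetical class invented for this illustration (it works per character rather than per substring), and it is meant to show how much machinery is involved, not to suggest an implementation.

```python
class SignedString:
    """Hypothetical free-group element: a reduced list of (char, sign) pairs."""

    def __init__(self, text="", items=None):
        # sign is +1 for an ordinary character, -1 for its inverse.
        self.items = list(items) if items is not None else [(ch, 1) for ch in text]

    def __neg__(self):
        # The inverse reverses the order and flips every sign.
        return SignedString(items=[(ch, -sign) for ch, sign in reversed(self.items)])

    def __add__(self, other):
        # Concatenate, cancelling each adjacent character/inverse pair.
        result = list(self.items)
        for ch, sign in other.items:
            if result and result[-1] == (ch, -sign):
                result.pop()
            else:
                result.append((ch, sign))
        return SignedString(items=result)

    def __sub__(self, other):
        return self + (-other)

    def __eq__(self, other):
        return isinstance(other, SignedString) and self.items == other.items

    def __repr__(self):
        parts = [("" if sign > 0 else "-") + repr(ch) for ch, sign in self.items]
        return " ".join(parts) or "''"


a, b, c = SignedString("hello "), SignedString("world"), SignedString("hello world")
print(a + b == c)    # True
print(c - b == a)    # True
print(-a + c == b)   # True

# Nothing cancels here, so all ten signed characters remain.
print(SignedString("hello") - SignedString("world"))
```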
On 25/03/2019 12:01, Jonathan Fine wrote:
Chris Angelico asked: what does a negative string look like?
This is a very good question. It looks a bit like a negative number.
They really don't. Negative numbers are well defined in terms of being the additive inverse of natural numbers. String concatenation doesn't have a well-defined inverse, as you demonstrated by not actually trying to define it. It strikes me that following this line of reasoning is at best a category error. -- Rhodri James *-* Kynesim Ltd
Rhodri James wrote:
They really don't. Negative numbers are well defined in terms of being the additive inverse of natural numbers. String concatenation doesn't have a well-defined inverse,
In an earlier post I showed (assuming some knowledge of group theory) that for strings in the two letters 'a' and 'b', allowing negative strings gives rise to what mathematicians call the free group on 2 letters, which is an enormous object. If you want the math, look at https://en.wikipedia.org/wiki/Free_group#Construction [Except previously I linked to the wrong part of the page.]

Free groups are a difficult concept, usually introduced at post-graduate level. If you can tell me you understand that concept, I'm happy on that basis to explain how it provides string concatenation with a well-defined inverse.

--
Jonathan
On Mon, Mar 25, 2019 at 7:30 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
On 25/03/2019 12:01, Jonathan Fine wrote:
Chris Angelico asked: what does a negative string look like?
This is a very good question. It looks a bit like a negative number.
They really don't. Negative numbers are well defined in terms of being the additive inverse of natural numbers. String concatenation doesn't have a well-defined inverse, as you demonstrated by not actually trying to define it. It strikes me that following this line of reasoning is at best a category error.
I assume the whole proposal was a pastiche of the proposal to add a + operator for dictionaries. Jonathan needs to come clean before more people waste their time discussing this. -- --Guido van Rossum (python.org/~guido)
I think this is a terrible idea. I also think it's a mistake that Python uses + for string concatenation and * for string repeat. You end up with type errors far from the first place you could have had the crash! That ship has obviously sailed, but we shouldn't make even more mistakes in the same vein because we have some Stockholm syndrome with the current state of the language, or for a misplaced ideal of consistency.
Here, concisely, is my view of the situation and my preferences. Mostly, I won't give supporting arguments or evidence.

We can TRUNCATE either the PRE or the POST, and similarly SUBTRACT. SUBTRACT can raise a ValueError. TRUNCATE always returns a value.

Interactive examples (not tested)

>>> from somewhere import post_subtract
>>> sub_ed = post_subtract('ed')
>>> sub_ed('fred')
'fr'
>>> sub_ed('lead')
ValueError

Similarly

>>> trunc_ed = post_truncate('ed')
>>> trunc_ed('fred')
'fr'
>>> trunc_ed('lead')
'lead'

Can be 'combined into one'

>>> pre_truncate('app')('applet')
'let'
>>> pre_truncate('app')('paper')
'paper'

Possibly
1. Allow pre_truncate('app', 'applet'), perhaps with different spelling.
2. Allow '-' as a symbol for subtract. (Likely to be controversial.)

I'm not particularly attached to the names. But I definitely think
3. None of these are string methods. (So a pure Python implementation automatically backports.)
4. Encourage a 'two-step' process. This allows separation of concerns, and encourages good names.

Supporting argument. When we write pre_subtract(prefix, s) the prefix has a special meaning. For example, it's the header. So in one module define and test a routine remove_header. And in another module use remove_header. That way, the user of remove_header only needs to know the business purpose of the command. And the implementer needs to know only the value of the header.

If the specs change, and the implementer needs to use regular expressions, then this does not affect the user of remove_header.

I hope this helps. Maybe others would like to express their preferences.

--
Jonathan
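One way the two-step functions described above could be written is sketched below. The names post_subtract, post_truncate and pre_truncate come from the message; their bodies, the HEADER value and the sample input are assumptions made for illustration.

```python
def post_subtract(suffix):
    # SUBTRACT: raise if the suffix is not there.
    def subtract(s):
        if not s.endswith(suffix):
            raise ValueError(f"{s!r} does not end with {suffix!r}")
        return s[:len(s) - len(suffix)]
    return subtract


def post_truncate(suffix):
    # TRUNCATE: always return a value, unchanged if the suffix is absent.
    def truncate(s):
        return s[:len(s) - len(suffix)] if s.endswith(suffix) else s
    return truncate


def pre_truncate(prefix):
    def truncate(s):
        return s[len(prefix):] if s.startswith(prefix) else s
    return truncate


sub_ed = post_subtract("ed")
trunc_ed = post_truncate("ed")
print(sub_ed("fred"))                    # 'fr'
print(trunc_ed("lead"))                  # 'lead'
print(pre_truncate("app")("applet"))     # 'let'

# Two-step use: one module defines remove_header, another only calls it.
HEADER = "X-Example: "                   # hypothetical header value
remove_header = pre_truncate(HEADER)
print(remove_header("X-Example: payload"))   # 'payload'
```

Calling sub_ed('lead') raises ValueError, which is the only behavioural difference between the SUBTRACT and TRUNCATE flavours in this sketch.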
Earlier, Anders wrote: I propose naming them strip_prefix() and strip_suffix() and just skip the one that does both sides since it makes no sense to me. This is good, except I prefer subtract_prefix(a, b), truncate_suffix etc. And for the two step process prefix_subtractor(a)(b) etc. -- Jonathan
Earlier, Anders wrote: I propose naming them strip_prefix() and strip_suffix() and just skip the one that does both sides since it makes no sense to me.
This is good, except I prefer subtract_prefix(a, b), truncate_suffix etc. And for the two step process prefix_subtractor(a)(b) etc.
I don't understand the logic for "subtract". That's not a thing for non-numbers. If you don't think "strip" is good, then I suggest "remove". Or one could also consider "without" since we're talking about something that removes /if present/ (making subtract even worse! Subtract doesn't stop at zero). So "without_prefix()".
strip_prefix and strip_suffix are, I think, the best names of all, and they work perfectly with auto-completion. Common use case:

" mailto:maria@gmail.com".strip().strip_prefix("mailto:")
All of this would be well served by a 3rd party library on PyPI. Strings already have plenty of methods (probably too many). It would be nice to have a `stringtools` package to import a bunch of simple functions from.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
All of this would be well served by a 3rd party library on PyPI. Strings already have plenty of methods (probably too many). Having `stringtools` would be nice to import a bunch of simple functions from.
I respectfully disagree. This isn't JavaScript, where we are OK with millions of tiny dependencies. Python is batteries included and that's a great thing. This is just a tiny battery that was overlooked :)

/ Anders
On Mon, 25 Mar 2019 at 17:49, Anders Hovmöller <boxed@killingar.net> wrote:
All of this would be well served by a 3rd party library on PyPI. Strings already have plenty of methods (probably too many). Having `stringtools` would be nice to import a bunch of simple functions from.
I respectfully disagree. This isn't javascript where we are OK with millions of tiny dependencies. Python is batteries included and that's a great thing. This is just a tiny battery that was overlooked :)
While batteries included is a very good principle (and one I've argued for strongly in the past) it's also important to remember that Python is a mature language, and the days of being able to assume that "most people" will be on a recent version are gone.

Adding these functions to the stdlib would mean that *only* people using Python 3.8+ would have access to them (and in particular, library authors wouldn't be able to use them until they drop support for all versions older than 3.8). Having the functions as an external library makes them accessible to *every* Python user.

As with everything, it's a trade-off. IMO, in this case the balance is in favour of a 3rd party library (at least initially - it's perfectly possible to move the library into the stdlib later if it becomes popular).

Paul
All of this would be well served by a 3rd party library on PyPI. Strings already have plenty of methods (probably too many). Having `stringtools` would be nice to import a bunch of simple functions from.
I respectfully disagree. This isn't javascript where we are OK with millions of tiny dependencies. Python is batteries included and that's a great thing. This is just a tiny battery that was overlooked :)
While batteries included is a very good principle (and one I've argued for strongly in the past) it's also important to remember that Python is a mature language, and the days of being able to assume that "most people" will be on a recent version are gone.
It's much more true now than it has been in over a decade. People have largely moved away from python 2.7 and after that it's pretty easy to keep pace. There's a lag, but it's no longer decades.
Adding these functions to the stdlib would mean that *only* people using Python 3.8+ would have access to them (and in particular, library authors wouldn't be able to use them until they drop support for all versions older than 3.8). Having the functions as an external library makes them accessible to *every* Python user.
Sure. And if library authors want to support older versions they'll have to vendor this into their own code, just like always. This seems totally irrelevant to the discussion. And it's of course irrelevant to all the end users that aren't writing libraries but are using python directly.
As with everything, it's a trade-off. IMO, in this case the balance is in favour of a 3rd party library (at least initially - it's perfectly possible to move the library into the stdlib later if it becomes popular).
Putting it in a library virtually guarantees it will never become popular. And because we are talking about new methods on str, a library that monkey patches two new methods onto str won't become popular for obvious reasons. Plus it's actually impossible:
    >>> str.foo = 1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: can't set attributes of built-in/extension type 'str'
So this can't really be moved into the standard library or be implemented by a library in a nice way. / Anders
Anders Hovmöller writes:
Sure. And if library authors want to support older versions they'll have to vendor this into their own code,
You (indirectly) argue below that they can't, as a reason for including the change. You can't have it both ways.
just like always. This seems totally irrelevant to the discussion. And it's of course irrelevant to all the end users that aren't writing libraries but are using python directly.
No, it's not "irrelevant". I wish we all would stop using that word, and trying to exclude others' arguments in this way. We are balancing equities here. We have a plethora of changes, on the one side taken by itself each of which is an improvement, but on the other taken as a group they greatly increase the difficulty of learning to read Python programs fluently. So we set a bar that the change must clear, and the ability of the change to clear it depends on the balance of equities. In this case, where it requires C support and is not possible to "from __future__", the fact that library maintainers can't use it until they drop support for past versions of Python weakens the argument for the change by excluding important bodies of code from using it.
Putting it in a library virtually guarantees it will never become popular.
Factually, you're wrong. Many libraries have moved from PyPI to the stdlib, often very quickly as they prove their worth in a deliberate test. Also, here "popular" has a special meaning. It doesn't mean millions of downloads. It means people say they like it in blogs, recommend it to others, and start to post to Python development channels saying how much it improves their code and posting examples of how it does so.
And because we are talking about new methods on str, a library that monkey patches on two new method on str won't become popular for obvious reasons [specifically, it's impossible].
This is a valid point. But it doesn't need to be a monkey patch. Note that decimal was introduced with no literal syntax and is quite useful and used. If this change is going to prove it's tall enough to ride the stdlib ride, using a constructor for a derived class rather than str literal syntax shouldn't be too big a barrier to judging popularity (accounting for the annoyance of a constructor). Alternatively, the features could be introduced using functions. Steve
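A minimal sketch of the "derived class" route mentioned above, next to the failing monkey patch. The class name CutStr and the method names cutprefix/cutsuffix are hypothetical, taken from the names floated in this thread rather than from any existing library.

```python
# Built-in types reject new attributes, so patching str directly fails.
try:
    str.cutprefix = lambda self, prefix: self
except TypeError as exc:
    print("cannot patch str:", exc)


class CutStr(str):
    """Hypothetical str subclass; the name and methods are illustrative only."""

    def cutprefix(self, prefix):
        # Remove the prefix once if present, otherwise return self unchanged.
        if prefix and self.startswith(prefix):
            # Slicing a str subclass yields a plain str, so re-wrap the result.
            return CutStr(self[len(prefix):])
        return self

    def cutsuffix(self, suffix):
        # Guard against an empty suffix: self[:-0] would be the empty string.
        if suffix and self.endswith(suffix):
            return CutStr(self[:-len(suffix)])
        return self


s = CutStr("mailto:maria@gmail.com")
print(s.cutprefix("mailto:"))   # maria@gmail.com
print(type(s.upper()))          # <class 'str'>
```

The last line shows the cost raised in the replies: inherited methods such as upper() hand back a plain str, so every string produced by other code has to be wrapped again before the new methods are available.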
Could we try to keep the discussion about the topic at hand? There are a broad set of considerations that apply to any change, but they don’t all apply equally to all proposals. The proposal at hand is to add two fairly straightforward methods to string. So:
We are balancing equities here. We have a plethora of changes, on the one side taken by itself each of which is an improvement, but on the other taken as a group they greatly increase the difficulty of learning to read Python programs fluently.
Unless the methods are really poorly named, then this will make them maybe a tiny bit more readable, not less. But tiny. So “irrelevant” may be appropriate here.
So we set a bar that the change must clear, and the ability of the change to clear it depends on the balance of equities.
Exactly: small impact, low bar.
In this case, where it requires C support and is not possible to "from __future__", the fact that library maintainers can't use it until they drop support for past versions of Python weakens the argument for the change by excluding important bodies of code from using it.
But there is no need for __future__ — it’s not a breaking change. It could be back ported to any version we want. Same as a __future__ import.
Putting it in a library virtually guarantees it will never become popular.
Factually, you're wrong.
I don’t think he is, and I made the same point earlier. He did not say that no PyPI libs become popular or are brought into the stdlib; he said that this particular proposal is not suited to that. Do you really think a lib with two (or a few, though no one has yet suggested any more) almost trivial string functions will gain any traction?
Note that decimal was introduced with no literal syntax and is quite useful and used.
But Decimal isn’t a float with a couple extra handy methods. If it didn’t provide significant extra functionality, no one would use it. And many strings in our code are created from other code, so then you’d need to wrap MySpecialString() around every function call that produced a string. Again, it’s not going to happen.
Alternatively, the features could be introduced using functions.
Better than a custom class, but still too awkward to bother with for this; see a previous post of mine for more detail. This proposal would provide a minor gain for an even more minor disruption. -CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
This proposal would provide a minor gain for an even more minor disruption.
I don't think that is correct. I think you are underestimating the gain and exaggerating the disruption :-)
Cutting a prefix or suffix from a string is a common task, and there is no obvious "battery" in the std lib available for it. And there is a long history of people mistaking strip() and friends as that battery. The problem is that it seems to work:
    py> "something.zip".rstrip(".zip")
    'something'
until it doesn't:
    py> "something.jpg".rstrip(".jpg")
    'somethin'
It is *very common* for people to trip over this and think they have found a bug: https://duckduckgo.com/?q=python+bug+in+strip
I would guesstimate that for every person who thinks that they found a bug, there are probably a hundred who trip over this and then realise their error without ever going public. I believe this is a real pain point for people doing string processing. I know it has bitten me once or twice.
The correct solution is a verbose statement:
    if string.startswith("spam"): string = string[:len("spam")]
which repeats itself (*two* references to the prefix being removed, *three* references to the string being cut). The expression form is no better:
    process(a, b, string[:len("spam")] if string.startswith("spam") else string, c)
and heaven help you if you need to cut from both ends. To make that practical, you really need a helper function. Now that's fine as far as it goes, but why do we make people re-invent the wheel over and over again? A pair of "cut" methods (cut prefix, cut suffix) fills a real need, and will avoid a lot of mistaken bug reports/questions.
As for the disruption, I don't see that this will cause *any* disruption at all, beyond bike-shedding the method names and doing an initial implementation. It is a completely backwards compatible change. Since we can't monkey-patch builtins, this isn't going to break anyone's use of str. Any subclasses of str which define the same methods will still work.
I've sometimes said in the past that any change will break *someone's* code, and so we should be risk-averse. I still stand by that, but we shouldn't be *so risk-averse* that we're paralysed. Breaking users' code is a cost, but there is also the uncounted opportunity cost of *not* adding this useful battery. If we don't add these new methods, how many hundreds of users over the next decade will we condemn to repeating the same old misuse of strip() that has been misused so often in the past? How much developer time will be wasted writing, and then closing, bug reports like this? https://bugs.python.org/issue5318
Inaction has costs too. I can only think of one scenario where this change might break someone's code:
- we decide on method names (let's say) lcut and rcut;
- somebody else already has a class with lcut and rcut;
- which does something completely different;
- and they use hasattr() to decide whether to call those methods, rather than isinstance:
    if hasattr(myobj, 'lcut'):
        print(myobj.lcut(1, 2, 3, 4))
    else:
        # do something else
- and they sometimes pass strings into this code.
In 3.7 and older, ordinary strings will take the second path. If we add these methods, they will take the first path. But the chance of this actually being more than a trivially small problem for anyone in real life is so small that I don't know why I even raise it. This isn't a minor disruption. It's a small possibility of a minor disruption to a tiny set of users who can fix the breakage easily.
The functionality is clear, meets a real need, is backwards compatible, and has no significant downsides. The only hard part is bikeshedding names for the methods: lcut rcut cutprefix cutsuffix ltrim rtrim prestrip poststrip etc.
Am I wrong about any of these statements? -- Steven
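A short sketch of the pitfall and of the kind of helper function the message says people keep re-inventing. The function names cutprefix and cutsuffix are hypothetical (borrowed from the candidate names above), and the guard against an empty argument is an added assumption, not something proposed in the thread.

```python
def cutprefix(s, prefix):
    # Drop the prefix once if s starts with it; the slice keeps everything
    # after the first len(prefix) characters.
    return s[len(prefix):] if prefix and s.startswith(prefix) else s


def cutsuffix(s, suffix):
    # The empty-suffix guard matters: s[:-0] is s[:0], i.e. the empty string.
    return s[:-len(suffix)] if suffix and s.endswith(suffix) else s


# rstrip() treats its argument as a set of characters, not as a substring.
print("something.zip".rstrip(".zip"))        # something
print("something.jpg".rstrip(".jpg"))        # somethin

# The helpers remove the exact substring, and only when it is present.
print(cutsuffix("something.jpg", ".jpg"))    # something
print(cutprefix("spam and eggs", "spam"))    # " and eggs" (leading space kept)
print(cutprefix("eggs and spam", "spam"))    # eggs and spam
```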
On 30Mar2019 12:37, Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
This proposal would provide a minor gain for an even more minor disruption.
I don't think that is correct. I think you are underestimating the gain and exaggerating the disruption :-)
Cutting a prefix or suffix from a string is a common task, and there is no obvious "battery" in the std lib available for it. And there is a long history of people mistaking strip() and friends as that battery. The problem is that it seems to work:
py> "something.zip".rstrip(".zip") 'something'
until it doesn't:
py> "something.jpg".rstrip(".jpg") 'somethin'
Yeah, this is a very common mistake. I don't think I've made it myself (not really sure why, except that I use strip a lot to remove whitespace so I don't think about the file extension thing for it). But I've seen people make this mistake. And personally I strip prefixes or suffixes from strings a lot and the "measure the suffix and get s[:-len(suffix)]" shuffle is tedious. Also I need to decode that shuffle in my head every time I see it _and_ debug it because in the file extension case I'm always concerned as to whether it gets the "." separator or not. With .cutsuffix('.foo') it is really obvious and unambiguous. Also, I'm curious - how often do people use strip() to strip stuff other than whitespace? It is rare or unknown for myself. So I am a data point for the individually small but common gain. [...adding a method to str is only going to break quite weird code...]
The functionality is clear, meets a real need, is backwards compatible, and has no significant downsides. The only hard part is bikeshedding names for the methods:
lcut rcut cutprefix cutsuffix ltrim rtrim prestrip poststrip etc.
Am I wrong about any of these statements?
I do not think so. I agree with everything you've said, anyway. For the shed: I'm a big -1 on ltrim and rtrim because of confusion with the VERY well known PHP trim function which does something else. I like lcut/rcut as succinct and reminiscent of the UNIX "cut" command. I like cutprefix and cutsuffix even more, as having similar heft as the startswith and endswith methods. I dislike prestrip and poststrip because of their similarity to strip, which like the PHP trim does something else. -1 here too. Cheers, Cameron Simpson <cs@cskk.id.au>
On Fri, Mar 29, 2019 at 6:38 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
This proposal would provide a minor gain for an even more minor disruption.
I don't think that is correct. I think you are underestimating the gain and exaggerating the disruption :-)
I am very confused - I made that statement in response to a post of yours (unless I got the attribution wrong) in which you seemed to be arguing against the proposal. But if I goaded you into making a strong case that I completely agree with, great! And for the record: I have put that very bug into production code. And I do use strip() with things other than white space, though never multiple characters (at least not on purpose). -CHB
On Sat, Mar 30, 2019 at 12:15:00AM -0700, Christopher Barker wrote:
On Fri, Mar 29, 2019 at 6:38 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
This proposal would provide a minor gain for an even more minor disruption.
I don't think that is correct. I think you are underestimating the gain and exaggerating the disruption :-)
I am very confused - I made that statement in response to a post of yours ( unless I got the attribution wrong) in which you seemed to be arguing against the proposal.
That was the other Steven, the one who spells his name with a PH instead of a V :-) -- Steven
Steven D'Aprano writes:
The correct solution is a verbose statement:
if string.startswith("spam"): string = string[:len("spam")]
This is harder to write than I thought! (The slice should be 'len("spam"):'.) But s/The/A/:
    string = re.sub("^spam", "", string)
And a slightly incorrect solution (unless you really do want to remove all spam, which most people do, but might not apply to "tooth"):
    string = string.replace("spam", "")
A pair of "cut" methods (cut prefix, cut suffix) fills a real need,
But do they, really? Do we really need multiple new methods to replace a dual-use one-liner, which also handles outfile = re.sub("\\.bmp$", ".jpg", infile) in one line? I concede that the same argument was made against startswith/endswith, and they cleared the bar. Python is a lot more complex now, though, and I think the predicates are more frequently useful.
and will avoid a lot of mistaken bug reports/questions.
That depends on analogies to other languages. Coming from Emacs, I'm not at all surprised that .strip takes a character class as an argument and strips until it runs into a character not in the class. Evidently others have different intuition. If that's from English, and they know about cutprefix/cutsuffix, yeah, they won't make the mistake. If it's from another programming language they know, or they don't know about cutprefix, they may just write "string.strip('.jpg')" without thinking about it and it (sometimes) works, then they report a bug when it doesn't. Remember, these folks are not understanding the docs, and very likely not reading them.
As for the disruption,
The word is "complexity". Where do you get "disruption" from?
code is a cost, but there is also the uncounted opportunity cost of *not* adding this useful battery.
Obviously some people think it's useful. Nobody denies that. The problem is *measuring* the opportunity cost of not having the battery, or the "usefulness" of the battery, as well as measuring the cost of complexity. Please stop caricaturing those who oppose the change as Luddites.
I can only think of one scenario where this change might break someone's code:
Again, who claimed it would break code?
The functionality is clear, meets a real need, is backwards compatible, and has no significant downsides. The only hard part is bikeshedding names for the methods:
lcut rcut cutprefix cutsuffix ltrim rtrim prestrip poststrip etc.
Am I wrong about any of these statements?
It's not obvious to me from the names that the startswith/endswith test is included in the method, although on reflection it would be weird if it wasn't. Still, I wouldn't be surprised to see if string.startswith("spam"): string.cutprefix("spam") in a new user's code. You're wrong about "no significant downsides," in the sense that that's the wrong criterion. The right criterion is "if we add a slew of features that clear the same bar, does the total added benefit from that set exceed the cost?" The answer to that question is not a trivial extrapolation from the question you did ask, because the benefits will increase approximately linearly in the number of such features, but the cost of additional complexity is generally superlinear. I also disagree they meet a real need, as explained above. They're merely convenient. And the bikeshedding isn't hard. In the list above, cutprefix/ cutsuffix are far and away the best.
On Sun, Mar 31, 2019 at 03:05:59AM +0900, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
The correct solution is a verbose statement:
if string.startswith("spam"): string = string[:len("spam")]
This is harder to write than I thought! (The slice should be 'len("spam"):'.) But s/The/A/:
string = re.sub("^spam", "", string)
Indeed, you're right that there can be other solutions, but whether they are "correct" depends on how one defines correct :-) I don't consider something that pulls in the heavy bulldozer of regexes to crack this peanut to be the right way to solve the problem, but YMMV. But for what it's worth, a regex solution is likely to be significantly slower -- see below.
And a slightly incorrect solution (unless you really do want to remove all spam, which most people do, but might not apply to "tooth"):
string = string.replace("spam", "")
Sorry, that's not "slightly" incorrect, that is completely incorrect, for precisely the reason you state: it replaces *all* matching substrings, not just the leading prefix. I don't see a way to easily use replace to implement a prefix cut. I suppose one might do:
    string = string[:-len(suffix)] + string[-len(suffix):].replace(suffix, '')
but I haven't tried it and it sure isn't what I would call easy or obvious.
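The expression above was posted untested, so here is a quick check of it wrapped in a function. Note that, as written, it acts on the end of the string, so it is a suffix cut rather than a prefix cut; the function name is hypothetical.

```python
def cutsuffix_via_replace(string, suffix):
    # Split the string at len(suffix) characters from the end, then run
    # replace() only on the tail so earlier occurrences are left untouched.
    return string[:-len(suffix)] + string[-len(suffix):].replace(suffix, '')


print(cutsuffix_via_replace("something.jpg", ".jpg"))   # something
print(cutsuffix_via_replace("a.jpg.b.jpg", ".jpg"))     # a.jpg.b
print(cutsuffix_via_replace("jpg.jpg.txt", ".jpg"))     # jpg.jpg.txt
print(cutsuffix_via_replace("ab", ".jpg"))              # ab
```

It does behave as a suffix cut on these inputs, which supports the point being made: the result is correct but neither easy nor obvious to read.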
A pair of "cut" methods (cut prefix, cut suffix) fills a real need,
But do they, really? Do we really need multiple new methods to replace a dual-use one-liner, which also handles
outfile = re.sub("\\.bmp$", ".jpg", infile)
Solutions based on regexes are far less discoverable:
- all those people who have reported "bugs" in lstrip() and rstrip() could have thought of using a regex instead but didn't;
- they involve reading what is effectively another programming language which uses cryptic symbols like "$" instead of words like "suffix".
We aren't the Perl community where regexes are the first hammer we reach for every time we need to drive a screw :-) I had to read your re.sub() call twice before I convinced myself that it only replaced a suffix. And we also have to deal with the case where we want to delete a substring containing metacharacters:
    # Ouch!
    re.sub(r'\\\.\$$', '', string)  # cut literal \.$ suffix
Additionally, a regex solution is likely to be slower than even a pure-Python solution, let alone a string method. On my computer, regexes are three times slower than a Python function:
    $ python3.5 -m timeit -s "import re" "re.sub('eese$', '', 'spam eggs cheese')"
    100000 loops, best of 3: 3.75 usec per loop
    $ python3.5 -m timeit -s "def rcut(suff, s): return s[:-len(suff)] if s.endswith(suff) else s" "rcut('eese', 'spam eggs cheese')"
    1000000 loops, best of 3: 1.22 usec per loop
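For readers who do reach for a regex, a hedged sketch of how the metacharacter problem is usually sidestepped: re.escape() builds the escaped pattern, and \Z anchors at the very end of the string. The function name cutsuffix_re is hypothetical, and the choice of \Z instead of "$" is an assumption about what a suffix cut should do with a trailing newline.

```python
import re


def cutsuffix_re(string, suffix):
    # re.escape() quotes any regex metacharacters in the literal suffix.
    # \Z matches only at the end of the string, whereas "$" also matches
    # just before a trailing newline.
    return re.sub(re.escape(suffix) + r"\Z", "", string)


print(cutsuffix_re("photo.bmp", ".bmp"))      # photo
print(cutsuffix_re(r"path\.$", r"\.$"))       # path
print(cutsuffix_re("photo.bmpx", ".bmp"))     # photo.bmpx

# The "$" anchor behaves differently when the string ends with a newline.
print(repr(re.sub(r"\.bmp$", "", "photo.bmp\n")))     # 'photo\n'
print(repr(cutsuffix_re("photo.bmp\n", ".bmp")))      # 'photo.bmp\n'
```

This removes the need to hand-escape the pattern, but it does not address the discoverability or speed objections raised above.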
in one line? I concede that the same argument was made against startswith/endswith, and they cleared the bar. Python is a lot more complex now, though, and I think the predicates are more frequently useful.
and will avoid a lot of mistaken bug reports/questions.
That depends on analogies to other languages.
I don't think it matters that much. Of course it doesn't help if you come to Python from a language where strip() deletes a prefix or suffix, but even if you don't, as I don't, there's something about the pattern: string = string.lstrip("spam") which looks like it ought to remove a prefix rather than a set of characters. I've fallen for that error myself.
Coming from Emacs, I'm not at all surprised that .strip takes a character class as an argument and strips until it runs into a character not in the class.
And neither am I... until I forget, and get surprised that it doesn't work that way. This post is already too long, so in the interest of brevity and my dignity I'll skip the anecdote about the time I too blundered publicly about the "bug" in [lr]strip.
Evidently others have different intuition. If that's from English, and they know about cutprefix/cutsuffix, yeah, they won't make the mistake. If it's from another programming language they know, or they don't know about cutprefix, they may just write "string.strip('.jpg')" without thinking about it and it (sometimes) works, then they report a bug when it doesn't. Remember, these folks are not understanding the docs, and very likely not reading them.
It's not reasonable to hold the failure of the proposed new methods to prevent *all* erroneous uses of [lr]strip against them. Short of breaking backwards compatibility and changing strip() to remove_characters_from_a_set_not_a_substring_read_the_docs_before_reporting_any_bugs() there's always going to be *someone* who makes a mistake. But with an easily discoverable alternative available, the number of such errors should plummet as people gradually migrate to 3.8 or above.
As for the disruption,
The word is "complexity". Where do you get "disruption" from?
If you had read the text I quoted before trimming it, you would have seen that it was from Chris Barker: On Fri, Mar 29, 2019 at 04:05:55PM -0700, Christopher Barker wrote:
This proposal would provide a minor gain for an even more minor disruption.
I try very hard to provide enough context that my comments are understandable, and I don't always succeed, but the reader has to meet me part way by at least skimming the quoted text for context before questioning me :-)
code is a cost, but there is also the uncounted opportunity cost of *not* adding this useful battery.
Obviously some people think it's useful. Nobody denies that.
Well, further on you do question whether it meets a real need, so there is at least one :-)
The problem is *measuring* the opportunity cost of not having the battery, or the "usefulness" of the battery, as well as measuring the cost of complexity.
We have never objectively measured these things before, because they can't be. We don't even have a good, objective measurement of complexity of the language -- but if we did, I'm pretty sure that adding a pair of fairly simple, self-explanatory methods to the str class would not increase it by much. We're on steadier ground if we talk about complexity of the user's code. In that case, whether we measure the complexity of a program by lines of code or number of functions or some other more complicated measurement, it ought to be self-evident that being able to replace a helper function with a built-in will slightly reduce complexity. For the sake of the argument, if we can decrease the complexity of a thousand user programs by 1 LOC each, at the cost of increasing the complexity of the interpreter by 100 LOC, isn't that a cost worth paying? I think it is.
Please stop caricaturing those who oppose the change as Luddites.
That's a grossly unjust misrepresentation of my arguments. Nothing I have said can be fairly read as a caricature of the opposing point, let alone as attacks on others for being Luddites. On the contrary: *twice* I have acknowledged that a level of caution about adding new features is justified. My argument is that in *this* case, the cost-benefit analysis falls firmly on the "benefit" side, not that any opposition is misguided. Whereas your attack on me comes perilously close to poisoning the well: "oh, pay no attention to his arguments, he is the sort of person who caricatures those who disagree as Luddites".
I can only think of one scenario where this change might break someone's code:
Again, who claimed it would break code?
Any addition of a new feature has the risk of breaking code, and we ought to consider that possibility. [...]
It's not obvious to me from the names that the startswith/endswith test is included in the method, although on reflection it would be weird if it wasn't.
Agreed. We can't be completely explicit about everything, it isn't practical: math.trigonometric_sine_where_the_angle_is_measured_in_radians(x)
Still, I wouldn't be surprised to see
if string.startswith("spam"): string.cutprefix("spam")
in a new user's code.
That's the sort of inefficient code newbies often write, and the fix for that is experience and education. I'm not worried about that, just as I'm not worried about newbies writing: if string.startswith(" "): string = string.lstrip(" ")
You're wrong about "no significant downsides," in the sense that that's the wrong criterion. The right criterion is "if we add a slew of features that clear the same bar, does the total added benefit from that set exceed the cost?" The answer to that question is not a trivial extrapolation from the question you did ask, because the benefits will increase approximately linearly in the number of such features, but the cost of additional complexity is generally superlinear.
I disagree that the benefits of new features scale linearly. There's a certain benefit to having (say) str.strip, and a certain benefit of having (say) string slicing, and a certain benefit of having (say) str.upper, but being able to do *all three* is much more powerful than just being able to do one or another. And I have no idea about the "additional complexity" (of what? the language? the interpreter?) because we don't really have a good way of measuring complexity of a language.
I also disagree they meet a real need, as explained above. They're merely convenient.
I don't understand how you can question whether or not people need to cut prefixes and suffixes in the face of people writing code to cut prefixes and suffixes. (Sometimes *wrong* code.) We have had a few people on this list explicitly state that they cut prefixes and suffixes, there's the evidence of the dozens of people who misused strip() to cut prefixes and suffixes, and there's history of people asking how to do it:
https://stackoverflow.com/questions/599953/how-to-remove-the-left-part-of-a-...
https://stackoverflow.com/questions/16891340/remove-a-prefix-from-a-string
https://stackoverflow.com/questions/1038824/how-do-i-remove-a-substring-from...
https://codereview.stackexchange.com/questions/33817/remove-prefix-and-remov...
https://www.quora.com/Whats-the-best-way-to-remove-a-suffix-of-a-string-in-P...
https://stackoverflow.com/questions/3663450/python-remove-substring-only-at-...
This same question comes up time and time again, and you're questioning whether people need to do it.
Contrast to a hypothetical suggested feature which doesn't meet a real need (or at least nobody has suggested one as yet): Jonathon Fine's suggestion that we define a generalised "string subtraction" operator. Jonathon explained that this is well-defined within the bounds of free groups and category theory. That's great, but being well-defined is only the first step. What would we use a generalised string subtraction for? What need does it meet? There are easy cases:
    "abcd" - "d"    # remove a suffix
    -"a" + "abcd"   # remove a prefix
but in the full generality, it isn't clear what "abcd" - "z" would be useful for. Lacking a use-case for full string subtraction, we can reject adding it as a builtin feature or even a stdlib module.
And the bikeshedding isn't hard. In the list above, cutprefix/ cutsuffix are far and away the best.
Well I'm glad we agree on that, even if nothing else :-) -- Steven
On Sun, Mar 31, 2019 at 3:44 PM Steven D'Aprano <steve@pearwood.info> wrote:
Of course it doesn't help if you come to Python from a language where strip() deletes a prefix or suffix, but even if you don't, as I don't, there's something about the pattern:
string = string.lstrip("spam")
which looks like it ought to remove a prefix rather than a set of characters. I've fallen for that error myself.
I think it will be far less confusing once there are parallel functions for prefix/suffix removal. Actually, this is an argument in favour of matching that pattern; if people see lstrip() and lcut() as well as rstrip() and rcut(), it's obvious that they are similar methods, and you can nip over to the docs to check which one you want ("oh right, that makes sense, cut snips off a word but strip takes off a set of letters"). But even if it's called cutprefix/cutsuffix (to match hasprefix/hassuffix... oh wait, I mean startswith/endswith), there's at least a _somewhat_ better chance that people will grab the right tool. Plus, it'll be easy to deal with the problems when they come up - "hey, strip has a weird bug" :: "ah, you want cut instead".
If we're bikeshedding the actual method names, I think it would be good to have a list of viable options. A quick skim through the thread gives me these:
* cut_prefix/cut_suffix
* strip_prefix/strip_suffix
* cut_start/cut_end
* Any of the above with the underscore removed
* lcut/rcut
* ltrim/rtrim (and maybe trim)
* truncate (end only, no from-start equivalent)
Of them, I think cutprefix/cutsuffix (no underscore) and lcut/rcut are the strongest contenders, but that's just my opinion. Have I missed anyone's favourite spelling? Is there a name that parallels startswith/endswith?
Regardless of the method name, IMO the functions should accept a tuple of test strings, as startswith/endswith do. That's a feature that can't easily be spelled in a one-liner. (Though stacked suffixes shouldn't all be removed - "asdf.jpg.png".cutsuffix((".jpg", ".png")) should return "asdf.jpg", not "asdf".) ChrisA
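A minimal sketch of the tuple-accepting behaviour described in the last paragraph, written as a free function since str itself cannot be extended. The name cutsuffix and the rule "try the candidates in the order given" are assumptions for illustration; the thread has not settled either.

```python
def cutsuffix(s, suffixes):
    # Accept one string or a tuple of strings, mirroring str.endswith().
    if isinstance(suffixes, str):
        suffixes = (suffixes,)
    for suffix in suffixes:
        if suffix and s.endswith(suffix):
            # Return after the first hit so stacked suffixes are not all removed.
            return s[:-len(suffix)]
    return s


print(cutsuffix("asdf.jpg.png", (".jpg", ".png")))   # asdf.jpg
print(cutsuffix("asdf.jpg.png", ".png"))             # asdf.jpg
print(cutsuffix("asdf.txt", (".jpg", ".png")))       # asdf.txt
```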
On Sun, Mar 31, 2019 at 04:48:36PM +1100, Chris Angelico wrote:
Regardless of the method name, IMO the functions should accept a tuple of test strings, as startswith/endswith do. That's a feature that can't easily be spelled in a one-liner. (Though stacked suffixes shouldn't all be removed - "asdf.jpg.png".cutsuffix((".jpg", ".png")) should return "asdf.jpg", not "asdf".)
There's a slight problem with that: what happens if more than one suffix matches? E.g. given: "musical".lcut(('al', 'ical')) should the suffix "al" be removed, leaving "music"? (First match wins.) Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.) I don't think we can decide which is better, and I'm not keen on a keyword argument to choose one or the other, so I suggest we stick to the 90% solution of only supporting a single suffix. We can always revisit that in the future. -- Steven
On Sun, 31 Mar 2019 at 11:36, Steven D'Aprano <steve@pearwood.info> wrote:
There's a slight problem with that: what happens if more than one suffix matches? E.g. given:
"musical".lcut(('al', 'ical'))
should the suffix "al" be removed, leaving "music"? (First match wins.)
Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.)
I think you should choose "First match wins", because in this case you can make "Longest match wins" as `"musical".lcut(tuple(sorted(('al, 'ical'))))`. But if you choose "Longest match wins" there is no chance to achieve "First match wins" behaviour. with kind regards, -gfg
On Sun, 31 Mar 2019 at 11:45, Kirill Balunov <kirillbalunov@gmail.com> wrote:
On Sun, 31 Mar 2019 at 11:36, Steven D'Aprano <steve@pearwood.info> wrote:
There's a slight problem with that: what happens if more than one suffix matches? E.g. given:
"musical".lcut(('al', 'ical'))
should the suffix "al" be removed, leaving "music"? (First match wins.)
Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.)
I think you should choose "First match wins", because in this case you can get "Longest match wins" with `"musical".lcut(tuple(sorted(('al', 'ical'))))`. But if you choose "Longest match wins" there is no way to achieve "First match wins" behaviour.
with kind regards, -gfg
Sorry, it should have been `.rcut` instead of `.lcut`, i.e. `"musical".rcut(tuple(sorted(('al', 'ical'))))`, in the first place. I would prefer the names `.lstrip` and `.rstrip` instead of the *cut versions. with kind regards, -gdg
Sorry one more time (it is early morning and I should drink a cup of coffee first before posting here). Of course it should be `tuple(sorted(('al', 'ical'), key=len, reverse=True))`, I hope it was obvious to everyone from the very beginning.
with kind regards, -gdg
On Sun, 31 Mar 2019 at 11:53, Kirill Balunov <kirillbalunov@gmail.com> wrote:
On Sun, 31 Mar 2019 at 11:45, Kirill Balunov <kirillbalunov@gmail.com> wrote:
On Sun, 31 Mar 2019 at 11:36, Steven D'Aprano <steve@pearwood.info> wrote:
There's a slight problem with that: what happens if more than one suffix matches? E.g. given:
"musical".lcut(('al', 'ical'))
should the suffix "al" be removed, leaving "music"? (First match wins.)
Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.)
I think you should choose "First match wins", because in this case you can get "Longest match wins" with `"musical".lcut(tuple(sorted(('al', 'ical'))))`. But if you choose "Longest match wins" there is no way to achieve "First match wins" behaviour.
with kind regards, -gfg
Sorry, it should have been `.rcut` instead of `.lcut`, i.e. `"musical".rcut(tuple(sorted(('al', 'ical'))))`, in the first place. I would prefer the names `.lstrip` and `.rstrip` instead of the *cut versions.
with kind regards, -gdg
That's why strip_prefix(suffix) is a better name, can't double think.
On March 31 2019, at 11:56 am, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Mar 31, 2019 at 07:35:22PM +1100, Steven D'Aprano wrote:
"musical".lcut(('al', 'ical')) Oops, typo, I was thinking rcut and wrote lcut :-( -- Steven
Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On Sun, Mar 31, 2019 at 7:36 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Mar 31, 2019 at 04:48:36PM +1100, Chris Angelico wrote:
Regardless of the method name, IMO the functions should accept a tuple of test strings, as startswith/endwith do. That's a feature that can't easily be spelled in a one-liner. (Though stacked suffixes shouldn't all be removed - "asdf.jpg.png".cutsuffix((".jpg", ".png")) should return "asdf.jpg", not "asdf".)
There's a slight problem with that: what happens if more than one suffix matches? E.g. given:
"musical".lcut(('al', 'ical'))
should the suffix "al" be removed, leaving "music"? (First match wins.)
Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.)
I don't think we can decide which is better, and I'm not keen on a keyword argument to choose one or the other, so I suggest we stick to the 90% solution of only supporting a single suffix.
We can always revisit that in the future.
The only way there could be multiple independent matches is if one is a strict suffix of another (as in your example here). In most cases, this will require semantics at the control of the programmer, so I would say "first match wins" is the only sane definition (as it permits the programmer to order the cuttables to define the desired semantics). The overwhelming majority of use cases won't be affected by this decision, so first-wins won't hurt them. ChrisA
The only reason I would support the idea would be to allow multiple suffixes (or prefixes). Otherwise, it just does too little for a new method. But adding that capability of startswith/endswith makes the cut-off something easy to get wrong and non-trivial to implement. That said, I really like Brandt's idea of expanding the signature of .lstrip/.rstrip instead.
mystring.rstrip("abcd") # remove any of these single character suffixes
mystring.rstrip(('foo', 'bar', 'baz')) # remove any of these suffixes
Yes, the semantics of removals where one is a substring of another would need to be decided. As long as it's documented, any behavior would be fine. Most of the time the issue would be moot.
On Sun, Mar 31, 2019, 4:36 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Mar 31, 2019 at 04:48:36PM +1100, Chris Angelico wrote:
Regardless of the method name, IMO the functions should accept a tuple of test strings, as startswith/endwith do. That's a feature that can't easily be spelled in a one-liner. (Though stacked suffixes shouldn't all be removed - "asdf.jpg.png".cutsuffix((".jpg", ".png")) should return "asdf.jpg", not "asdf".)
There's a slight problem with that: what happens if more than one suffix matches? E.g. given:
"musical".lcut(('al', 'ical'))
should the suffix "al" be removed, leaving "music"? (First match wins.)
Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.)
I don't think we can decide which is better, and I'm not keen on a keyword argument to choose one or the other, so I suggest we stick to the 90% solution of only supporting a single suffix.
We can always revisit that in the future.
-- Steven
On 2019-03-31 16:48, David Mertz wrote:
The only reason I would support the idea would be to allow multiple suffixes (or prefixes). Otherwise, it just does too little for a new method. But adding that capability of startswith/endswith makes the cut off something easy to get wrong and non-trivial to implement.
That said, I really like Brandt's ideas of expanding the signature of .lstrip/.rstrip instead.
mystring.rstrip("abcd") # remove any of these single character suffixes
It removes _all_ of the single character suffixes.
mystring.rstrip(('foo', 'bar', 'baz')) # remove any of these suffixes
In keeping with the current behaviour, it would strip _all_ of these suffixes.
Yes, the semantics of removals where one is a substring of another would need to be decided. As long as it's documented, any behavior would be fine. Most of the time the issue would be moot.
On Sun, Mar 31, 2019, 4:36 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Mar 31, 2019 at 04:48:36PM +1100, Chris Angelico wrote:
Regardless of the method name, IMO the functions should accept a tuple of test strings, as startswith/endwith do. That's a feature that can't easily be spelled in a one-liner. (Though stacked suffixes shouldn't all be removed - "asdf.jpg.png".cutsuffix((".jpg", ".png")) should return "asdf.jpg", not "asdf".)
There's a slight problem with that: what happens if more than one suffix matches? E.g. given:
"musical".lcut(('al', 'ical'))
should the suffix "al" be removed, leaving "music"? (First match wins.)
Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.)
I don't think we can decide which is better, and I'm not keen on a keyword argument to choose one or the other, so I suggest we stick to the 90% solution of only supporting a single suffix.
We can always revisit that in the future.
On Sun, Mar 31, 2019 at 12:09 PM MRAB <python@mrabarnett.plus.com> wrote:
That said, I really like Brandt's ideas of expanding the signature of .lstrip/.rstrip instead.
mystring.rstrip("abcd") # remove any of these single character suffixes
It removes _all_ of the single character suffixes.
mystring.rstrip(('foo', 'bar', 'baz')) # remove any of these suffixes
In keeping with the current behaviour, it would strip _all_ of these suffixes.
Yes, the exact behavior would need to be documented. The existing case indeed removes *ALL* of the single letter suffixes. Clearly that behavior cannot be changed (nor would I want to, that behavior is useful). It's a decision about whether passing a tuple of substrings would remove all of them (perhaps repeatedly) or only one of them. And if only one, is it "longest wins" or "first wins." As I say, any choice of the semantics would be fine with me if it were documented... since this edge case will be uncommon in most uses (certainly in almost all of my uses). E.g.
basename = fname.rstrip(('.jpg', '.png', '.gif'))
This is rarely ambiguous, and does something concretely useful that I've coded many times. But what if:
fname = 'silly.jpg.png.gif.png.jpg.gif.jpg'
I'm honestly not sure what behavior would be useful most often for this oddball case. For the suffixes, I think "remove them all" is probably the best; that is consistent with thinking of the string passed in the existing signature of .rstrip() as an iterable of characters. But even if the decision was made to "only remove the single thing at end", I'd still find the enhancement useful. Sure, once in a while someone might trip over the choice of semantics in this edge case, but if it were documented, no big deal.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
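For reference, the existing single-string behaviour that both posters describe can be checked directly. The argument to str.rstrip() is treated as a set of characters, and every trailing character in that set is removed, which is why it misfires when used to cut a suffix:

```python
# rstrip() keeps removing trailing characters while they are in the set.
print("mississippi".rstrip("ip"))  # mississ

# Passing an extension strips characters, not the extension as a unit.
print("photo.png".rstrip(".png"))  # photo  (happens to look right)
print("song.jpg".rstrip(".jpg"))   # son    (the 'g' of 'song' goes too)
print("happy.py".rstrip(".py"))    # ha
```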
On 2019-03-31 17:17, David Mertz wrote:
On Sun, Mar 31, 2019 at 12:09 PM MRAB <python@mrabarnett.plus.com> wrote:
That said, I really like Brandt's ideas of expanding the signature of .lstrip/.rstrip instead.
mystring.rstrip("abcd") # remove any of these single character suffixes
It removes _all_ of the single character suffixes.
mystring.rstrip(('foo', 'bar', 'baz')) # remove any of these suffixes
In keeping with the current behaviour, it would strip _all_ of these suffixes.
Yes, the exact behavior would need to be documented. The existing case indeed removes *ALL* of the single letter suffixes. Clearly that behavior cannot be changed (nor would I want to, that behavior is useful).
It's a decision about whether passing a tuple of substrings would remove all of them (perhaps repeatedly) or only one of them. And if only one, is it "longest wins" or "first wins." As I say, any choice of the semantics would be fine with me if it were documented... since this edge case will be uncommon in most uses (certainly in almost all of my uses).
E.g.
basename = fname.rstrip(('.jpg', '.png', 'gif'))
This is rarely ambiguous, and does something concretely useful that I've coded many times. But what if:
fname = 'silly.jpg.png.gif.png.jpg.gif.jpg'
I'm honestly not sure what behavior would be useful most often for this oddball case. For the suffixes, I think "remove them all" is probably the best; that is consistent with thinking of the string passed in the existing signature of .rstrip() as an iterable of characters.
But even if the decision was made to "only remove the single thing at end", I'd still find the enhancement useful. Sure, once in a while someone might trip over the choice of semantics in this edge case, but if it were documented, no big deal.
Could/should we borrow from .replace, which accepts a replace count?
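One way to picture the count idea is the sketch below. rstrip_suffixes is a hypothetical function, not an existing or proposed API; the count parameter is modelled on str.replace, where a negative count means "no limit", and candidates are tried in the order given (an assumption).

```python
def rstrip_suffixes(s, suffixes, count=-1):
    """Strip matching suffixes from the right, at most `count` times.

    A negative count means no limit, as with str.replace().
    """
    removed = 0
    while count < 0 or removed < count:
        for suffix in suffixes:
            if suffix and s.endswith(suffix):
                s = s[:-len(suffix)]
                removed += 1
                break
        else:
            # No suffix matched on this pass, so there is nothing left to strip.
            break
    return s


fname = "silly.jpg.png.gif.png.jpg.gif.jpg"
exts = (".jpg", ".png", ".gif")
print(rstrip_suffixes(fname, exts))           # silly
print(rstrip_suffixes(fname, exts, count=1))  # silly.jpg.png.gif.png.jpg.gif
```

With this shape, "remove them all" and "only remove the single thing at the end" become two settings of one parameter rather than competing semantics.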
On 31Mar2019 19:35, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Mar 31, 2019 at 04:48:36PM +1100, Chris Angelico wrote:
Regardless of the method name, IMO the functions should accept a tuple of test strings, as startswith/endwith do.
I did not know that!
That's a feature that can't easily be spelled in a one-liner. (Though stacked suffixes shouldn't all be removed - "asdf.jpg.png".cutsuffix((".jpg", ".png")) should return "asdf.jpg", not "asdf".)
There's a slight problem with that: what happens if more than one suffix matches? E.g. given:
"musical".lcut(('al', 'ical'))
should the suffix "al" be removed, leaving "music"? (First match wins.)
Or should the suffix "ical" be removed, leaving "mus"? (Longest match wins.)
I don't think we can decide which is better, and I'm not keen on a keyword argument to choose one or the other, so I suggest we stick to the 90% solution of only supporting a single suffix.
This is easy to decide: first match wins. That is (a) simple and (b) very predictable for users. You can easily get longest-match behaviour from this by sorting the suffixes. The reverse does not hold. If anyone opposes my reasoning I can threaten them with my partner's story about Netscape proxy, where match rules were not processed in the order in the config file but by longest regexp pattern: yes, the longest regexp itself, nothing to do with what it matched. Config stupidity ensues. Do things in the order supplied: that way the user has control. Doing things by length is imposing policy which can't be circumvented. Cheers, Cameron Simpson <cs@cskk.id.au>
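The sorting argument can be made concrete. In this sketch rcut is a hypothetical first-match-wins helper; the caller recovers longest-match behaviour by ordering the suffixes, using the sort spelled out earlier in the thread.

```python
def rcut(s, suffixes):
    """First match wins: suffixes are tried in the order supplied."""
    for suffix in suffixes:
        if suffix and s.endswith(suffix):
            return s[:-len(suffix)]
    return s


suffixes = ("al", "ical")

# As supplied: "al" is tried first.
print(rcut("musical", suffixes))  # music

# The caller opts in to longest-match by sorting longest first.
longest_first = tuple(sorted(suffixes, key=len, reverse=True))
print(rcut("musical", longest_first))  # mus
```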
On 3/31/19 1:48 AM, Chris Angelico wrote:
If we're bikeshedding the actual method names, I think it would be good to have a list of viable options. A quick skim through the thread gives me these:
* cut_prefix/cut_suffix * strip_prefix/strip_suffix * cut_start/cut_end * Any of the above with the underscore removed * lcut/rcut * ltrim/rtrim (and maybe trim) * truncate (end only, no from-start equivalent)
Of them, I think cutprefix/cutsuffix (no underscore) and lcut/rcut are the strongest contenders, but that's just my opinion. Have I missed anyone's favourite spelling? Is there a name that parallels startswith/endswith?
without_prefix without_suffix They're a little longer, but IMO "without" helps reenforce the immutability of the underlying string. None of these functions actually remove part of the original string, but rather they return a new string that's the original string without some piece of it.
without_prefix without_suffix
They're a little longer, but IMO "without" helps reenforce the immutability of the underlying string. None of these functions actually remove part of the original string, but rather they return a new string that's the original string without some piece of it.
Which is the case for EVERY string method - we don't need to get all wordy for just these two.
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 31 Mar 2019, at 21:32, Christopher Barker <pythonchb@gmail.com> wrote:
without_prefix without_suffix
They're a little longer, but IMO "without" helps reenforce the immutability of the underlying string. None of these functions actually remove part of the original string, but rather they return a new string that's the original string without some piece of it.
Which is the case for EVERY string method - we don't need to get all wordy for just these two.
Agreed! Let’s not remake the mistakes of the past in order to try to keep consistency. / Anders
On Mon, Apr 01, 2019 at 12:02:25PM +1300, Greg Ewing wrote:
Dan Sommers wrote:
without_prefix without_suffix
They're a little longer, but IMO "without" helps reenforce the immutability of the underlying string.
We don't seem to worry about that distinction for other string methods, such as lstrip and rstrip.
Perhaps we ought to. In the spirit of today's date, let me propose renaming existing string methods to be more explicit, e.g.: str.new_string_in_uppercase str.new_string_with_substrings_replaced str.new_string_filled_to_the_given_length_with_zeroes_on_the_left str.new_string_with_character_translations_not_natural_language_translations The best thing is that there will no longer be any confusion as to whether you are looking at a Unicode string or a byte-string: a = a.new_string_trimmed_on_the_left() a = a.new_bytes_trimmed_on_the_left() *wink* -- Steven
On Mon, Apr 1, 2019 at 11:00 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Apr 01, 2019 at 12:02:25PM +1300, Greg Ewing wrote:
Dan Sommers wrote:
without_prefix without_suffix
They're a little longer, but IMO "without" helps reenforce the immutability of the underlying string.
We don't seem to worry about that distinction for other string methods, such as lstrip and rstrip.
Perhaps we ought to. In the spirit of today's date, let me propose renaming existing string methods to be more explicit, e.g.:
str.new_string_in_uppercase str.new_string_with_substrings_replaced str.new_string_filled_to_the_given_length_with_zeroes_on_the_left str.new_string_with_character_translations_not_natural_language_translations
Excellent! Love it. Add that to the feature list for Python 2.8. But for those of us still discussing the 3.x line, do we need to put together a PEP about this? There seems to be a lot of bikeshedding, a lot of broad support, and a small amount of "bah, don't need it, use regex/long expression/etc". Who wants to champion the proposal? Do we have a core dev who's interested in sponsoring it? ChrisA
Steven D'Aprano wrote:
The best thing is that there will no longer be any confusion as to whether you are looking at a Unicode string or a byte-string:
a = a.new_string_trimmed_on_the_left() a = a.new_bytes_trimmed_on_the_left()
To keep the RUE happy, instead of a "string" we should call it a "mathematically_valid_encoded_code_point_sequence". -- Greg
On Mon, 1 Apr 2019 at 01:01, Steven D'Aprano <steve@pearwood.info> wrote:
On Mon, Apr 01, 2019 at 12:02:25PM +1300, Greg Ewing wrote:
Dan Sommers wrote:
without_prefix without_suffix
They're a little longer, but IMO "without" helps reenforce the immutability of the underlying string.
We don't seem to worry about that distinction for other string methods, such as lstrip and rstrip.
Perhaps we ought to. In the spirit of today's date, let me propose renaming existing string methods to be more explicit, e.g.:
str.new_string_in_uppercase str.new_string_with_substrings_replaced str.new_string_filled_to_the_given_length_with_zeroes_on_the_left str.new_string_with_character_translations_not_natural_language_translations
The best thing is that there will no longer be any confusion as to whether you are looking at a Unicode string or a byte-string:
a = a.new_string_trimmed_on_the_left() a = a.new_bytes_trimmed_on_the_left()
*wink*
In order to support duck typing can I suggest that we also add bytes.new_string_trimmed_on_the_left() str.new_bytes_trimmed_on_the_left() These will do the obvious thing, so they do not need documenting. Obviously only needed for Python 3.x, as in 2.x people never use variables that may sometimes be strings and sometimes bytes. *wink* (although the joke is obvious, and so should not need documenting :-P) Paul
On 3/31/19 1:48 AM, Chris Angelico wrote:
* strip_prefix/strip_suffix
I don't like "strip" because .strip already has a different meaning, although the inclusion of prefix/suffix makes the intended semantics clear enough for the new methods. I wonder if it might make the semantics of .strip even harder to learn, though.
* cut_prefix/cut_suffix * cut_start/cut_end
Substitute "trim" or "crop" for "cut" in any of the above, because "cut" might mean "split". I don't think it's very important, and prefer "cut" because it will come early in an alphabetical list of public string methods (discoverability for the new methods).
* Any of the above with the underscore removed * lcut/rcut * ltrim/rtrim (and maybe trim) * truncate (end only, no from-start equivalent)
Dan Sommers writes:
without_prefix without_suffix
They're a little longer, but IMO "without" helps reenforce the immutability of the underlying string. None of these functions actually remove part of the original string, but rather they return a new string that's the original string without some piece of it.
I think this rationale is plausible but don't think it's important enough to justify the additional length over "cut". Another possibility to address this would be to use past tense: prefix_trimmed prefix_cut # I think this is awkward. but writing it out makes me think "nah". Regarding allowing a tuple argument, I don't see any reason not to take the "cut the first matching affix and return what's left" semantics, which is closely analogous to how startswith/endswith work. As long as the verb isn't "strip", of course. For me, this possibility puts the last nail in any variation on "strip". I don't see a good reason for the "longest match" variation, except the analogy to POSIX semantics for regexps, which seems pretty weak to me. Steve
Steven D'Aprano:
Stephen J. Turnbull:
And the bikeshedding isn't hard. In the list above, cutprefix/ cutsuffix are far and away the best.
Well I'm glad we agree on that, even if nothing else :-)
I prefer “strip_prefix” because of the analogy to strip() which doesn’t do anything if the characters aren’t present. Introducing a new word “cut” seems unnecessary and confusing and I’d wager it will increase the probability of:
if s.startswith('foo'): s = s.cutprefix('foo')
Obviously this is a guess! I also don’t understand why not using the underscore is preferable? It seems just to be poor form. / Anders
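The redundancy of that guard is easy to demonstrate. strip_prefix below is a hypothetical stand-in for whichever name is chosen; the assumption, shared by the one-liners at the start of the thread, is that it returns the string unchanged when the prefix is absent.

```python
def strip_prefix(s, prefix):
    """Hypothetical helper: drop `prefix` once if present, else return s as is."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


for s in ("foobar", "barfoo", ""):
    # The explicit startswith() check adds nothing: both spellings agree.
    guarded = strip_prefix(s, "foo") if s.startswith("foo") else s
    print(repr(s), repr(strip_prefix(s, "foo")), guarded == strip_prefix(s, "foo"))
```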
On Fri, 29 Mar 2019 at 23:07, Christopher Barker <pythonchb@gmail.com> wrote:
The proposal at hand is to add two fairly straightforward methods to string. So:
Some of what you are calling digressions are actually questioning the design choice behind that proposal. Specifically, there's no particular justification given for making these methods rather than standalone functions. But OK, let's stick to the points you want to make here.
We are balancing equities here. We have a plethora of changes, on the one side taken by itself each of which is an improvement, but on the other taken as a group they greatly increase the difficulty of learning to read Python programs fluently.
Unless the methods are really poorly named, then this will make them maybe a tiny bit more readable, not less. But tiny. So “irrelevant” may be appropriate here.
And how do we decide if they are poorly named, given that it's *very* hard to get real-world usage experience for a core Python change before it's released (essentially no-one uses pre-releases for anything other than testing that the release doesn't break their code). Note that the proposed name (trim) is IMO "poorly named", because a number of languages in my experience use that name for what Python calls "strip", so there would be continual confusion (for me at least) over which name meant which behaviour...
So we set a bar that the change must clear, and the ability of the change to clear it depends on the balance of equities.
Exactly — small impact, low bar.
If we accept your statement that it's a small impact. I contend that the confusion that this would cause between strip and trim is not small. It's not *huge*, but it's not small... We can agree to differ, but if we do then don't expect me to agree to your statement that the bar can be low, you need to persuade me to agree that the impact is low if you want me to agree on the matter of the bar.
In this case, where it requires C support and is not possible to "from __future__", the fact that library maintainers can't use it until they drop support for past versions of Python weakens the argument for the change by excluding important bodies of code from using it.
But there is no need for __future__ — it’s not a breaking change. It could be back ported to any version we want. Same as a __future__ import.
OTOH, it's a new feature, so it won't be acceptable for backporting. Sorry, but those are the rules. What "we" want isn't relevant here, unless the "we" in question is the Python core devs, and the core devs have established the "no backports of new features" rule over many years, and won't be likely to change it for something that you yourself are describing as "small impact". Paul
To me it is really surprising that a 28-year-old language has some weird methods like str.swapcase(), but none to cut a string from the left or right, and the two that do exist only accept a string mask.
On March 30 2019, at 12:21 pm, Paul Moore <p.f.moore@gmail.com> wrote:
On Fri, 29 Mar 2019 at 23:07, Christopher Barker <pythonchb@gmail.com> wrote:
The proposal at hand is to add two fairly straightforward methods to string. So:
Some of what you are calling digressions are actually questioning the design choice behind that proposal. Specifically, there's no particular justification given for making these methods rather than standalone functions. But OK, let's stick to the points you want to make here.
We are balancing equities here. We have a plethora of changes, on the one side taken by itself each of which is an improvement, but on the other taken as a group they greatly increase the difficulty of learning to read Python programs fluently.
Unless the methods are really poorly named, then this will make them maybe a tiny bit more readable, not less. But tiny. So “irrelevant” may be appropriate here. And how do we decide if they are poorly named, given that it's *very* hard to get real-world usage experience for a core Python change before it's released (essentially no-one uses pre-releases for anything other than testing that the release doesn't break their code).
Note that the proposed name (trim) is IMO "poorly named", because a number of languages in my experience use that name for what Python calls "strip", so there would be continual confusion (for me at least) over which name meant which behaviour...
So we set a bar that the change must clear, and the ability of the change to clear it depends on the balance of equities.
Exactly — small impact, low bar. If we accept your statement that it's a small impact. I contend that the confusion that this would cause between strip and trim is not small. It's not *huge*, but it's not small... We can agree to differ, but if we do then don't expect me to agree to your statement that the bar can be low, you need to persuade me to agree that the impact is low if you want me to agree on the matter of the bar.
In this case, where it requires C support and is not possible to "from __future__", the fact that library maintainers can't use it until they drop support for past versions of Python weakens the argument for the change by excluding important bodies of code from using it.
But there is no need for __future__ — it’s not a breaking change. It could be back ported to any version we want. Same as a __future__ import. OTOH, it's a new feature, so it won't be acceptable for backporting. Sorry, but those are the rules. What "we" want isn't relevant here, unless the "we" in question is the Python core devs, and the core devs have established the "no backports of new features" rule over many years, and won't be likely to change it for something that you yourself are describing as "small impact".
Paul
On Sat, 30 Mar 2019 at 10:29, Alex Grigoryev <evrial@gmail.com> wrote:
To me it is really surprising that a 28-year-old language has some weird methods like str.swapcase(), but none to cut a string from the left or right, and the two that do exist only accept a string mask.
As someone who was programming 28 years ago, I can confirm that the things that were "obviously useful" that long ago are vastly different from the things that are "obviously useful" today. Requirements evolve, use cases evolve, languages evolve. The good thing about general purpose languages like Python (as opposed to languages like SQL, which I use in my day job) is that you can easily handle new requirements by writing your own functions and utilities, which takes the pressure off the language design to keep up with every change in requirements and trends. The bad thing about it is that it's sometimes difficult to distinguish between significant improvements (which genuinely warrant language/stdlib changes) and insignificant ones (which can be handled by "write your own function"). Things like str.swapcase are a good example of that experience. It probably seemed like a useful little function at the time, not much overhead, maybe useful, people coming from C had something like this and found it helpful, so why not? But then Unicode came along, and there was a chunk of maintenance work needed to update swapcase. And there were probably bugs that got fixed. And as you point out, the function is probably barely ever used nowadays. So was it worth the effort invested in adding it, and maintaining it all those years? "It's only a simple addition of a straightforward string method". Is str.trim like str.swapcase, or like str.split? Who knows, at this point? The best any of us with the experience of seeing proposals like this come up regularly can do, is to push back, make the proposer justify the suggestion, try to make the proposer consider whether while his idea seems great right now, will it feel more like str.swapcase in a few years? And sometimes that pushback *is* too conservative, and an idea is good. 
But it still needs someone to implement it, document it, and integrate it into the language - the proposer isn't always able (or willing) to do that, so again there's a question of who does the work? In the case of str.swapcase, the "proposer" was probably the person implementing the str class, and so they did the work and it was very little extra to do. Nowadays the str class is a lot more complex, and Unicode rules are far less straightforward than ASCII was 28 years ago - so maybe now they wouldn't have bothered[1]. Sorry, that went a lot further than I originally intended - hopefully it's useful background, though. Paul [1] One thing I don't know (apologies if it's been answered earlier in the thread). Are you expecting to implement this change yourself (you'd need to know C, so it's perfectly OK if the answer is no), and if so, have you tried to do so? Personally, I don't have any feel for how complex the proposed new methods would be to implement, but I'd be much more willing to accept the word of someone who's written the code that it's a "simple change". How easy it is to implement isn't the whole story (as I mentioned above) but it is relevant.
On Sat, Mar 30, 2019 at 11:40:18AM +0000, Paul Moore wrote:
Is str.trim like str.swapcase, or like str.split? Who knows, at this point?
I think you are making a rhetorical point, but not a very good one. I think we all know, or at least *should* know, that this proposal is much closer to split than swapcase. Most of us have had to cut a prefix or a suffix from a string, often a file extension. It's not as common as, say, stripping whitespace, but it happens often enough. Stackoverflow and the mailing lists and even the bug tracker are full of people asking about "bugs" in str.[lr]strip because they've tried to use those methods to cut prefixes and suffixes, so we know that this functionality is needed far more often than swapcase, and is easy to get wrong. We can't say the same about swapcase. Even in Python 1.5, it was a gimmick. I can only think of a single use-case for it: "I've typed a whole lot of text without noticing that Caps Lock was on, so it looks like 'hELLO wORLD' by mistake." [...]
try to make the proposer consider whether while his idea seems great right now, will it feel more like str.swapcase in a few years? And sometimes that pushback *is* too conservative, and an idea is good. But it still needs someone to implement it, document it, and integrate it into the language - the proposer isn't always able (or willing) to do that, so again there's a question of who does the work?
And that's a very good point -- if there's no volunteer willing and able to do the work, even the best ideas can languish, sometimes for years. But that's not a reason to *reject* an idea. -- Steven
On Sat, Mar 30, 2019, 8:42 AM Steven D'Aprano <steve@pearwood.info> wrote:
Most of us have had to cut a prefix or a suffix from a string, often a file extension. It's not as common as, say, stripping whitespace, but it happens often enough.
I do this all the time! I never really thought about wanting a method though. I just spell it like this without much thought:
basename = fname.split(".ext")[0]
But I suppose a method would be helpful. If we have one, PLEASE no variation of 'trim' in the name. I still forget whether it's .lstrip() or .ltrim() or .stripl() or etc. after 20 years using Python. Lots of languages use trim for Python's strip, so having both with subtly different meanings is a bug magnet. One thing I love about .startswith() and .endswith() is matching multiple options. It's a little funny the multiple options must be a tuple exactly (not a list, not a set, not an iterator), but whatever. It would be a shame to lack that symmetry in the .cut_suffix() method. E.g. now:
if fname.endswith(('.jpg', '.png', '.gif')): ...
I'd expect to be able to do:
basename = fname.cut_suffix(('.jpg', '.png', '.gif'))
On 3/30/19 9:03 AM, David Mertz wrote:
On Sat, Mar 30, 2019, 8:42 AM Steven D'Aprano <steve@pearwood.info> wrote:
Most of us have had to cut a prefix or a suffix from a string, often a file extension. It's not as common as, say, stripping whitespace, but it happens often enough.
I do this all the time! I never really thought about wanting a method though. I just spell it like this without much thought:
basename = fname.split(".ext")[0]
This one also works until it doesn't:
basename = 'special.extensions.ext'.split(".ext")[0]
basename = 'food.pyramid.py'.split(".py")[0]
basename = 'build.classes.c'.split(".c")[0]
Safer is fname.rsplit('.ext', 1)[0]. There's always os.path.splitext. ;-)
Dan
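To make the failure concrete, here is a small sketch that runs the three file names above through each spelling. The helper names `naive_base` and `safer_base` are invented for this illustration; `os.path.splitext` is the standard-library function Dan mentions.

```python
import os.path

def naive_base(fname, ext):
    # Splits at the FIRST occurrence of ext, wherever it is.
    return fname.split(ext)[0]

def safer_base(fname, ext):
    # Splits at the LAST occurrence of ext, which is usually the real suffix.
    return fname.rsplit(ext, 1)[0]

cases = [
    ("special.extensions.ext", ".ext"),
    ("food.pyramid.py", ".py"),
    ("build.classes.c", ".c"),
]
for fname, ext in cases:
    root, found_ext = os.path.splitext(fname)
    print(fname, "->", naive_base(fname, ext), "|", safer_base(fname, ext), "|", root)

# rsplit is only "safer": it still cuts in the middle when the
# string does not actually end with the extension.
print(safer_base("notes.ext.bak", ".ext"))  # 'notes'
```

Note that neither split spelling checks that the extension is really at the end, which is the gap the proposed methods are meant to close.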
On 2019-03-30 13:03, David Mertz wrote:
On Sat, Mar 30, 2019, 8:42 AM Steven D'Aprano <steve@pearwood.info> wrote:
Most of us have had to cut a prefix or a suffix from a string, often a file extension. It's not as common as, say, stripping whitespace, but it happens often enough.
I do this all the time! I never really thought about wanting a method though. I just spell it like this without much thought:
basename = fname.split(".ext")[0]
But I suppose a method would be helpful. If we have one, PLEASE no variation of 'trim' in the name. I still forget whether it's .lstrip() or .ltrim() or .stripl() or etc. after 20 years using Python. Lots of languages use trim for Python's strip, so having both with subtly different meanings is a bug magnet.
One thing I love about .startswith() and .endswith() is matching multiple options. It's a little funny the multiple options must be a tuple exactly (not a list, not a set, not an iterator), but whatever. It would be a shame to lack that symmetry in the .cut_suffix() method.
E.g. now:
if fname.endswith(('.jpg', '.png', '.gif')): ...
I'd expect to be able to do:
basename = fname.cut_suffix(('.jpg', '.png', '.gif'))
I'd much prefer .lcut/.rcut to .cut_prefix/.cut_suffix, to match .lstrip/.rstrip.
On Mar 30, 2019, at 12:30, MRAB <python@mrabarnett.plus.com> wrote:
I'd much prefer .lcut/.rcut to .cut_prefix/.cut_suffix, to match .lstrip/.rstrip.
I agree that we should use either l/r or something to do with start/end. We already have two different ways to say left/start and right/end on the str methods; adding a third way to say left/start/prefix seems silly. But I don’t like cut. Or trim, or any of the other options. They’re all just synonyms for strip; there’s nothing about any of these synonyms that tells you, implies, or even helps you remember that strip takes an iterable of characters and cut takes a substring (or is that a substring or tuple of substrings?) instead of the other way around. A name like lstrip_string would solve that. But so would modifying lstrip to take keyword arguments:
s.lstrip("abc") # same as today
s.lstrip(chars="abc") # same as above
s.lstrip(string="abc") # the new functionality: only strip the whole thing
Also, this means that if we do want the substring-or-substrings thing, we don’t have to use the old tuple idiom; since it’s a keyword-only param, we could just as easily have a different keyword param that accepts any iterable of strings. So when you have a set of prefixes, you don’t have to call s.lcut(tuple(prefixes)), you just pass the set as-is to s.lstrip(strings=prefixes).
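A rough sketch of how the keyword idea could behave, written as a free function because str.lstrip itself accepts no such keywords. Everything here is an assumption made for illustration: the function, the one-keyword-at-a-time rule, and the choice to try the longest candidate first when several prefixes are given.

```python
def lstrip(s, chars=None, *, string=None, strings=None):
    """Hypothetical stand-in for the proposed keyword-taking str.lstrip."""
    given = [arg for arg in (chars, string, strings) if arg is not None]
    if len(given) > 1:
        raise TypeError("pass at most one of chars, string, strings")
    if string is not None:
        strings = (string,)
    if strings is not None:
        # Assumption: try the longest candidate first so the result does
        # not depend on the iteration order of a set.
        for prefix in sorted(strings, key=len, reverse=True):
            if prefix and s.startswith(prefix):
                return s[len(prefix):]
        return s
    # No keyword given: today's character-set behaviour.
    return s.lstrip(chars)

print(lstrip("abcabcX", "abc"))            # character set: strips every a/b/c
print(lstrip("abcabcX", string="abc"))     # whole prefix, once
print(lstrip("mailto:maria@gmail.com", strings={"http:", "mailto:"}))
```

Passing a set works directly here, which is the convenience the message argues for over the tuple-only idiom.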
Sorry, I thought I was replying to something from today, not a year ago. My mail client decided that all old messages I didn’t read at the time were suddenly brand new additions to the thread, and I didn’t look at the dates. :) Sent from my iPhone
On Mar 4, 2020, at 19:21, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
On Mar 30, 2019, at 12:30, MRAB <python@mrabarnett.plus.com> wrote:
I'd much prefer .lcut/.rcut to .cut_prefix/.cut_suffix, to match .lstrip/.rstrip.
I agree that we should use either l/r or something to do with start/end. We already have two different ways to say left/start and right/end on the str methods; adding a third way to say left/start/prefix seems silly.
But I don’t like cut. Or trim, or any of the other options. They’re all just synonyms for strip; there’s nothing about any of these synonyms that tells you, implies, or even helps you remember that strip takes an iterable of characters and cut takes a substring (or is that a substring or tuple of substrings?) instead of the other way around.
A name like lstrip_string would solve that.
But so would modifying lstrip to take keyword arguments:
s.lstrip("abc") # same as today
s.lstrip(chars="abc") # same as above
s.lstrip(string="abc") # the new functionality: only strip the whole thing
Also, this means that if we do want the substring-or-substrings thing, we don’t have to use the old tuple idiom; since it’s a keyword-only param, we could just as easily have a different keyword param that accepts any iterable of strings. So when you have a set of prefixes, you don’t have to call s.lcut(tuple(prefixes)), you just pass the set as-is to s.lstrip(strings=prefixes).
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/WKOFYD... Code of Conduct: http://python.org/psf/codeofconduct/
On Wed, Mar 4, 2020 at 8:13 PM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
Sorry, I thought I was replying to something from today, not a year ago.
Which is fine -- that conversation kind of petered out anyway, and I think reviving it is a fine idea. It sounds like you at least like the idea, so maybe it’s time to reframe the conversation into “how should we do it” rather than “should we do it”. Of course, in the end, a decision can’t be made without a specific implementation, but it’s good to know if it’ll even be considered. For my part, I prefer new methods over keyword parameters:
1) I don’t think any other string methods take keywords.
2) I believe Guido is credited with saying something like: it’s better to write two functions than have one function that takes a keyword argument that selects between two behaviors.
I wish I had the time/energy to spearhead this, but I honestly don’t. But if someone else does, I’d like to see it happen.
-CHB
My mail client decided that all old messages I didn’t read at the time were
suddenly brand new additions to the thread, and I didn’t look at the dates. :)
Sent from my iPhone
On Mar 4, 2020, at 19:21, Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
On Mar 30, 2019, at 12:30, MRAB <python@mrabarnett.plus.com> wrote:
I'd much prefer .lcut/.rcut to .cut_prefix/.cut_suffix, to match .lstrip/.rstrip.
I agree that we should use either l/r or something to do with start/end. We already have two different ways to say left/start and right/end on the str methods; adding a third way to say left/start/prefix seems silly.
But I don’t like cut. Or trim, or any of the other options. They’re all just synonyms for strip; there’s nothing about any of these synonyms that tells you, implies, or even helps you remember that strip takes an iterable of characters and cut takes a substring (or is that a substring or tuple of substrings?) instead of the other way around.
A name like lstrip_string would solve that.
But so would modifying lstrip to take keyword arguments:
s.lstrip("abc") # same as today
s.lstrip(chars="abc") # same as above
s.lstrip(string="abc") # the new functionality: only strip the whole thing
Also, this means that if we do want the substring-or-substrings thing, we don’t have to use the old tuple idiom; since it’s a keyword-only param, we could just as easily have a different keyword param that accepts any iterable of strings. So when you have a set of prefixes, you don’t have to call s.lcut(tuple(prefixes)), you just pass the set as-is to s.lstrip(strings=prefixes).
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Mar 5, 2020, at 08:27, Christopher Barker <pythonchb@gmail.com> wrote:
On Wed, Mar 4, 2020 at 8:13 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
Sorry, I thought I was replying to something from today, not a year ago.
Which is fine — that conversation kind of petered out anyway, and I think reviving it is a fine idea.
It sounds like you at least like the idea, so maybe it’s time to reframe the conversation into “how should we do it” rather than “should we do it”. Of course, in the end, a decision can’t be made without a specific implementation, but it’s good to know if it’ll even be considered.
Well, I like the idea if someone can come up with a good naming scheme—something that at least reminds me which function is the “set of chars” stripper and which the “substring” stripper, rather than just being two synonyms that people are going to get backward as often as they get them right (and even worse if they’re familiar with another popular language), and something that doesn’t gratuitously add yet another way to say l/left/start (like prefix). And, while I haven’t reread the entire old thread, I don’t think any of the suggestions so far qualified, and I don’t think my own suggestion is that great.
For my part, I prefer new methods over keyword parameters:
1) I don’t think any other string methods take keywords.
Sure, but they mostly go back to the days before keyword arguments existed at all, much less now when they can be implemented relatively easily even in C functions.
2) I believe Guido is credited with saying something like: it’s better to write two functions than have one function that takes a keyword argument that selects between two behaviors.
Yes, but it’s also better to have a function that takes two different keyword arguments than a function that takes one positional argument and type-switches on it to select between two behaviors. The string-or-tuple-of-strings idiom is as clunky as similar things like file-or-contents and file-or-pathname. Although for the case of startswith and friends, there’s another option that didn’t exist in Python 1.0 but does today: startswith(self, *strings), which allows you to write a.startswith(*set_of_prefixes) without caring that you have a set rather than a tuple and without danger of confusing a string vs. a collection of strings. When adding new code that really needs to be consistent with existing APIs, it makes sense to ape those APIs even if they’re clunky, but I’m not convinced that’s necessary here. But if so, fine. If we need to take a string or strings, and we need to be consistent with the rest of the str API and not take keywords, that sounds like a job for a new method, as long as that method actually implies or at least mnemonically helps with “string or strings rather than character set”, like my lstrip_string suggestion.
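A small sketch of the difference described above between the tuple-only idiom and a `*strings` signature. `startswith_any` is a hypothetical free function; the real str.startswith is only used in the ways it supports today.

```python
def startswith_any(s, *prefixes):
    """Hypothetical free function mirroring the startswith(self, *strings) idea."""
    # *prefixes arrives as a tuple, which is what str.startswith accepts today.
    return s.startswith(prefixes)

prefixes = {"http:", "mailto:"}
s = "mailto:maria@gmail.com"

try:
    s.startswith(prefixes)           # a set is rejected by the current API
    set_accepted = True
except TypeError:
    set_accepted = False

print(set_accepted)                          # the set was not accepted
print(s.startswith(tuple(prefixes)))         # today's workaround
print(startswith_any(s, *prefixes))          # the unpacking spelling
```

With unpacking, a lone string and a collection of strings can no longer be confused, since each prefix is its own argument.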
On Thu, Mar 05, 2020 at 12:45:28PM -0800, Andrew Barnert via Python-ideas wrote:
Well, I like the idea if someone can come up with a good naming scheme—something that at least reminds me which function is the “set of chars” stripper and which the “substring” stripper,
You've been a Python programmer for how many years now? Do you currently have trouble remembering what lstrip and rstrip do? If not, I doubt you will suddenly forget their meaning because we add a pair of new methods with quite different names. Unless, of course, we go down the path of a foolish consistency and choose names which are too similar:
    lcut
    lremove
    ldelete
in which case you are right, people would certainly confuse the two. So let's not do that :-)
This proposal isn't for a generic "substring" stripper. We already have that, it is called str.replace. This proposal is specifically for removing *prefixes and suffixes* not arbitrary substrings. Referencing the words "prefix" and "suffix" is not "gratuitously" adding "yet another way to say l/left/start" but a critical part of the methods' semantics.
If we were to describe what the proposed methods do, we surely would say something like these:
    cut the prefix
    delete the prefix
    remove the prefix
    strip the prefix
    trim the prefix
because that is what the method does: it cuts/deletes etc the *prefix*, not some arbitrary substring or set of characters. But we wouldn't say:
    cut the substring on the left
because "on the left" is not sufficient. How far on the left? Can the substring be anywhere in the left half of the string?
    "The red peppers are very red and fresh".cut_left_substring("red") => "The peppers are very red and fresh"
I hope that we can agree that some names are simply too long and cumbersome:
    str.cut_substring_all_the_way_on_left(prefix)
We already have `[l|r]strip` methods. If we want to associate the new methods with those, I suggest
    strip_prefix
    strip_suffix
which will show up right next to `strip` in the docs and state explicitly what they do. (Personally, I prefer to add a little more conceptual distance by calling the method "cut_" but I'm willing to accept "strip" if others like it.) I trust that is clear enough. If not, what else could "strip prefix" mean, if not strip the prefix?
Contrast:
    lstrip_string
"Yes, I know it operates on a string, it's a string method after all. But what does it do?"
The following is a tangential note about some Python history, and isn't really relevant to the proposal as such. So you can stop reading here without missing anything important.
[Christopher]
1) I don’t think any other string methods take keywords.
[Andrew]
Sure, but they mostly go back to the days before keyword arguments existed at all, much less now when they can be implemented relatively easily even in C functions.
Keywords predate string methods. String methods were only added in Python 2.0:
https://docs.python.org/2.0/lib/string-methods.html
In Python 1.5, strings had no methods:
>>> 'a'.upper()
Traceback (innermost last):
  File "<stdin>", line 1, in ?
AttributeError: 'string' object has no attribute 'upper'
and we used the "string" module functions. I don't see any builtins with keyword arguments in Python 1.5:
https://docs.python.org/release/1.5/lib/node26.html
but pure Python functions certainly had them. Since string functions were originally written in Python, they could have had keyword arguments had they been desired.
-- Steven
On 03/06/2020 04:03 PM, Steven D'Aprano wrote:
On Thu, Mar 05, 2020 at 12:45:28PM -0800, Andrew Barnert via Python-ideas wrote:
Well, I like the idea if someone can come up with a good naming scheme—something that at least reminds me which function is the “set of chars” stripper and which the “substring” stripper,
You've been a Python programmer for how many years now? Do you currently have trouble remembering what lstrip and rstrip do?
Speaking for myself, about 13 years. And, yes, I do occasionally forget that the strips are character based. I can easily imagine it's worse for polyglot programmers.
We already have `[l|r]strip` methods. If we want to associate the new methods with those, I suggest
strip_prefix strip_suffix
Works for me. Easy to add to bytes, too, if somebody is so inclined. -- ~Ethan~
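As a sketch of what the strip_prefix / strip_suffix pair being agreed on here might do, the free functions below remove the affix at most once and only at the very start or end. The names follow the thread, but the bodies are an assumption, not an agreed implementation. Because they rely only on startswith, endswith, len and slicing, the same code happens to work for bytes as well.

```python
def strip_prefix(s, prefix):
    """Sketch: remove `prefix` once if `s` starts with it, else return `s`."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s

def strip_suffix(s, suffix):
    """Sketch: remove `suffix` once if `s` ends with it, else return `s`."""
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s

text = "The red peppers are very red and fresh"
print(text.replace("red ", ""))         # replace removes it everywhere
print(strip_prefix(text, "red"))        # unchanged: "red" is not the prefix
print(strip_prefix(text, "The "))       # only the leading "The " goes

print(strip_prefix(b"mailto:maria", b"mailto:"))   # bytes work unchanged
```

This is the distinction drawn earlier in the thread: str.replace is the generic substring remover, while these only ever touch the two ends.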
I've defined functions like this in my own utility library and I use them all the time, so I think they're very useful and would like to see them built in. But I have two functions for each end:

def strip_optional_suffix(string, suffix):
    """
    >>> strip_optional_suffix('abcdef', 'def')
    'abc'
    >>> strip_optional_suffix('abcdef', '123')
    'abcdef'
    """
    if string.endswith(suffix):
        return string[:-len(suffix)]
    return string

def strip_required_suffix(string, suffix):
    """
    >>> strip_required_suffix('abcdef', 'def')
    'abc'
    >>> strip_required_suffix('abcdef', '123')
    Traceback (most recent call last):
    ...
    AssertionError: String ends with 'def', not '123'
    """
    if string.endswith(suffix):
        return string[:-len(suffix)]
    raise AssertionError('String ends with %r, not %r' % (string[-len(suffix):], suffix))

And I know that I use the required versions much more often, because usually if the suffix isn't there that indicates a bug somewhere that I need to know about. So I'd like to have both the optional and required versions implemented in this proposal, and if that's too much to add then just the required versions. But I suspect most people will be against the idea.
On 03/07/2020 05:31 AM, Alex Hall wrote:
I've defined functions like this in my own utility library and I use them all the time, so I think they're very useful and would like to see them built in. But I have two functions for each end:
def strip_optional_suffix(string, suffix): ...
def strip_required_suffix(string, suffix): ...
And I know that I use the required versions much more often, because usually if the suffix isn't there that indicates a bug somewhere that I need to know about. So I'd like to have both the optional and required versions implemented in this proposal, and if that's too much to add then just the required versions. But I suspect most people will be against the idea.
Against only having the required versions, yes. However, we do have other functions that take an extra parameter to determine whether to raise an exception, so perhaps: def strip_suffix(string, /, required=False): ... -- ~Ethan~
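One way the single-function variant could look, as a sketch only. The signature above elides the suffix argument, so the explicit `suffix` parameter, the ValueError and the handling of an empty suffix are all assumptions made for this example. The `/` marker needs Python 3.8 or later.

```python
def strip_suffix(string, suffix, /, required=False):
    """Sketch: optional by default, raising only when required=True."""
    if not suffix:
        # Every string "ends with" the empty string; nothing to remove.
        return string
    if string.endswith(suffix):
        return string[:-len(suffix)]
    if required:
        # Assumption: ValueError; the thread does not settle on an exception.
        raise ValueError(f"{string!r} does not end with {suffix!r}")
    return string

print(strip_suffix("abcdef", "def"))   # suffix present: removed
print(strip_suffix("abcdef", "123"))   # suffix absent: returned unchanged
try:
    strip_suffix("abcdef", "123", required=True)
except ValueError as exc:
    print("error:", exc)
```

Keeping `required` keyword-style at the call site makes the stricter behaviour visible to the reader, which is the usual argument for a flag over a second method.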
On Sat, Mar 7, 2020 at 7:06 AM Ethan Furman <ethan@stoneleaf.us> wrote:
On 03/07/2020 05:31 AM, Alex Hall wrote:
I've defined functions like this in my own utility library and I use them all the time, so I think they're very useful and would like to see them built in. But I have two functions for each end:
def strip_optional_suffix(string, suffix): ...
def strip_required_suffix(string, suffix): ...
And I know that I use the required versions much more often, because usually if the suffix isn't there that indicates a bug somewhere that I need to know about. So I'd like to have both the optional and required versions implemented in this proposal, and if that's too much to add then just the required versions. But I suspect most people will be against the idea.
Against only having the required versions, yes. However, we do have other functions that take an extra parameter to determine whether to raise an exception, so perhaps:
def strip_suffix(string, /, required=False): ...
Maybe. FWIW, I looked at what a few other languages offer, and found that in Go, they use Trim(s, chars) for our s.strip(chars), and they have separate TrimPrefix and TrimSuffix methods. That seems the best solution of the bunch, so I am now okay with using stripprefix and stripsuffix. (Why no _? Because startswith and endswith don't have one either.) -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On 07Mar2020 08:26, Guido van Rossum <guido@python.org> wrote:
Maybe. FWIW, I looked at what a few other languages offer, and found that in Go, they use Trim(s, chars) for our s.strip(chars), and they have separate TrimPrefix and TrimSuffix methods. That seems the best solution of the bunch, so I am now okay with using stripprefix and stripsuffix. (Why no _? Because startswith and endswith don't have one either.)
I'm somewhat against "strip" in the name, because Python's plain "strip" methods act like PHP and Go trim methods: they strip multiple characters, not fixed strings. My own preference (and personal library use) is cutprefix and cutsuffix. Cheers, Cameron Simpson <cs@cskk.id.au>
On 2020-03-07 23:01, Cameron Simpson wrote:
On 07Mar2020 08:26, Guido van Rossum <guido@python.org> wrote:
Maybe. FWIW, I looked at what a few other languages offer, and found that in Go, they use Trim(s, chars) for our s.strip(chars), and they have separate TrimPrefix and TrimSuffix methods. That seems the best solution of the bunch, so I am now okay with using stripprefix and stripsuffix. (Why no _? Because startswith and endswith don't have one either.)
I'm somewhat against "strip" in the name, because Python's plain "strip" methods act like PHP and Go trim methods: they strip multiple characters, not fixed strings.
My own preference (and personal library use) is cutprefix and cutsuffix.
Go's Trim strips multiple characters, but, as far as I can tell from the docs, TrimPrefix and TrimSuffix strip a single prefix/suffix string.
On 08Mar2020 00:17, MRAB <python@mrabarnett.plus.com> wrote:
On 2020-03-07 23:01, Cameron Simpson wrote:
I'm somewhat against "strip" in the name, because Python's plain "strip" methods act like PHP and Go trim methods: they strip multiple characters, not fixed strings.
My own preference (and personal library use) is cutprefix and cutsuffix.
Go's Trim strips multiple characters, but, as far as I can tell from the docs, TrimPrefix and TrimSuffix strip a single prefix/suffix string.
And right there is the confusion I'd rather avoid. I'd like the affix stuff to have a different name because it behaves differently. We've got strip, lstrip and rstrip with common "strip multiple characters" actions. I think fixed affixes are different enough to warrant a differently spelled name. I hope we can agree to at least avoid using the word "trim", for similar reasons. Cheers, Cameron Simpson <cs@cskk.id.au>
On Sat, Mar 7, 2020 at 4:36 PM Cameron Simpson <cs@cskk.id.au> wrote:
Go's Trim strips multiple characters, but, as far as I can tell from the docs, TrimPrefix and TrimSuffix strip a single prefix/suffix string.
And right there is the confusion I'd rather avoid. I'd like the affix stuff to have a different name because it behaves differently.
Different than Go? or than existing Python methods? I think it's pretty much impossible for Python to remain in sync with other languages, we might well give up on that.
We've got strip, lstrip and rstrip with common "strip multiple characters" actions. I think fixed affixes are different enough to warrant a differently spelled name.
That's why folks are suggesting "*prefix" and "*suffix" -- the prefix and suffix make it pretty clear that something different is going on. And I at least think it's pretty descriptive. I've lost track of what's on the table, but:
strip_prefix
strip_suffix
make it clear that it's something different than the other "strip"s, without gratuitously adding more words for the same thing. And it would show up alphabetically near "strip", which is also good. I'm going to try to refrain from further bike shedding this one -- I'd rather see something done than get my favorite names :-)
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 2020-03-08 01:51, Christopher Barker wrote:
On Sat, Mar 7, 2020 at 4:36 PM Cameron Simpson <cs@cskk.id.au> wrote:
Go's Trim strips multiple characters, but, as far as I can tell from the docs, TrimPrefix and TrimSuffix strip a single prefix/suffix string.
And right there is the confusion I'd rather avoid. I'd like the affix stuff to have a different name because it behaves differently.
Different than Go? or than existing Python methods? I think it's pretty much impossible for Python to remain in sync with other languages, we might well give up on that.
We've got strip, lstrip and rstrip with common "strip multiple characters" actions. I think fixed affixes are different enough to warrant a differently spelled name.
That's why folks are suggesting "*prefix" and "*suffix" -- the prefix and suffix make it pretty clear that something different is going on. And I at least think it's pretty descriptive. I've lost track of what's on the table, but:
strip_prefix strip_suffix
make it clear that it's something different than the other "strip"s, without gratuitously adding more words for the same thing. And it would show up alphabetically near "strip", which is also good.
I'm going to try to refrain from further bike shedding this one -- I'd rather see something done than get my favorite names :-)
Only bikeshedding I'd do at this point is to go with Guido's suggestion to omit the underscores because the other methods don't have any.
On 07Mar2020 17:51, Christopher Barker <pythonchb@gmail.com> wrote:
On Sat, Mar 7, 2020 at 4:36 PM Cameron Simpson <cs@cskk.id.au> wrote:
Go's Trim strips multiple characters, but, as far as I can tell from the docs, TrimPrefix and TrimSuffix strip a single prefix/suffix string.
And right there is the confusion I'd rather avoid. I'd like the affix stuff to have a different name because it behaves differently.
Different than Go? or than existing Python methods? I think it's pretty much impossible for Python to remain in sync with other languages, we might well give up on that.
No, "trim" stripping either multiple characters or a fixed string depending on the flavour of name. (And I'm against "trim" because of PHP's "trim" which is like our "strip".)
We've got strip, lstrip and rstrip with common "strip multiple characters" actions. I think fixed affixes are different enough to warrant a differently spelled name.
That's why folks are suggesting "*prefix" and "*suffix" -- the prefix and suffix make it pretty clear that something different is going on. And I at least think it's pretty descriptive. I've lost track of what's on the table, but:
strip_prefix strip_suffix
make it clear that it's something different than the other "strip"s, without gratuitously adding more words for the same thing. And it would show up alphabetically near "strip", which is also good.
Yeah, they do make it more clear. I like "cutprefix" and "cutsuffix" myself.
I'm going to try to refrain from further bike shedding this one -- I'd rather see something done than get my favorite names :-)
Me too. Cheers, Cameron Simpson <cs@cskk.id.au>
Consider that the start or end of a string may contain repetitions of an affix. Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter? [all modulo bikeshedding on the names of course] Rob Cliffe
Just the first occurrence. The vast majority of the time, that's what people want to do, and they will usually forget to add a 'count' parameter. Many people probably wouldn't even know it exists. It would be disastrous if code did the correct thing 99.9% of the time but occasionally silently mutilated a string.
On Wed, Mar 18, 2020 at 8:06 PM Rob Cliffe via Python-ideas <python-ideas@python.org> wrote:
Consider that the start or end of a string may contain repetitions of an affix.
Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter?
[all modulo bikeshedding on the names of course]
Rob Cliffe
Well, str.replace has a count parameter. Presumably people use it (even if by accidentally discovering that without it, it replaces all occurrences when they only wanted one replaced).
On 18/03/2020 18:44, Alex Hall wrote:
Just the first occurrence. The vast majority of the time, that's what people want to do, and they will usually forget to add a 'count' parameter. Many people probably wouldn't even know it exists. It would be disastrous if code did the correct thing 99.9% of the time but occasionally silently mutilated a string.
On Wed, Mar 18, 2020 at 8:06 PM Rob Cliffe via Python-ideas <python-ideas@python.org> wrote:
Consider that the start or end of a string may contain repetitions of an affix.
Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter?
[all modulo bikeshedding on the names of course]
Rob Cliffe
str.replace is in the opposite situation:
1. People usually want to replace everything. When I search for uses of replace in code, I can't find any uses of the count parameter.
2. People are more likely to accidentally notice that everything is being replaced. The string being replaced is much more likely to appear multiple times, because:
   1. It can appear anywhere in the string, whereas multiple consecutive prefixes are unlikely.
   2. It's often a single character, as opposed to something like a folder name.
3. I'm speculating here, but I think people are more likely to ask themselves "what if the string I want to replace appears multiple times" than "what if the prefix I want to remove appears multiple times in a row". That's a very edge edge case.
On Wed, Mar 18, 2020 at 8:58 PM Rob Cliffe <rob.cliffe@btinternet.com> wrote:
Well, str.replace has a count parameter. Presumably people use it (even if by accidentally discovering that without it, it replaces all occurrences when they only wanted one replaced). On 18/03/2020 18:44, Alex Hall wrote:
Just the first occurrence. The vast majority of the time, that's what people want to do, and they will usually forget to add a 'count' parameter. Many people probably wouldn't even know it exists. It would be disastrous if code did the correct thing 99.9% of the time but occasionally silently mutilated a string.
On Wed, Mar 18, 2020 at 8:06 PM Rob Cliffe via Python-ideas < python-ideas@python.org> wrote:
Consider that the start or end of a string may contain repetitions of an affix.
Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter?
[all modulo bikeshedding on the names of course]
Rob Cliffe
On 18 Mar 2020, at 18:03, Rob Cliffe via Python-ideas <python-ideas@python.org> wrote:
Consider that the start or end of a string may contain repetitions of an affix.
Should '-+-+-+Spam'.stripprefix('-+') remove just the first occurrence? All of them? Does it need a 'count' parameter?
The only ways to use this function without counting are to remove one prefix or to remove all of them. As Alex said, one prefix is the common case. For the remove-all case there are existing ways to do it. If you are counting the number of prefix occurrences that exist, you can simply slice the answer without the strip-prefix function. Barry
[all modulo bikeshedding on the names of course]
A mauvey shade of purple.
Rob Cliffe
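A minimal sketch of the two behaviours being contrasted here, remove one prefix versus remove all of them. The helper names `strip_prefix_once` and `strip_prefix_all` are hypothetical; nothing in the thread settles on names or on these exact semantics.

```python
def strip_prefix_once(s, prefix):
    """Remove one leading copy of prefix, if present (hypothetical helper)."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


def strip_prefix_all(s, prefix):
    """Remove every consecutive leading copy of prefix (hypothetical helper)."""
    # The truthiness check guards against an empty prefix,
    # which would otherwise loop forever.
    while prefix and s.startswith(prefix):
        s = s[len(prefix):]
    return s


s = "-+-+-+Spam"
print(strip_prefix_once(s, "-+"))  # -+-+Spam
print(strip_prefix_all(s, "-+"))   # Spam
```

The remove-all case needs only a small loop around the single-prefix case, which fits the argument that a count parameter is not essential.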
On 07Mar2020 15:31, Alex Hall <alex.mojaki@gmail.com> wrote:
I've defined functions like this in my own utility library and I use them all the time, so I think they're very useful and would like to see them built in. But I have two functions for each end:
    def strip_optional_suffix(string, suffix):
        """
        >>> strip_optional_suffix('abcdef', 'def')
        'abc'
        >>> strip_optional_suffix('abcdef', '123')
        'abcdef'
        """
        if string.endswith(suffix):
            return string[:-len(suffix)]
        return string
My utility library has them too, like this:

    def cutprefix(s, prefix):
        ''' Strip a `prefix` from the front of `s`.
            Return the suffix if `.startswith(prefix)`, else `s`.

            Example:

                >>> abc_def = 'abc.def'
                >>> cutprefix(abc_def, 'abc.')
                'def'
                >>> cutprefix(abc_def, 'zzz.')
                'abc.def'
                >>> cutprefix(abc_def, '.zzz') is abc_def
                True
        '''
        if prefix and s.startswith(prefix):
            return s[len(prefix):]
        return s

    def cutsuffix(s, suffix):
        ''' Strip a `suffix` from the end of `s`.
            Return the prefix if `.endswith(suffix)`, else `s`.

            Example:

                >>> abc_def = 'abc.def'
                >>> cutsuffix(abc_def, '.def')
                'abc'
                >>> cutsuffix(abc_def, '.zzz')
                'abc.def'
                >>> cutsuffix(abc_def, '.zzz') is abc_def
                True
        '''
        if suffix and s.endswith(suffix):
            return s[:-len(suffix)]
        return s

Like yours, they return the original object if unchanged. I think yours misbehaved if given an empty suffix, which mine special cases out:

    >>> 'abc'[:-0]
    ''

Cheers,
Cameron Simpson <cs@cskk.id.au>
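The empty-suffix pitfall can be seen side by side in a small sketch. The names `naive_cutsuffix` and `guarded_cutsuffix` are made up for this illustration; the guarded version uses the same check as `cutsuffix` above.

```python
def naive_cutsuffix(s, suffix):
    # No guard: an empty suffix always "matches" and len("") is 0,
    # so the slice becomes s[:-0], which is s[:0], the empty string.
    if s.endswith(suffix):
        return s[:-len(suffix)]
    return s


def guarded_cutsuffix(s, suffix):
    # Skip the slice entirely when the suffix is empty.
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s


print(repr(naive_cutsuffix("abc", "")))    # ''
print(repr(guarded_cutsuffix("abc", "")))  # 'abc'
```

The prefix side does not have this problem, because `s[len(""):]` is `s[0:]`, the whole string.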
On Sat, Mar 7, 2020 at 3:01 PM Cameron Simpson <cs@cskk.id.au> wrote:
Like yours, they return the original object if unchanged.
that makes me uncomfortable, but I guess as strings are immutable (and may be interned), why not? Do the other string methods return themselves if there is no change? As for the "required" idea -- an optional keyword-only parameter could be useful, but definitely not the default behavior. That would be incompatible with all the other string methods. Though personally, I think I'd still check that a string "follows the rules" I expect outside this function anyway. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Sun, Mar 8, 2020 at 10:10 AM Christopher Barker <pythonchb@gmail.com> wrote:
On Sat, Mar 7, 2020 at 3:01 PM Cameron Simpson <cs@cskk.id.au> wrote:
Like yours, they return the original object if unchanged.
that makes me uncomfortable, but I guess as strings are immutable (and may be interned), why not?
Do the other string methods return themselves if there is no change?
Yes. There's no reason not to; it's more efficient to return the same string, and perfectly safe to do so. Not every method returns itself when there's no change, but when it's easy to do, they do:
    >>> x = "hello world, this is a test"
    >>> x.strip("@") is x
    True
    >>> x.replace("@", "#") is x
    True
    >>> x.zfill(5) is x
    True
    >>> x.center(5) is x
    True

But:

    >>> x.lower() is x
    False
    >>> x.lower() == x
    True
ChrisA
Can we just fix this? The lack of an obvious way to delete a prefix or suffix is a continual pain point for users of the language. We've just had yet another bug report from some poor user who misunderstood the lstrip method:

https://bugs.python.org/issue39880

More examples:

https://bugs.python.org/issue37114
https://bugs.python.org/issue36410
https://bugs.python.org/issue32772
https://bugs.python.org/issue25979
https://stackoverflow.com/questions/4148974/is-this-a-bug-in-python-2-7
https://stackoverflow.com/questions/34544247/understanding-pythons-lstrip-me...

Obviously there will be a month of bike-shedding arguments about the names *wink* but can we at least agree that this is a genuine source of confusion and a useful addition to the string API?

-- Steven
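The misunderstanding behind those reports can be reproduced in a few lines. This is a sketch using the mailto example from the start of the thread; `remove_prefix` is a hypothetical helper written out here, not a name anyone has agreed on.

```python
s = "mailto:maria@gmail.com"

# lstrip treats its argument as a set of characters, not as a prefix:
# it keeps stripping while the next character is any of m, a, i, l, t, o, ':'.
surprising = s.lstrip("mailto:")
print(surprising)  # ria@gmail.com


def remove_prefix(text, prefix):
    """Hypothetical helper: drop prefix once if text starts with it."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text


expected = remove_prefix(s, "mailto:")
print(expected)  # maria@gmail.com
```

The leading "ma" of "maria" is eaten because both letters are in the character set, which is exactly the surprise users keep reporting.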
Yes. On Fri, Mar 6, 2020 at 2:19 PM Steven D'Aprano <steve@pearwood.info> wrote:
Can we just fix this? The lack of an obvious way to delete a prefix or suffix is a continual pain point for users of the language. We've just had yet another bug report from some poor user who misunderstood the lstrip method:
https://bugs.python.org/issue39880
More examples:
https://bugs.python.org/issue37114 https://bugs.python.org/issue36410 https://bugs.python.org/issue32772 https://bugs.python.org/issue25979
https://stackoverflow.com/questions/4148974/is-this-a-bug-in-python-2-7
https://stackoverflow.com/questions/34544247/understanding-pythons-lstrip-me...
Obviously there will be a month of bike-shedding arguments about the names *wink* but can we at least agree that this is a genuine source of confusion and a useful addition to the string API?
-- Steven
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On 03/06/2020 02:59 PM, Guido van Rossum wrote:
On Fri, Mar 6, 2020 at 2:19 PM Steven D'Aprano wrote:
Can we just fix this?
Obviously there will be a month of bike-shedding arguments about the names *wink* but can we at least agree that this is a genuine source of confusion and a useful addition to the string API?
Yes.
Excellent. I think we should have a `stripstr()` as an alias for strip, and a new `stripchr()`. And I'm perfectly okay with bytes() not having those methods. ;-) -- ~Ethan~
On Fri, Mar 06, 2020 at 03:33:49PM -0800, Ethan Furman wrote:
I think we should have a `stripstr()` as an alias for strip, and a new `stripchr()`.
Shouldn't they be the other way around? `strip` removes chars from a set of chars; the proposed method will remove a prefix/suffix.
And I'm perfectly okay with bytes() not having those methods. ;-)
If heavy users of bytes want these methods, they can request them separately. There's no backwards compatibility requirement for new string methods to be automatically added to bytes. I guess the question now is do we need a PEP? -- Steven
This will only need a PEP if the eventual proposal is controversial. :-) Someone (Andrew Barnert?) has claimed that another name, like cut or trim, would be too confusing. I’m not sure I agree. I think strip_prefix is more confusing, and stripstr or strstrip would unnecessarily cut off the option of adding this to bytes. (Since bytes may be used for file names I think they should get this new capability too.) On Fri, Mar 6, 2020 at 16:10 Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Mar 06, 2020 at 03:33:49PM -0800, Ethan Furman wrote:
I think we should have a `stripstr()` as an alias for strip, and a new `stripchr()`.
Shouldn't they be the other way around?
`strip` removes chars from a set of chars; the proposed method will remove a prefix/suffix.
And I'm perfectly okay with bytes() not having those methods. ;-)
If heavy users of bytes want these methods, they can request them separately. There's no backwards compatibility requirement for new string methods to be automatically added to bytes.
I guess the question now is do we need a PEP?
-- Steven
-- --Guido (mobile)
(Since bytes may be used for file names I think they should get this new capability too.) I don’t really care one way or another, but is it really still the case that bytes need to be used for filenames? For uses other than just passing them around? Sigh. In any case, while it’s fine to consider the bytes issue in choosing a name, I hope it doesn’t derail the whole idea. -CHB
On Fri, Mar 6, 2020 at 16:10 Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Mar 06, 2020 at 03:33:49PM -0800, Ethan Furman wrote:
I think we should have a `stripstr()` as an alias for strip, and a new `stripchr()`.
Shouldn't they be the other way around?
`strip` removes chars from a set of chars; the proposed method will remove a prefix/suffix.
And I'm perfectly okay with bytes() not having those methods. ;-)
If heavy users of bytes want these methods, they can request them separately. There's no backwards compatibility requirement for new string methods to be automatically added to bytes.
I guess the question now is do we need a PEP?
-- Steven
-- --Guido (mobile)
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Fri, Mar 6, 2020 at 5:46 PM Christopher Barker <pythonchb@gmail.com> wrote:
(Since bytes may be used for file names I think they should get this new capability too.)
I don’t really care one way or another, but is it really still the case that bytes need to be used for filenames? For uses other than just passing them around?
Yes, Linux in particular does not guarantee that file names are using any particular encoding (let alone a consistent encoding for different files). The only two bytes that are special are '\0' and '/'.
Sigh.
Indeed, especially since macOS *does* guarantee that filenames are Unicode (even using a specific normalization) and Windows represents filenames internally as UTF-16, IIRC.
In any case, while it’s fine to consider the bytes issue in choosing a name, I hope it doesn’t derail the whole idea.
I didn't like the name stripstr anyway. :-)
-CHB
On Fri, Mar 6, 2020 at 16:10 Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Mar 06, 2020 at 03:33:49PM -0800, Ethan Furman wrote:
I think we should have a `stripstr()` as an alias for strip, and a new `stripchr()`.
Shouldn't they be the other way around?
`strip` removes chars from a set of chars; the proposed method will remove a prefix/suffix.
And I'm perfectly okay with bytes() not having those methods. ;-)
If heavy users of bytes want these methods, they can request them separately. There's no backwards compatibility requirement for new string methods to be automatically added to bytes.
I guess the question now is do we need a PEP?
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
On Fri, Mar 6, 2020 at 5:54 PM Guido van Rossum <guido@python.org> wrote:
(Since bytes may be used for file names I think they should get this new capability too.)
I don’t really care one way or another, but is it really still the case that bytes need to be used for filenames? For uses other than just passing them around?
Yes, Linux in particular does not guarantee that file names are using any particular encoding (let alone a consistent encoding for different files). The only two bytes that are special are '\0' and '/'.
I *think* I understand the issues. And I can see that some software would need to work with filenames as arbitrary bytes. But that doesn't mean that you can do much with them that way. I can see filename.split(b'/') for instance, but how could you strip a prefix or suffix without knowing the encoding? filename.strip_suffix(b'.txt') would only work for ASCII-compatible encodings.

There's no way around the fact that you have to make SOME assumptions about the encoding if you are going to do anything other than pass it around or work with the b'/' byte. And if that's the case, then you might as well decode and use 'surrogateescape' so the program won't crash.

Getting OT, but I do wonder if we should continue to support (and therefore encourage) the use of bytes in inappropriate ways.

I didn't like the name stripstr anyway. :-)
Neither do I, so I guess I shouldn't have brought this up ... -CHB
On Fri, Mar 6, 2020 at 16:10 Steven D'Aprano <steve@pearwood.info> wrote:
On Fri, Mar 06, 2020 at 03:33:49PM -0800, Ethan Furman wrote:
I think we should have a `stripstr()` as an alias for strip, and a new `stripchr()`.
Shouldn't they be the other way around?
`strip` removes chars from a set of chars; the proposed method will remove a prefix/suffix.
And I'm perfectly okay with bytes() not having those methods. ;-)
If heavy users of bytes want these methods, they can request them separately. There's no backwards compatibility requirement for new string methods to be automatically added to bytes.
I guess the question now is do we need a PEP?
-- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
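A minimal sketch of the decode, manipulate, encode round trip with the 'surrogateescape' error handler mentioned above. The file name and the choice of UTF-8 are assumptions made for the example; on POSIX systems os.fsdecode() and os.fsencode() apply the same handler with the filesystem encoding.

```python
# A file name whose bytes are not valid UTF-8 (0xE9 is "é" in Latin-1).
raw_name = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates instead of raising an error.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Manipulate as str; the suffix itself is plain ASCII.
if text_name.endswith(".txt"):
    text_name = text_name[: -len(".txt")]

# Encoding with the same codec and handler restores the original bytes.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")
print(round_tripped)  # b'caf\xe9'
```

The intermediate string is only "maybe right" for display, but the bytes that could not be decoded survive the trip unchanged.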
On 07Mar2020 15:01, Christopher Barker <pythonchb@gmail.com> wrote:
On Fri, Mar 6, 2020 at 5:54 PM Guido van Rossum <guido@python.org> wrote:
(Since bytes may be used for file names I think they should get this new capability too.)
I don’t really care one way or another, but is it really still the case that bytes need to be used for filenames? For uses other than just passing them around?
Yes, Linux in particular does not guarantee that file names are using any particular encoding (let alone a consistent encoding for different files). The only two bytes that are special are '\0' and '/'.
I *think* I understand the issues. And I can see that some software would need to work with filenames as arbitrary bytes. But that doesn't mean that you can do much with them that way.
Given that the entire UNIX filename API is bytes, I think this isn't very true.
I can see filename.split(b'/') for instance, but how could you strip a prefix or suffix without knowing the encoding?
Well, directly: filename.cutsuffix(b'.abc')

But more seriously, you're either treating them as bytes with no particular encoding and the above just means "remove these 4 bytes" or you do know the encoding and are working with strings, so you'd either have a string and cut a string, or have bytes and cut the value '.abc'.encode(encoding=known_encoding).

Things like listdir are dual mode: call it with a bytes directory name and you get bytes results, call it with a string directory name and you get string results. There's some funky encoding accommodation in there (read the docs, it's a little subtlety to do with returning strings which didn't decode cleanly from the underlying bytes).
filename.strip_suffix(b'.txt') would only work for ASCII-compatible encodings.
Or b'.txt' is your known bytes encoding of some known string suffix in your working encoding. But like the other string-like bytes methods, I think there's a good case for supporting bytes prefixes and suffixes; it is just a matter of using the correct bytes affix in the regime you're working in. Might not be filenames, either.
There's no way around the fact that you have to make SOME assumptions about the encoding if you are going to do anything other than pass it around or work with the b'/' byte.
They needn't be assumptions; all code has some outer context.
And if that's the case, then you might as well decode and use 'surrogateescape' so the program won't crash.
Ah, I see you've encountered the listdir-return-string stuff already then.
Getting OT, but I do wonder if we should continue to support (and therefore encourage) the use of bytes in inappropriate ways.
I think there's plenty of reasonable bytes actions which look a lot like string actions, and are not confusing. Consider this contrived example: payload_bytes = packet_bytes.cutprefix(header_bytes) There was an interesting writeup by a guy involved in the mercurial Python 3 port where he discusses the pain which came with the bytes type lacking a lot of the string support methods when Python 3 first came out. He suggests a lot of things would have gone far smoother with these, as Mercurial had a lot of filenames-as-bytes-strings inside. Here we are: https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journey-to-and-reflec... Personally I lean the other way, and welcomed the initial lack of stringish methods as a good way to uncover bytes mistakenly used for strings. But I see his point. Cheers, Cameron Simpson <cs@cskk.id.au>
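A small sketch of affix cutting on bytes, assuming the function form of cutprefix/cutsuffix rather than a method. The packet layout, the file name and the choice of Latin-1 as the "known encoding" are made up for illustration.

```python
def cutprefix(s, prefix):
    """Works for str or bytes, as long as s and prefix are the same type."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s


def cutsuffix(s, suffix):
    """Works for str or bytes, as long as s and suffix are the same type."""
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s


# Pure bytes, no text encoding involved (made-up packet layout).
header_bytes = b"HDR1"
packet_bytes = header_bytes + b"payload"
payload_bytes = cutprefix(packet_bytes, header_bytes)
print(payload_bytes)  # b'payload'

# When the encoding is known, encode the affix with it before cutting.
known_encoding = "latin-1"  # assumed for this example
filename = "résumé.abc".encode(known_encoding)
stem = cutsuffix(filename, ".abc".encode(known_encoding))
print(stem)  # b'r\xe9sum\xe9'
```

The same two functions serve both regimes because startswith, endswith and slicing behave alike on str and bytes.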
On Sat, Mar 7, 2020, at 19:31, Cameron Simpson wrote:
I *think* I understand the issues. And I can see that some software would need to work with filenames as arbitrary bytes. But that doesn't mean that you can do much with them that way.
Given that the entire UNIX filename API is bytes, I think this isn't very true.
Most real-world UNIX systems only support ASCII-compatible encodings. There's no reason not to solve the problem on such systems by using os.fsdecode(). On those few that do not (of which I don't know if any support *both* ASCII and non-ASCII-compatible encodings in locales - from what I can find, those that don't use ASCII-compatible encodings tend to exclusively use EBCDIC ones) I don't know how they handle these cases, or if python even supports any of them at all, but it seems likely that b'/' will not be the same byte as the path separator.
Most real-world UNIX systems only support ASCII-compatible encodings. There's no reason not to solve the problem on such systems by using os.fsdecode().
Huh?! Is my Ubuntu derivative not "real world"?

    666-tmp % uname -a
    Linux popkdm 5.3.0-7629-generic #31~1581628825~19.10~f90b7d5-Ubuntu SMP Fri Feb 14 19:56:45 UTC x86_64 x86_64 x86_64 GNU/Linux
    667-tmp % touch ✗—Not-ASCII
    668-tmp % ls ✗*
    ✗—Not-ASCII
    672-tmp % ls ✗* | hexdump -C
    00000000  e2 9c 97 e2 80 94 4e 6f  74 2d 41 53 43 49 49 0a  |......Not-ASCII.|
    00000010

--
Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On 10/03/2020 14:58, David Mertz wrote:
Most real-world UNIX systems only support ASCII-compatible encodings. There's no reason not to solve the problem on such systems by using os.fsdecode().
Huh?!
Is my Ubuntu derivative not "real world"?
666-tmp % uname -a Linux popkdm 5.3.0-7629-generic #31~1581628825~19.10~f90b7d5-Ubuntu SMP Fri Feb 14 19:56:45 UTC x86_64 x86_64 x86_64 GNU/Linux 667-tmp % touch ✗—Not-ASCII 668-tmp % ls ✗* ✗—Not-ASCII
672-tmp % ls ✗* | hexdump -C 00000000 e2 9c 97 e2 80 94 4e 6f 74 2d 41 53 43 49 49 0a |......Not-ASCII.| 00000010
Yes, but it is ASCII-compatible; ASCII characters are encoded as their 7-bit ASCII values. I'm not sure this is a particularly useful observation, mind you. -- Rhodri James *-* Kynesim Ltd
On Mar 10, 2020, at 08:01, David Mertz <mertz@gnosis.cx> wrote:
Most real-world UNIX systems only support ASCII-compatible encodings. There's no reason not to solve the problem on such systems by using os.fsdecode().
Huh?!
Is my Ubuntu derivative not "real world"?
666-tmp % uname -a Linux popkdm 5.3.0-7629-generic #31~1581628825~19.10~f90b7d5-Ubuntu SMP Fri Feb 14 19:56:45 UTC x86_64 x86_64 x86_64 GNU/Linux 667-tmp % touch ✗—Not-ASCII 668-tmp % ls ✗* ✗—Not-ASCII
Technically your Ubuntu derivative is not a real-world UNIX system, because it’s not a UNIX system. Only a handful of Linux distros bother to be certified, because it’s not worth the cost unless you need to sell to some corporate or government department who have some regulation requiring UNIX.

And practically, I’m pretty sure that’s UTF-8, which is ASCII-compatible: every byte from 0-127 always means the same thing as it does in ASCII. This means you can, e.g., do path.split(os.sep.encode('ascii')) and know you’re getting the right behavior. The same thing works for Latin-1 and friends, and the IBM code pages in the “extended ASCII” group, and so on. Those are the kinds of things Random was presumably talking about, because they are commonly used in real-world UNIX systems.

There are also things that are not ASCII-compatible but are close. For example, in Shift-JIS, a couple of low bytes have a different meaning than in ASCII, and many of them can also appear as part of a 2-byte character, but ASCII NUL and slash still always mean NUL and slash, so you can use it for your Linux filesystems. (Although you will have a lot of trouble in the shell, because your backslash escape is now a yen escape, and 64 other characters have the same byte invisibly as their second byte.)

Things that are not even that ASCII-compatible include UTF-16, EBCDIC code pages, 80s Atari encoding, etc.; they are not commonly used in real-world UNIX systems. Which I think was Random’s point.
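A sketch of why splitting on the separator byte is safe for ASCII-compatible encodings. The paths are invented, and the separator is hard-coded as b"/" so the example behaves the same on any platform.

```python
SEP = b"/"  # the POSIX path separator as a single ASCII byte

# UTF-8: every byte below 0x80 is the ASCII character it looks like,
# so splitting on b"/" can never cut a multi-byte character in half.
utf8_path = "docs/✗—Not-ASCII/naïve.txt".encode("utf-8")
utf8_parts = [part.decode("utf-8") for part in utf8_path.split(SEP)]
print(utf8_parts)  # ['docs', '✗—Not-ASCII', 'naïve.txt']

# Latin-1 is a single-byte encoding whose low half is ASCII, so the same holds.
latin1_path = "año/niño".encode("latin-1")
latin1_parts = [part.decode("latin-1") for part in latin1_path.split(SEP)]
print(latin1_parts)  # ['año', 'niño']
```

In both cases the split needs no knowledge of the encoding beyond the fact that it is ASCII-compatible.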
Getting a bit OT, but I *think* this is the story:

I've heard it argued, by folks that want to write Python software that uses bytes for filenames, that:

A file path on a *nix system can be any string of bytes, except two special values:

    b'\x00' : null
    b'\x2f' : slash

(consistent with this SO post, among many other sources: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-... )

So any encoding will work, as long as those two values mean the right thing. Practically, null is always null, so that leaves the slash.

So any encoding that uses b'\x2f' for the slash would work. Which seems to include, for instance, UTF-16:

    In [31]: "/".encode('utf-16')
    Out[31]: b'\xff\xfe/\x00'

    In [40]: [hex(b) for b in "/".encode('utf-16')]
    Out[40]: ['0xff', '0xfe', '0x2f', '0x0']

However, if one were to actually use that in raw form, and, for instance, split on the \x2f byte, you wouldn't get anything useful.

    In [53]: first, second = "first_part/second_part".encode('utf-16').split(b'/')

    In [54]: first.decode('utf-16')
    Out[54]: 'first_part'

    In [55]: second.decode('utf-16')
    ---------------------------------------------------------------------------
    UnicodeDecodeError                        Traceback (most recent call last)
    <ipython-input-55-9eec3a9ebb3d> in <module>
    ----> 1 second.decode('utf-16')

    UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x00 in position 22: truncated data

In practice, I suspect that every *nix system uses encoding(s) that are ASCII-compatible for the first 127 values, or more precisely that have the slash character value be a single byte of value: 2f. And as long as that's the case (and that value doesn't show up anywhere else) then software can know nothing about the encoding, and still do two things:

- pass it around
- split on the slash.

Which may be enough for various system tools. So a fine argument for facilitating their use in things like globbing directories, opening files, etc.
But as soon as you have any interaction with humans, then filenames need to be human meaningful. And as soon as you manipulate the names beyond splitting and merging on a slash, then you do need to know something about the encoding. In practice, maybe knowing that it's ASCII compatible in the first 127 bytes will get pretty far, as you can do things like:

    if filename_bytes.endswith(b'.txt'):
        root_name = filename_bytes[:-4]

So adding the new stripsuffix or whatever we call it makes sense.

However: As soon as someone wants to do anything even a bit more sophisticated, that may involve non-ASCII characters, that would all go to heck. And my understanding is that with the 'surrogateescape' error handlers, you can convert to a "maybe right" encoding, manipulate it, and then convert back, using the same encoding. Though this still goes to heck if the encoding uses more than one byte for the slash (or a surrogate escape is part of some other manipulation you may do).

Anyway -- this is why it seems like a bad idea to give the bytes object any more "string like" functionality. But bytes has a pretty full set of "string like" methods now, so I suppose it makes sense to add a couple new ones that are related to ones that are already there.

-CHB

-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Wed, Mar 11, 2020 at 7:19 AM Christopher Barker <pythonchb@gmail.com> wrote:
Getting a bit OT, but I *think* this is the story:
I've heard it argued, by folks that want to write Python software that uses bytes for filenames, that:
A file path on a *nix system can be any string of bytes, except two special values:
b'\x00' : null b'\x2f' : slash
(consistent with this SO post, among many other sources: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-...)
So any encoding will work, as long as those two values mean the right thing. Practically, null is always null, so that leaves the slash
So any encoding that uses b'\x2f' for the slash would work. Which seems to include, for instance, UTF-16:
In [31]: "/".encode('utf-16') Out[31]: b'\xff\xfe/\x00'
Nope, see above about b'\x00' :)
In practice, maybe knowing that it's ascii compatible in the first 127 bytes will get pretty far...
That's exactly what "ASCII compatible" means. Since ASCII is a seven-bit encoding, an encoding is ASCII-compatible if (a) every ASCII character is represented by the corresponding byte value, and (b) every seven-bit value represents that ASCII character. ChrisA
On Wed, Mar 11, 2020 at 07:28:06AM +1100, Chris Angelico wrote:
That's exactly what "ASCII compatible" means. Since ASCII is a seven-bit encoding, an encoding is ASCII-compatible if (a) every ASCII character is represented by the corresponding byte value, and (b) every seven-bit value represents that ASCII character.
Sorry Chris, that explanation left me more confused than I started :-(

Let me have a go...

The ASCII encoding is a mapping between *seven-bit numeric values* and 128 distinct characters, some of which are human-readable:

    A = 1000001
    B = 1000010
    a = 1100001

and some of which are considered to be "binary" characters:

    NUL = 0000000
    SOH = 0000001
    DEL = 1111111

In practice today, seven bits are inconvenient, so these are always padded with a leading 0 bit.

An encoding is compatible with ASCII if, and only if, the following is true:

* all 128 of the ASCII characters are handled by the encoding;
* each of those characters is mapped to the same eight-bit value as the ASCII encoding would use (including the leading 0 bit);
* no non-ASCII character is mapped to one of those eight-bit values;
* or to something which could be confused with one of those eight-bit values by a naive application that processed them a byte at a time.

E.g. if an encoding mapped some character ∇ to the 16-bit value:

    01000001 11110000

that would not be considered ASCII-compatible, because the first byte would be interpreted as "A" by a naive application.

Most (all?) of the "extended ASCII" eight-bit encodings are ASCII compatible, because they use only bytes with a leading 1 for the non-ASCII characters.

UTF-8 is also ASCII compatible.

UTF-16 and UTF-32 are *not* ASCII compatible.

How did I go?

-- Steven
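The conditions above can be turned into a rough check. This is a heuristic sketch, not a proof: `looks_ascii_compatible` and its sample characters are assumptions made for this example, and the second condition is only spot-checked on those samples.

```python
def looks_ascii_compatible(codec, samples="é€表ソ∇"):
    """Heuristic check of the conditions above for a named Python codec."""
    # Every ASCII character must encode to its own single byte value.
    for value in range(128):
        try:
            if chr(value).encode(codec) != bytes([value]):
                return False
        except UnicodeEncodeError:
            return False
    # Spot check: non-ASCII samples must never produce a byte below 0x80,
    # since a byte-at-a-time reader would mistake it for an ASCII character.
    for ch in samples:
        try:
            encoded = ch.encode(codec)
        except UnicodeEncodeError:
            continue  # this codec cannot represent the sample; skip it
        if any(byte < 0x80 for byte in encoded):
            return False
    return True


for name in ("utf-8", "latin-1", "utf-16", "utf-32", "cp500"):
    print(name, looks_ascii_compatible(name))
```

With these samples the check reports UTF-8 and Latin-1 as compatible and UTF-16, UTF-32 and the EBCDIC code page cp500 as not, in line with the summary above.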
On Wed, Mar 11, 2020 at 9:28 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Mar 11, 2020 at 07:28:06AM +1100, Chris Angelico wrote:
That's exactly what "ASCII compatible" means. Since ASCII is a seven-bit encoding, an encoding is ASCII-compatible if (a) every ASCII character is represented by the corresponding byte value, and (b) every seven-bit value represents that ASCII character.
Sorry Chris, that explanation left me more confused than I started :-(
Let me have a go...
The ASCII encoding is a mapping between *seven-bit numeric values* and 128 distinct characters, some of which are human-readable:
A = 1000001 B = 1000010 a = 1100001
and some of which are considered to be "binary" characters:
NUL = 0000000 SOH = 0000001 DEL = 1111111
Correct.
In practice today, seven bits are inconvenient, so these are always padded with a leading 0 bit.
Yes, since there's no practical way to represent ASCII characters in seven-bit units, so we store those numbers in eight-bit bytes.
An encoding is compatible with ASCII if, and only if, the following is true:
* all 128 of the ASCII characters are handled by the encoding;
* each of those characters are mapped to the same eight-bit value as the ASCII encoding would use (including the leading 0 bit);
Correct - this is my "(a)" condition
* no non-ASCII character is mapped to one of those eight-bit values;
* or to something which could be confused with one of those eight-bit values by a naive application that processed them a byte at a time.
And correct - this is my "(b)" condition. Any byte value below 128 must represent the corresponding ASCII character, and nothing else.
E.g. if an encoding mapped some character ∇ to the 16-bit value:
01000001 11110000
that would not be considered ASCII-compatible, because the first byte would be interpreted as "A" by a naive application.
Exactly.
Most (all?) of the "extended ASCII" eight-bit encodings are ASCII compatible, because they use only bytes with a leading 1 for the non-ASCII characters.
Right. ASCII-compatible and a single-byte encoding, simple, straight-forward, and easy to work with. But, of course, limited to just 128 non-ASCII characters.
UTF-8 is also ASCII compatible.
UTF-16 and UTF-32 are *not* ASCII compatible.
How did I go?
Nailed it. And explained it far more clearly than I did. ChrisA
On Mar 10, 2020, at 13:18, Christopher Barker <pythonchb@gmail.com> wrote:
Getting a bit OT, but I *think* this is the story:
I've heard it argued, by folks that want to write Python software that uses bytes for filenames, that:
A file path on a *nix system can be any string of bytes, except two special values:
b'\x00' : null b'\x2f' : slash
(consistent with this SO post, among many other sources: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-...)
So any encoding will work, as long as those two values mean the right thing. Practically, null is always null, so that leaves the slash
No; there are plenty of encodings where a 0 byte doesn’t always mean NUL. And in fact, that’s exactly the problem with UTF-16: every ASCII character in UTF-16 is the same byte preceded or followed (depending on endianness) by a 0 byte. And you aren’t allowed to have arbitrary 0 bytes like that in your paths.
So any encoding that uses b'\x2f' for the slash would work.
Even besides the zero problem, it has to not only always use 0x2f for slash, but also never use 0x2f for anything else. This was a problem for many earlier East Asian encodings, where a slash is 0x2f, but some kanji character is also 0x93 0x2f, or some kana character is 0x2f after a mode shift, etc. In such cases, every 0x2f byte gets treated as a path separator, even the ones that don’t mean slash. There are encodings that are not ASCII compatible that nevertheless guarantee that 0x00 always means NUL and vice versa and that 0x2f always means slash and vice versa, like Shift-JIS. Many of them will cause problems in the shell, file manager GUIs, etc., but that’s a different part of the specification (and Unix already allows you to have non-printable, etc. characters in file names, so that problem is there even with ASCII). Many of them also aren’t usable for pathnames on other platforms (e.g., Shift-JIS does guarantee that 0x2f always means slash, but 0x5c doesn’t always mean backslash; it means yen or the second half of various kanji, so you don’t want to use it for byte paths on Windows). But for Unix pathnames, they are usable. But again, UTF-16 is not one of them.
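A small illustration of the Shift-JIS point, using Python's shift_jis codec. The file name is an invented example; the two characters ソ and 表 were chosen because their second byte is 0x5c.

```python
# An invented file name with no backslash in it at all.
name = "ソース/表.txt"
raw = name.encode("shift_jis")

# Splitting on 0x2f is safe: 0x2f never appears inside a two-byte character.
slash_parts = raw.split(b"/")
decoded_parts = [part.decode("shift_jis") for part in slash_parts]

# Splitting on 0x5c is not: it cuts ソ and 表 in half.
backslash_parts = raw.split(b"\\")

print(decoded_parts)
print(len(backslash_parts))
```

So a byte-at-a-time tool that only cares about "/" gets the right answer here, while one that treats 0x5c as a separator tears the name apart.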
Which seems to include, for instance, UTF-16:
In [31]: "/".encode('utf-16') Out[31]: b'\xff\xfe/\x00'
In this case, you will get very lucky—or, maybe better, unlucky. This is illegal, but in practice no API can detect that it’s illegal, because all of the POSIX and libc functions and most third-party functions just take a null-terminated string, meaning they silently truncate right after the first Latin-1 character, and your string is exactly one Latin-1 character long.
Christopher, I'm not sure how much of the following you already know, so excuse me in advance if I'm covering old ground for you. But hopefully it will be helpful for someone! On Tue, Mar 10, 2020 at 01:18:22PM -0700, Christopher Barker wrote:
Getting a bit OT, but I *think* this is the story:
I've heard it argued, by folks that want to write Python software that uses bytes for filenames, that:
A file path on a *nix system can be any string of bytes, except two special values:
b'\x00' : null b'\x2f' : slash
To be precise: file *names* cannot contain null bytes or slashes. Paths can contain slashes, which represent directory separators. To be even more precise: technically this depends on the file system, not the OS. There are a tiny handful of file systems that support NULs and/or slashes in file names, although whether you could actually operate on those files in practice is another story. In practice, the prohibition on NUL and slash is baked so deeply into the Unix world at all levels (OS, shells, applications) that even if you had a file system that supported them, I doubt you would be able to use those characters in file names.
So any encoding will work, as long as those two values mean the right thing. Practically, null is always null, so that leaves the slash So any encoding that uses b'\x2f' for the slash would work. Which seems to include, for instance, UTF-16:
In [31]: "/".encode('utf-16')
Out[31]: b'\xff\xfe/\x00'
You probably don't want the UTF-16 BOM (byte-order-mark) in the name. So you want "/".encode('utf-16le') or perhaps 'utf-16be'. But either way, that won't work, because the path contains a NUL byte. Let's be concrete. Both of these are fine:

>>> open('/tmp/spam', 'w')
<_io.TextIOWrapper name='/tmp/spam' mode='w' encoding='UTF-8'>
>>> open(b'/tmp/spam', 'w')
<_io.TextIOWrapper name=b'/tmp/spam' mode='w' encoding='UTF-8'>

Both of those represent the same file, because:

>>> '/tmp/spam'.encode('utf-8') == b'/tmp/spam'
True

However, if I use UTF-16, it fails because the file name and path contains NUL bytes:

>>> path = '/tmp/spam'.encode('utf-16be')
>>> print(path)
b'\x00/\x00t\x00m\x00p\x00/\x00s\x00p\x00a\x00m'
>>> f = open(path, 'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: embedded null byte

Here are some fun and games with Unicode... There's no way to get the file b'/tmp/spam' from a UTF-16 string since it has an odd number of bytes, but I can get the file '/tmp/ spam' (note the leading space!) like this:

name = ('\N{CJK UNIFIED IDEOGRAPH-742F}'
        '\N{CJK UNIFIED IDEOGRAPH-706D}'
        '\N{NARROW NO-BREAK SPACE}'
        '\N{CJK UNIFIED IDEOGRAPH-7073}'
        '\N{CJK UNIFIED IDEOGRAPH-6D61}')
open(name.encode('utf-16le'), 'w')

That will open the file b'/tmp/ spam'. If that doesn't make you pine for the simpler days when the whole computing world used nothing but American English and liked it, then nothing will :-) [...]
In practice, I suspect that every *nix system uses encoding(s) that are ASCII-compatible for the first 127 values, or more precisely that have the slash character value be a single byte of value: 2f. And as long as that's the case (and that value doesn't show up anywhere else) then software can know nothing about the encoding, and still do two things:
pass it around split on the slash.
In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice properties: * Every ASCII character encodes to a single byte, so text which only contains ASCII values encodes to precisely the same set of bytes under UTF-8 as under ASCII. * No Unicode character, except for the Unicode NUL '\0', encodes to a sequence containing a null byte. These properties are not an accident -- they were carefully designed that way. In practice, the Unix OS doesn't care what encoding the user interface uses. It deals in bytes, and that's all it cares about. Only the interface cares about the encoding, because if you name a file "Mετăl" in the shell, you expect to see the same file name in the open file dialog of all your GUI applications. (And vice versa.) So all the various shells, GUIs etc have to agree to use the same encoding or users get mad.
Which may be enough for various system tools. So a fine argument for facilitating their use in things like globbing directories, opening files, etc.
But as soon as you have any interaction with humans, then filenames need to be human meaningful. and as soon as you manipulate the names beyond splitting and merging on a slash, then you do need to know something about the encoding.
Well... yes and no. In practice, Unix users don't care any more about encodings than Windows users do. Possibly less! Windows users have to deal with legacy systems that can use any of the hundreds of pre-Unicode "extended ASCII" 8 bit systems. If Stephen Turnbull is reading this, he can probably tell you scary stories about the chaos in pre-Unicode East Asian encodings like Big5 and Shift-JIS. In the Linux/BSD world, pretty much everyone uses UTF-8 unless you're doing something unusual, like trying to extract data from some old Russian CSV file or ancient Macintosh text file. In the shell, I couldn't care less about the encoding. I just name files whatever I want, and the shell deals with them:

[steve@ando ~]$ touch Mετăl
[steve@ando ~]$ ls -l M*l
-rw-rw-r-- 1 steve steve 0 Mar 11 20:46 Mετăl

(Alas, typing those non-ASCII characters is a PITA, I ended up having to enter them using a GUI "Character Map" application and paste them into the shell.) In Python, I never worry about encodings. I just use regular old Unicode strings:

open('Mετăl')

and it Just Works. I would expect that the majority of Unix users will be in the same boat. I think that the exception will be people writing applications that have to straddle the low-level "Unix file names are bytes" and high-level "you can use anything you can type as a file name" worlds. But bytes are useful for more than just file names! Anyone writing a binary file format needs to deal with bytes, and I'm confident that there are binary formats that have optional prefixes and suffixes that might need to be stripped before doing further processing.

if chunk.startswith(b'DEADBEEF'):
    chunk = chunk[8:]
process(chunk)

One possible example: you have some binary data which may or may not have a NUL byte at the end. The NUL byte is redundant in Python (we're not C) so you want to delete it:

if data.endswith(b'\0'):
    data = data[:-1]

-- Steven
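Both snippets hard-code the slice length (8 and 1), which is easy to get wrong when the marker changes. A minimal sketch of helpers that derive the length from the argument follows; the names cut_prefix and cut_suffix are placeholders, not a settled API, and the same code happens to work for both str and bytes.

```python
def cut_prefix(data, prefix):
    """Return data without prefix if it starts with it, else data unchanged."""
    if prefix and data.startswith(prefix):
        return data[len(prefix):]
    return data


def cut_suffix(data, suffix):
    """Return data without suffix if it ends with it, else data unchanged."""
    # The emptiness check matters: data[:-0] would be empty, not data.
    if suffix and data.endswith(suffix):
        return data[:-len(suffix)]
    return data


chunk = cut_prefix(b"DEADBEEFpayload", b"DEADBEEF")
data = cut_suffix(b"payload\0", b"\0")
print(chunk, data)
```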
On Wed, Mar 11, 2020 at 9:05 PM Steven D'Aprano <steve@pearwood.info> wrote:
In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice properties:
* Every ASCII character encodes to a single byte, so text which only contains ASCII values encodes to precisely the same set of bytes under UTF-8 as under ASCII.
* No Unicode character, except for the Unicode NUL '\0', encodes to a sequence containing a null byte.
These properties are not an accident -- they were carefully designed that way.
The second of those is actually part of an even stronger guarantee: No Unicode character except for an ASCII character encodes to a sequence containing a byte less than 128. In other words, the ASCII characters U+0000 to U+007F perfectly correspond to the byte values 0x00 to 0x7F, and *no other UTF-8 sequence* will ever contain one of those byte values. This makes parsing an ASCII-only file format easy. You don't have to worry about, for instance, finding a byte value 0x3C unless it represents "<". (Though if you're taking a more generic boundary like "whitespace", you'll need to cope with more than just bytes. But for something like HTML, this is safe.) Other ASCII-compatible encodings make the same guarantees, although a lot of them do this by having only 128 non-ASCII characters available. ChrisA
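A quick way to see the 0x3C guarantee in action is to count "<" bytes in the same text under UTF-8 and UTF-16. The sample text is invented; U+3C00 was picked only because its UTF-16 code unit contains the byte 0x3C.

```python
# One real "<" plus one CJK character whose code point is U+3C00.
text = "x < y \u3c00"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# Naive byte-level search for "<" (0x3C).
utf8_hits = utf8.count(b"<")
utf16_hits = utf16.count(b"<")

print(text.count("<"), utf8_hits, utf16_hits)
```

Under UTF-8 the byte count matches the character count; under UTF-16 the naive search reports a "<" that is not there.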
On Mar 11, 2020, at 03:07, Steven D'Aprano <steve@pearwood.info> wrote:
But bytes are useful for more than just file names!
The paradigm example of this is HTTP. It’s mostly people working on HTTP clients, servers, middleware, and apps who pushed for the bytes methods in Python 3.x. IIRC, the PEP for bytes.__mod__ (461) had links to a lot of discussion and history. But it’s probably not an exaggeration to say that if you couldn’t parse HTTP headers as bytes with split, strip, etc. (and maybe bytes regexes as well), the entire 3.x transition would have gone a lot worse. And Python itself has been doing something similar (if simpler) since the early 2.x days to find the source content encoding. Now that we have surrogateescape, maybe you could go back and redo all that code with str methods, but it would be less efficient, harder rather than easier to follow, just as easy to get wrong, and harder to debug. (I recently tried something similar with a parser for the rigid-language text chunks in a binary chunked file format.) That being said, none of this means the new methods necessarily have to be added to bytes. I think the bar is higher. Writing your own split is a daunting task for a novice, and easy to get wrong, so bytes.split makes sense. But a prefix stripper, that’s more a convenience than a must-have, and it might well be convenient enough for str but not quite enough for bytes. This thread has demonstrated people reinventing this wheel over and over in the wild for str, but has anyone found examples of people doing it for bytes?
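To make the HTTP case concrete, here is a deliberately simplified sketch of header parsing that never leaves bytes. The function name parse_headers and the sample request are invented, and real HTTP parsing has more corner cases (repeated headers, obsolete line folding) than this handles.

```python
def parse_headers(raw):
    """Parse the header block of a raw HTTP message into a dict of bytes."""
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    headers = {}
    for line in lines[1:]:  # lines[0] is the request or status line
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # not a "Name: value" line; ignored in this sketch
        headers[name.strip().lower()] = value.strip()
    return headers


request = (b"GET / HTTP/1.1\r\n"
           b"Host: example.com\r\n"
           b"Content-Type: text/html; charset=utf-8\r\n"
           b"\r\n"
           b"body")
headers = parse_headers(request)
print(headers)
```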
On 10/03/2020 20:18, Christopher Barker wrote: ...much about file naming in theory and practice under Unix, concluding with:
But bytes has a pretty full set of "string like" methods now, so I suppose it makes sense to add a couple new ones that are related to ones that are already there.
I disagree. We've headed off down the rabbit-hole of filenames for justification here, but surely pathlib is the correct tool if you are going to be chopping up filenames and path names? It already gives us OS-specific behaviour and the sort of partitioning of name elements that seem to be 90% of what people are asking for. -- Rhodri James *-* Kynesim Ltd
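For reference, this is roughly what pathlib already offers for taking names apart. The path is an invented example, and PurePosixPath is used so the result does not depend on the platform running it.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/tmp/photos/holiday.tar.gz")

name = path.name                  # final component
suffix = path.suffix              # last extension only
stem = path.stem                  # name without the last extension
without_ext = path.with_suffix("")
parts = path.parts                # all components, root included

print(name, suffix, stem, without_ext, parts)
```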
Rhodri James writes:
We've headed off down the rabbit-hole of filenames for justification here, but surely pathlib is the correct tool if you are going to be chopping up filenames and path names?
This isn't obvious to me. The majority of people (among those for whom my respect is "very high" or better) who hate Python 3 are people who spend much of their effort on byteslinging applications (Twisted and Mercurial come immediately to mind). I don't know if *they* think these APIs would be more useful than pathlib for them, but it's not obvious to me the APIs are *not* useful. I'm thinking of things like RFC 822-like headers, URI schemes, REST endpoints, yada yada yada. We should ask *them*. (By "we" I mean the proponents of the new APIs.)
It already gives us OS-specific behaviour and the sort of partitioning of name elements that seem to be 90% of what people are asking for.
I tend to agree, but I don't know what the byteslingers want/need, because I'm a text-oriented kinda guy. Maybe you know better, if you're a byteslinger, I'll take your word for it. I still think we should ask some of the folks I mentioned above. Steve
On 11/03/2020 18:45, Stephen J. Turnbull wrote:
Rhodri James writes:
We've headed off down the rabbit-hole of filenames for justification here, but surely pathlib is the correct tool if you are going to be chopping up filenames and path names?
This isn't obvious to me. The majority of people (among those for whom my respect is "very high" or better) who hate Python 3 are people who spend much of their effort on byteslinging applications (Twisted and Mercurial come immediately to mind). I don't know if *they* think these APIs would be more useful than pathlib for them, but it's not obvious to me the APIs are *not* useful. I'm thinking of things like RFC 822-like headers, URI schemes, REST endpoints, yada yada yada.
We should ask *them*. (By "we" I mean the proponents of the new APIs.)
That's fair. I don't deal with bytes in a way that prefixing or suffixing is any use for. I think even in the cgi module all the hairy parsing gets done elsewhere. I'm more concerned with how the original discussion seems to have obsessed about wanting prefix/suffix trimming to deal with filenames without seeming to have considered whether it's actually the right answer.
It already gives us OS-specific behaviour and the sort of partitioning of name elements that seem to be 90% of what people are asking for.
I tend to agree, but I don't know what the byteslingers want/need, because I'm a text-oriented kinda guy. Maybe you know better, if you're a byteslinger, I'll take your word for it. I still think we should ask some of the folks I mentioned above.
I'm not that sort of byteslinger, but I'm completely prepared to believe they exist in numbers! -- Rhodri James *-* Kynesim Ltd
On 11 Mar 2020, at 19:03, Rhodri James <rhodri@kynesim.co.uk> wrote:
On 11/03/2020 18:45, Stephen J. Turnbull wrote:
Rhodri James writes:
We've headed off down the rabbit-hole of filenames for justification here, but surely pathlib is the correct tool if you are going to be chopping up filenames and path names?
This isn't obvious to me. The majority of people (among those for whom my respect is "very high" or better) who hate Python 3 are people who spend much of their effort on byteslinging applications (Twisted and Mercurial come immediately to mind). I don't know if *they* think these APIs would be more useful than pathlib for them, but it's not obvious to me the APIs are *not* useful. I'm thinking of things like RFC 822-like headers, URI schemes, REST endpoints, yada yada yada. We should ask *them*. (By "we" I mean the proponents of the new APIs.)
That's fair. I don't deal with bytes in a way that prefixing or suffixing is any use for. I think even in the cgi module all the hairy parsing gets done elsewhere. I'm more concerned with how the original discussion seems to have obsessed about wanting prefix/suffix trimming to deal with filenames without seeming to have considered whether it's actually the right answer.
It already gives us OS-specific behaviour and the sort of partitioning of name elements that seem to be 90% of what people are asking for. I tend to agree, but I don't know what the byteslingers want/need, because I'm a text-oriented kinda guy. Maybe you know better, if you're a byteslinger, I'll take your word for it. I still think we should ask some of the folks I mentioned above.
I'm not that sort of byteslinger, but I'm completely prepared to believe they exist in numbers!
I do byte slinging in Python as my current and previous day jobs. In the last job we had an appliance that allowed upload of files via HTTP POST and FTP. We ran the file system with UTF-8 encoding and the interface in unicode. But, as I recall, we found out that FTP allows the client to set the encoding, and that led to us having to process filenames very carefully, as they were not always UTF-8. We handled this as a special case and allowed the user to rename the files from a byte filename into a nice unicode name. I cannot remember if the USB sticks formatted on Windows suffered from encoding problems. I'm not sure if we have code in the current job's code base to strip a prefix/suffix, but I would not be surprised. As almost all the data comes from HTTP, we process it in bytes. The cost of converting to and from unicode would drop the TPS rate more than we could bear. If the prefix/suffix functions are added to unicode then it would certainly be nice to have the same API for bytes, and I'd use them. But I can always maintain our own functions to do this (we might be doing that already). Barry
-- Rhodri James *-* Kynesim Ltd _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/6XNHB2... Code of Conduct: http://python.org/psf/codeofconduct/
On Wed, Mar 11, 2020 at 6:59 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
I disagree. We've headed off down the rabbit-hole of filenames for justification here, but surely pathlib is the correct tool if you are going to be chopping up filenames and path names?
Does pathlib work correctly for paths in unknown (and mixed) encodings? Personally, I'm happy to consider filenames in an arbitrary encoding "broken", but I don't write the kind of system tools where you can't do that. But anyway, it's a bit of a red herring -- as others posted -- if the "byte slingers" would find it useful, then that's all we need to know. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On 03/06/2020 04:07 PM, Steven D'Aprano wrote:
On Fri, Mar 06, 2020 at 03:33:49PM -0800, Ethan Furman wrote:
I think we should have a `stripstr()` as an alias for strip, and a new `stripchr()`.
Shouldn't they be the other way around?
`strip` removes chars from a set of chars; the proposed method will remove a prefix/suffix.
Um, yeah. Thanks for catching that. -- ~Ethan~
One thing I love about .startswith() and .endswith() is matching multiple options. It's a little funny the multiple options must be a tuple exactly (not a list, not a set, not an iterator), but whatever. It would be odd to lack that symmetry in the .cut_suffix() method.
E.g. now:
if fname.endswith(('.jpg', '.png', '.gif')): ...
I'd expect to be able to do:
basename = fname.cut_suffix(('.jpg', '.png', '.gif'))
An idea worth considering: one can think of the “strip” family of methods as currently taking an iterable of strings as an argument (since a string is itself a sequence of strings):
"abcd".rstrip("dc") 'ab'
It would not be a huge logical leap to allow them to take any iterable. Backward compatible, no new methods:
fname.rstrip(('.jpg', '.png', '.gif'))
It even, in my opinion, can clarify "classic" strip/rstrip/lstrip usage:
"abcd".rstrip(("d", "c")) 'ab'
Maybe I’m missing a breaking case though, or this isn’t as clear for others. Thoughts? Brandt
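As a reference point for this idea, here is how rstrip behaves today: the argument is treated as a set of characters, and anything other than a string (or None) is rejected. The file names are invented examples.

```python
# The argument is a set of characters to remove, not a suffix.
chars_result = "abcd".rstrip("dc")

# The classic surprise: '.', 'g', 'i' and 'f' are all stripped repeatedly,
# so more than the extension disappears.
gotcha = "config.gif".rstrip(".gif")

# A tuple is not accepted by the current method.
try:
    "photo.gif".rstrip((".jpg", ".png", ".gif"))
    tuple_accepted = True
except TypeError:
    tuple_accepted = False

print(chars_result, gotcha, tuple_accepted)
```

Since a tuple argument currently raises TypeError, giving it a meaning would not change the result of any call that works today, which is the sense in which the idea is backward compatible.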
I like this idea quite a lot. I cannot think of anything it would break on first consideration. On Sat, Mar 30, 2019, 8:28 PM Brandt Bucher <brandtbucher@gmail.com> wrote:
One thing I love about .startswith() and .endswith() is matching multiple options. It's a little funny the multiple options must be a tuple exactly (not a list, not a set, not an iterator), but whatever. It would be odd to lack that symmetry in the .cut_suffix() method.
E.g. now:
if fname.endswith(('.jpg', '.png', '.gif')): ...
I'd expect to be able to do:
basename = fname.cut_suffix(('.jpg', '.png', '.gif'))
An idea worth considering: one can think of the “strip” family of methods as currently taking an iterable of strings as an argument (since a string is itself a sequence of strings):
"abcd".rstrip("dc") 'ab'
It would not be a huge logical leap to allow them to take any iterable. Backward compatible, no new methods:
fname.rstrip(('.jpg', '.png', '.gif'))
It even, in my opinion, can clarify "classic" strip/rstrip/lstrip usage:
"abcd".rstrip(("d", "c")) 'ab'
Maybe I’m missing a breaking case though, or this isn’t as clear for others. Thoughts?
Brandt
On Sun, Mar 31, 2019 at 3:28 AM Brandt Bucher <brandtbucher@gmail.com> wrote:
An idea worth considering: one can think of the “strip” family of methods as currently taking an iterable of strings as an argument (since a string is itself a sequence of strings):
It would not be a huge logical leap to allow them to take any iterable. Backward compatible, no new methods:
fname.rstrip(('.jpg', '.png', '.gif'))
It even, in my opinion, can clarify "classic" strip/rstrip/lstrip usage:
"abcd".rstrip(("d", "c")) 'ab'
Maybe I’m missing a breaking case though, or this isn’t as clear for others. Thoughts?
Now with this syntax I would have to write: string.rstrip([substring]), which means I would have to remember to add extra brackets. Not that bad, but it needs extra caution, since it would relax the type of the argument, and IMO in the end it may become confusing: the behavior changes with the type of the argument. And again, an iterable with *multiple* elements needs useful and transparent behavior first. The only intuitive behavior seems to be this algorithm:
- take the first element
- check if it is a suffix of the string
- if yes, cut it off and STOP the iteration
- if not, repeat the check with the next element
So it guarantees that only one (the first found) element will be cut off. I don't see other useful cases yet. Even this one seems odd. More complicated behavior would be just too hard to follow. IMO having a separate method is more user-friendly. As for the name, I like "rcut" and "lcut": self-explanatory enough, and they match the names of other similar existing methods.
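The first-match algorithm described above can be sketched in a few lines. The name cut_first_suffix is invented for this example, and accepting a bare string as well as an iterable is an extra assumption that the message does not spell out.

```python
def cut_first_suffix(s, suffixes):
    """Cut off the first suffix in suffixes that s ends with, at most once."""
    if isinstance(suffixes, str):
        suffixes = (suffixes,)  # treat a bare string as a single suffix
    for suffix in suffixes:
        # Skip empty suffixes: s[:-0] would wrongly give an empty string.
        if suffix and s.endswith(suffix):
            return s[:-len(suffix)]  # cut it off and STOP
    return s


print(cut_first_suffix("photo.jpg", (".jpg", ".png", ".gif")))
print(cut_first_suffix("archive.tar.gz", (".gz", ".tar.gz")))
print(cut_first_suffix("a.gif.gif", (".gif",)))
```

Note that the order of the suffixes matters: with (".gz", ".tar.gz") the shorter one is found first, so only ".gz" is removed.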
On 30 Mar 2019, at 11:21, Paul Moore
Note that the proposed name (trim) is IMO "poorly named", because a number of languages in my experience use that name for what Python calls "strip", so there would be continual confusion (for me at least) over which name meant which behaviour..
That isn't the proposal as it stands now. The consensus among those supporting the idea is "strip_prefix" and "strip_suffix". Let's debate that. / Anders
Anders Hovmöller writes:
That isn't the proposal as it stands now. The consensus among those supporting the idea is "strip_prefix" and "strip_suffix". Let's debate that.
IMO, the "prefix" and "suffix" parts are necessary to fully clarify the intent (and do that well), so any of the verbs "strip", "trim", or "cut" work for me. I prefer Steven d'A's "cutprefix" and "cutsuffix" as the shortest, and by analogy to startswith/endswith, I'd drop the underscore (PEP 8 notwithstanding).
On Sat, Mar 30, 2019 at 10:21:23AM +0000, Paul Moore wrote:
On Fri, 29 Mar 2019 at 23:07, Christopher Barker <pythonchb@gmail.com> wrote:
The proposal at hand is to add two fairly straightforward methods to string. So:
Some of what you are calling digressions are actually questioning the design choice behind that proposal. Specifically, there's no particular justification given for making these methods rather than standalone functions.
Strings are objects and Python is an object-oriented language. Surely the default presumption ought to be that string functionality goes into the string object as methods, not a separate function, unless they're so specialised, or so large and unwieldy, that they ought to go into a module. There's a cost to moving functionality into a separate module. It's harder to discover functions buried in a module when most of the string functionality is in str itself, and it's rather a nuisance to write (e.g.): unicodedata.name('x') instead of 'x'.name(). I use unicodedata *a lot* and there is never a time I didn't wish it was built into str instead. Once upon a time all the useful str methods were functions in the string module. I don't think we should be re-introducing that annoyance. [...]
And how do we decide if they are poorly named, given that it's *very* hard to get real-world usage experience for a core Python change before it's released (essentially no-one uses pre-releases for anything other than testing that the release doesn't break their code).
By common sense.

s.xyahezgnfspwq(prefix)
s.lt(prefix)
s.remove_the_prefix_but_only_if_it_exists_on_the_left(prefix)

would all be poorly named. We surely don't need real-world usage experience to know that. Eliminate the obviously bad names, and you're left with names which ought to be at least reasonable.
Note that the proposed name (trim) is IMO "poorly named", because a number of languages in my experience use that name for what Python calls "strip", so there would be continual confusion (for me at least) over which name meant which behaviour...
Well there you go, you've just answered your own question about how to tell if the name is poor. That's why we have this list :-) Personally, I don't mind "ltrim" and "rtrim". We're not obliged to present the precise same method names as other languages, any more than they're obliged to consider Python's names. But since you and some others may be confused, I'm happy to bike-shed alternatives. I especially like:

cutprefix, cutsuffix
lcut, rcut
strip_prefix, strip_suffix

but presumably somebody will object to them too :-) We didn't need a lot of real-world experience before deciding on async etc, and the disruption risked by adding new keywords is *much* more serious than adding a couple of string methods. [...]
OTOH, it's a new feature, so it won't be acceptable for backporting.
Indeed. I'm not sure why we're talking about backporting. It isn't going to happen, so let's just move on. -- Steven
On Fri, Mar 29, 2019 at 12:06:25PM +0900, Stephen J. Turnbull wrote:
Anders Hovmöller writes: [...]
just like always. This seems totally irrelevant to the discussion. And it's of course irrelevant to all the end users that aren't writing libraries but are using python directly.
No, it's not "irrelevant". I wish we all would stop using that word, and trying to exclude others' arguments in this way.
I won't comment on Anders' claim that this issue is irrelevant to the discussion, but I think he is correct about it being irrelevant to "all the end users that aren't writing libraries but are using python directly" -- or at least those on the cutting edge of 3.8. There are lots of people who will soon be using nothing older than 3.8, and they will no more care that 3.7 lacks this feature than they will care that Python 1.5 lacks Unicode, iterators, and new-style classes. More power to them :-) For the sake of the argument, I'll grant your point that libraries which support older versions of Python cannot use the new feature[1]. But those libraries, and their users, are no worse off by adding a string method which they can't yet use. They will simply continue doing whatever it is that they already do, which will remain backward compatible to 3.3 or 2.7 or however far back they go. And some day they will have dropped support for 3.7 and older, and will be able to use all the new shiny features in 3.8. After all, if "libraries that support old versions can't use this feature" was a reason to reject new features, we would never have added *any* new feature past those available in Python 1.0. New features are added for the benefit of the present and the future, not for the past.
We are balancing equities here.
Indeed, and a certain level of caution is justified -- but not so much as to cause paralysis and stagnation. There's a word for a language which has stopped changing: "dead".
We have a plethora of changes, on the one side taken by itself each of which is an improvement, but on the other taken as a group they greatly increase the difficulty of learning to read Python programs fluently.
"Greatly"? Is it truly that hard to go help(str.cutprefix) at the interactive interpreter, or look it up in the docs? I mean, if a simple string method causes a developer that much confusion, imagine how badly they will cope with async! You can't read Python programs fluently unless you understand the custom functions and classes in that program. Compared to that, I don't think that it is especially difficult to learn what a couple of new methods do. Especially if their name is self-documenting. [...]
Putting it in a library virtually guarantees it will never become popular.
Factually, you're wrong. Many libraries have moved from PyPI to the stdlib, often very quickly as they prove their worth in a deliberate test.
The Python community is not the Javascript community, we don't tend to download tiny one-or-two line libraries. And that is a good thing: https://medium.com/commitlog/the-internet-is-at-the-mercy-of-a-handful-of-pe... Putting aside all those who are prohibited from using unapproved third-party libraries -- and there are a lot of them, from students using locked-down machines to corporate and government users where downloading unapproved software is grounds for instant dismissal -- I think most people simply couldn't be bothered installing and importing a package that offered something as simple as a couple of "cut" functions. While it's true that not every two-line function needs to be in the stdlib, it's often better to have it in the stdlib than expect ten thousand people to write the same two-line function over and over again.
Note that decimal was introduced with no literal syntax and is quite useful and used.
It was also added straight into the stdlib without being forced to go through the "third-party library" stage first, and with minimal discussion: https://mail.python.org/pipermail/python-dev/2003-October/thread.html If there was ever a module which *could* have proven itself as a third-party library on PyPI, it was probably Decimal. It adds an entire new numeric class, one with significant advantages (and some disadvantages) over binary floats, not just a couple of lines of code. Re-inventing the wheel is impractical: few people have the numeric know-how to duplicate that wheel, and for those who can, it would take a massive amount of effort: the Python version is over 6000 lines (including blanks and comments). If you want a Decimal type, it isn't practical to write one yourself.
If this change is going to prove it's tall enough to ride the stdlib ride, using a constructor for a derived class rather than str literal syntax shouldn't be too big a barrier to judging popularity (accounting for the annoyance of a constructor).
There's little difference between writing MyDecimal("1.2") versus Decimal("1.2"), but there's a huge annoyance factor in having to write MyString("hello world") instead of "hello world". Especially when all you want is to add a single new method and instead you have to override a dozen or more methods to return instances of MyString. And then you pass it to some function or library, and it returns a regular string again. So you're constantly playing whack-a-mole trying to discover why your MyString subclass objects are turning into regular built-in strings when you least expect it. Forget it. That's a serious PITA.
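The whack-a-mole effect described above can be reproduced with a small sketch. The class MyString and its cut_prefix method are hypothetical names used only for illustration; they are not part of any proposal in this thread.

```python
class MyString(str):
    """Hypothetical str subclass that adds a single method."""

    def cut_prefix(self, prefix):
        # Remove one leading occurrence of prefix, if present.
        if prefix and self.startswith(prefix):
            return MyString(self[len(prefix):])
        return self


s = MyString("mailto:maria@gmail.com")

# The overridden method keeps the subclass...
print(type(s.cut_prefix("mailto:")).__name__)

# ...but every inherited method hands back a plain built-in str.
print(type(s.upper()).__name__)
print(type(s[7:]).__name__)
print(type(s.replace("mailto:", "")).__name__)
```

Keeping the subclass alive everywhere would mean overriding each of those inherited methods as well, which is the cost being objected to.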
Alternatively, the features could be introduced using functions.
We specifically added a str class with methods to get away from the functions in the string module, and you want to bring them back? I think the bar for adding string functions into the string module should be much higher than adding a couple of lightweight methods.
[1] Actually, they can. As we know from the transition from 2 to 3, there is often a perfectly viable solution for libraries that want to support old versions. Here is some actual code taken from one of my modules which works back to Python 2.4:
    try:
        casefold = str.casefold  # Added in 3.3 (I think).
    except AttributeError:
        # Fall back version is not as good, but is good enough.
        casefold = str.lower
So even libraries that support Python 2 can get the advantage of an accelerated C method by using this technique with a fallback to whatever they are currently using:
    try:
        lcut = str.cutprefix
    except AttributeError:
        # Fall back to pure Python version.
        def lcut(astring, prefix):
            ...
-- Steven (the other one)
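For readers who want to try the fallback pattern, here is one way the elided function body might be filled in. The method name str.cutprefix is the hypothetical name used in this thread, and the fallback body is an assumption about the intended single-prefix semantics, not code from the original message.

```python
try:
    # Hypothetical method name from this thread; used if it ever exists.
    lcut = str.cutprefix
except AttributeError:
    def lcut(astring, prefix):
        # Pure-Python fallback: remove the prefix once, only if present.
        if prefix and astring.startswith(prefix):
            return astring[len(prefix):]
        return astring


print(lcut("mailto:maria@gmail.com", "mailto:"))
print(lcut("maria@gmail.com", "mailto:"))
```

Either branch leaves callers with a plain two-argument function, so the calling code does not need to know which one was picked.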
On 3/25/2019 6:22 AM, Jonathan Fine wrote:
Instead of naming these operations, we could use '+' and '-', with semantics:
# Set the values of the variables. >>> a = 'hello ' >>> b = 'world' >>> c = 'hello world'
# Some values between the variables. >>> a + b == c True >>> a == c - b True >>> b == -a + c True
Summary: using '-' for trimming works well for postfixes, badly for prefixes, and not at all for infixes. Clever but not too practical since trimming prefixes seems to be more common than trimming postfixes. -- Terry Jan Reedy
Thank you, this is a simple, unambiguous proposal which meets a real need and will help prevent a lot of wasted developer time misusing [lr]strip and then reporting it as a bug: remove a single prefix or suffix. This is a useful string primitive provided by other modern languages and libraries, including Go, Ruby, Kotlin, and the Apache StringUtils Java library: https://golang.org/pkg/strings/#TrimPrefix https://golang.org/pkg/strings/#TrimSuffix https://ruby-doc.org/core-2.5.1/String.html#method-i-delete_prefix https://ruby-doc.org/core-2.5.1/String.html#method-i-delete_suffix https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/remove-prefix.html https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/remove-suffix.html https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/la... https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/la... Regarding later proposals to add support for multiple affixes, to recursively delete the affix repeatedly, and to take an additional argument to limit how many affixes will be removed: YAGNI. Let's not over-engineer this to be something which is ambiguous and complex. We can add more complexity later, if and when practical experience suggests: (1) that multiple affixes actually are useful in practice, not just a "Wouldn't It Be Cool???" feature; and (2) a consensus as to how to handle ambiguous cases. Until then, let's keep it simple: methods to remove a *single* prefix or suffix, precisely as given. Anything else is YAGNI and is best left for the individual programmer. -- Steven
On Sun, Mar 31, 2019, 8:11 PM Steven D'Aprano <steve@pearwood.info> wrote:
Regarding later proposals to add support for multiple affixes, to recursively delete the affix repeatedly, and to take an additional argument to limit how many affixes will be removed: YAGNI.
That's simply not true, and I think it's clearly illustrated by the example I gave a few times. Not just conceivably, but FREQUENTLY I write code to accomplish the effect of the suggested: basename = fname.rstrip(('.jpg', '.gif', '.png')) I probably do this MORE OFTEN than removing a single suffix. Obviously I *can* achieve this result now. I probably take a slightly different approach as the mood strikes me, with three or four different styles I've used. Actually, I've probably never done it in a way that wouldn't be subtly wrong for cases like 'base.jpg.gif.png.jpg.gif'.
On Sun, Mar 31, 2019 at 08:23:05PM -0400, David Mertz wrote:
On Sun, Mar 31, 2019, 8:11 PM Steven D'Aprano <steve@pearwood.info> wrote:
Regarding later proposals to add support for multiple affixes, to recursively delete the affix repeatedly, and to take an additional argument to limit how many affixes will be removed: YAGNI.
That's simply not true, and I think it's clearly illustrated by the example I gave a few times. Not just conceivably, but FREQUENTLY I write code to accomplish the effect of the suggested:
basename = fname.rstrip(('.jpg', '.gif', '.png'))
I probably do this MORE OFTEN than removing a single suffix.
Okay. Yesterday, you stated that you didn't care what the behaviour was for the multiple affix case. You made it clear that "any" semantics would be okay with you so long as it was documented. You seemed to feel so strongly about your indifference that you mentioned it in two separate emails. That doesn't sound like someone who has a clear use-case in mind. If you're doing this frequently, then surely one of the following two alternatives applies: (1) One specific behaviour makes sense for all or a majority of your use-cases, in which case you would prefer that behaviour rather than something that you can't use. (2) Or there is no single useful behaviour that you want, perhaps all or a majority of your use-cases are different, and you'll usually need to write your own helper function to suit your own usage, no matter what the builtin behaviour is. Hence you don't care what the builtin behaviour is. Since you have no preferred behaviour, either you don't do this often enough to care (but above you say differently), or you are going to have to write your own helpers because the behaviour you need won't match the behaviour of the builtin. And you clearly don't mind this, because you stated twice that you don't care what the builtin behaviour is. So why rush to handle the multiple argument case? "YAGNI" is a misnomer, because it doesn't actually mean "you aren't (ever) going to need it". It means (generic) you don't need it *now*, but when you do, you can come back and revisit the design with concrete use-cases in mind. That's all I'm saying. For 29 years, we've done without this string primitive, and as a consequence the forums are full of examples of people misusing strip and getting it wrong. There's a clear case for the single argument version, and fixing that is the 90% solution. In comparison, we've been discussing this multiple affix feature for, what, a week?
Lacking a good set of semantics for removing multiple affixes at once, we shouldn't rush to guess what people want. You don't even know what behaviour YOU want, let alone what the community as a whole needs. You won't be any worse off than you are now. You'll probably be better off, because you can use the single-affix version as the basic primitive, and build on top of that, instead of the incorrect version you currently use in an ad hoc manner:
    basename = fname.split(".ext")[0]  # replace with fname.cut_suffix(".ext")
Others have already pointed out why the split version is incorrect. For the use-case of stripping a single file extension out of a set of such extensions, while leaving all others, there's an obvious solution:
    if fname.endswith(('.jpg', '.png', '.gif')):
        basename = os.path.splitext(fname)[0]
    else:
        # Any other extension stays with the base.
        # (Presumably to be handled separately?)
        basename = fname
But a more general solution needs to decide on two issues:
- given two affixes where one is an affix of the other, which wins? e.g.
    "abcd".cut_prefix(("a", "ab"))  # should this return "bcd" or "cd"?
- once you remove an affix, should you stop processing or continue?
    "ab".cut_prefix(("a", "b"))  # should this return "b" or ""?
The startswith and endswith methods don't suffer from this problem, for obvious reasons. We shouldn't add a problematic, ambiguous feature just for consistency with methods where it is not problematic or ambiguous. I posted links to prior art. Unless I missed something, not one of those languages or libraries supports multiple affixes in the one call. Don't let the perfect be the enemy of the good. In this case, a 90% solution will let us fix real problems and meet real needs, and we can always revisit the multiple affix case once we have more experience and have time to build a consensus based on actual use-cases. -- Steven
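The first of the two open questions can be made concrete with a short sketch. Both helper functions below are hypothetical illustrations of candidate semantics ("first listed match wins" versus "longest match wins"); neither is claimed to be what any proposed method would do.

```python
def cut_prefix_first(s, prefixes):
    """Remove the first listed prefix that matches, then stop."""
    for p in prefixes:
        if p and s.startswith(p):
            return s[len(p):]
    return s


def cut_prefix_longest(s, prefixes):
    """Remove the longest matching prefix, then stop."""
    matches = [p for p in prefixes if p and s.startswith(p)]
    if not matches:
        return s
    return s[len(max(matches, key=len)):]


print(cut_prefix_first("abcd", ("a", "ab")))    # bcd
print(cut_prefix_longest("abcd", ("a", "ab")))  # cd

# When no candidate is a prefix of another, the two rules agree.
print(cut_prefix_first("https://x", ("http://", "https://")))
print(cut_prefix_longest("https://x", ("http://", "https://")))
```

The two rules only diverge when one candidate is itself a prefix of another, and with "first wins" the answer also depends on the order in which the caller lists the candidates.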
On Sun, Mar 31, 2019, 9:35 PM Steven D'Aprano <steve@pearwood.info> wrote:
That's simply not true, and I think it's clearly illustrated by the example I gave a few times. Not just conceivably, but FREQUENTLY I write code to accomplish the effect of the suggested:
basename = fname.rstrip(('.jpg', '.gif', '.png'))
I probably do this MORE OFTEN than removing a single suffix.
Okay.
Yesterday, you stated that you didn't care what the behaviour was for the multiple affix case. You made it clear that "any" semantics would be okay with you so long as it was documented. You seemed to feel so strongly about your indifference that you mentioned it in two separate emails.
Yes. Because the multiple affix is an edge case that will rarely affect any of my code. I.e. I don't care much when a single string has multiple candidate affixes, because that's just not a common situation. That doesn't mean I'm indifferent to the core purpose that I need frequently. Any of the several possible behaviors in the edge case will not affect my desired usage whatsoever.
That doesn't sound like someone who has a clear use-case in mind. If you're doing this frequently, then surely one of the following two alternatives apply:
I don't think I've ever written code that cares about the edge case you focus on. Ok, I guess technically the code I've written is all buggy in the sense that it would behave in a manner I haven't thought through when presented with weird input. Perhaps I should always have been more careful about those edges. There simply is no "majority of the time" for a situation I've never specifically coded for. The rest gets more and more sophistical. I'm sure most people here have written code similar to this (maybe structured differently, but same purpose): for fname in filenames: basename, ext = fname.rsplit('.', 1) if ext in {'jpg', 'gif', 'png'}: do_stuff(basename) In all the times I've written things close to that, I've never thought about files named 'silly.jpg.gif.png.gif.jpg'. The sophistry is insistently asking "but what about...?" of this edge case.
For 29 years, we've done without this string primitive, and as a consequence the forums are full of examples of people misusing strip and getting it wrong.
It's interesting that you keep raising this error. I've made a whole lot of silly mistakes in Python (and other languages). I have never for a moment been tempted to think .rstrip() would remove a suffix rather than a character class. I did write the book Text Processing in Python a very long time ago, so I've thought a bit about text processing in Python. Maybe it's just that I'm comfortable enough with regexen that thinking of a character class doesn't feel strange to me.
There's a clear case for the single argument version, and fixing that is the 90% solution.
I think there's very little case for a single argument version. At best, it's a 10% solution.
Lacking a good set of semantics for removing multiple affixes at once, we shouldn't rush to guess what people want. You don't even know what behaviour YOU want, let alone what the community as a whole needs.
This is both dumb and dishonest. There are basically two choices, both completely clear. I think the more obvious one is to treat several prefixes or suffixes as substring class, much as .[rl]strip() does character class. But another choice indeed is to remove at most one of the affixes. I think that's a little bit less good for the edge case. But it would be fine also... and as I keep writing, the difference would almost always be moot, it just needs to be documented.
For the use-case of stripping a single file extension out of a set of such extensions, while leaving all others, there's an obvious solution:
if fname.endswith(('.jpg', '.png', '.gif')): basename = os.path.splitext(fname)[0]
I should probably use os.path.splitext() more than I do. But that's just an example. Another is, e.g. 'if url.startswith(('http://', 'sftp://', 's3://')): ...'. And lots of similar things that aren't addressed by os.path.splitext(). E.g. 'if logline.startswith(('WARNING', 'ERROR')): ...'
I posted links to prior art. Unless I missed something, not one of those languages or libraries supports multiple affixes in the one call.
Also, none of those languages support the amazingly useful signature of str.startswith(tuple). Well, they do in the sense they support regexen. But not as a standard method or function on strings. I don't even know if PHP with its 5000 string functions has this great convenience.
On 1 Apr 2019, at 04:58, David Mertz <mertz@gnosis.cx> wrote:
Lacking a good set of semantics for removing multiple affixes at once, we shouldn't rush to guess what people want. You don't even know what behaviour YOU want, let alone what the community as a whole needs.
This is both dumb and dishonest. There are basically two choices, both completely clear. I think the more obvious one is to treat several prefixes or suffixes as substring class, much as .[rl]strip() does character class.
Please don't say "dumb and dishonest". Especially not when you directly follow up by radically redefining what you want. To get the same semantics as strip() it must follow that "foofoobarfoo".without_suffix(("foo", "bar")) == "" / Anders
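The difference between the two readings can be shown side by side. The function names below are hypothetical, and both bodies are only one plausible reading of "treat the suffixes as a class, as strip() does" versus "remove at most one".

```python
def without_suffix_class(s, suffixes):
    """Keep removing any listed suffix until none matches (strip-like)."""
    suffixes = tuple(x for x in suffixes if x)  # ignore empty strings
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            if s.endswith(suf):
                s = s[:-len(suf)]
                changed = True
    return s


def without_suffix_once(s, suffixes):
    """Remove at most one suffix: the first listed one that matches."""
    for suf in suffixes:
        if suf and s.endswith(suf):
            return s[:-len(suf)]
    return s


print(repr(without_suffix_class("foofoobarfoo", ("foo", "bar"))))  # ''
print(repr(without_suffix_once("foofoobarfoo", ("foo", "bar"))))   # 'foofoobar'
```

Under the strip-like reading the whole string disappears, which is the consequence being pointed out above.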
On Mon, Apr 1, 2019 at 12:35 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Mar 31, 2019 at 08:23:05PM -0400, David Mertz wrote:
On Sun, Mar 31, 2019, 8:11 PM Steven D'Aprano <steve@pearwood.info> wrote:
Regarding later proposals to add support for multiple affixes, to recursively delete the affix repeatedly, and to take an additional argument to limit how many affixes will be removed: YAGNI.
That's simply not true, and I think it's clearly illustrated by the example I gave a few times. Not just conceivably, but FREQUENTLY I write code to accomplish the effect of the suggested:
basename = fname.rstrip(('.jpg', '.gif', '.png'))
I probably do this MORE OFTEN than removing a single suffix.
Okay.
Yesterday, you stated that you didn't care what the behaviour was for the multiple affix case. You made it clear that "any" semantics would be okay with you so long as it was documented. You seemed to feel so strongly about your indifference that you mentioned it in two separate emails.
The multiple affix case has exactly two forms: 1) Tearing multiple affixes off (eg stripping "asdf.jpg.png" down to just "asdf"), which most people are saying "no, don't do that, it doesn't make sense and isn't needed" 2) Removing one of several options, which implies that one option is a strict subpiece of another (eg stripping off "test" and "st") If anyone is advocating for #1, I would agree with saying YAGNI. But #2 is an extremely unlikely edge case, and whatever semantics are chosen for it, *normal* usage will not be affected. In the example that David gave, there is no way for "first wins" or "longest wins" or anything like that to make any difference, because it's impossible for there to be multiple candidates. Since this would be going into the language as a feature, the semantics have to be clearly defined (with "first match wins", "longest match wins", and "raise exception" being probably the most plausible options), but most of us aren't going to care which one is picked.
That doesn't sound like someone who has a clear use-case in mind. If you're doing this frequently, then surely one of the following two alternatives apply:
(1) One specific behaviour makes sense for all or a majority of your use-cases, in which case you would prefer that behaviour rather than something that you can't use.
(2) Or there is no single useful behaviour that you want, perhaps all or a majority of your use-cases are different, and you'll usually need to write your own helper function to suit your own usage, no matter what the builtin behaviour is. Hence you don't care what the builtin behaviour is.
Or all the behaviours actually do the same thing anyway.
Lacking a good set of semantics for removing multiple affixes at once, we shouldn't rush to guess what people want. You don't even know what behaviour YOU want, let alone what the community as a whole needs.
We're basically debating collision semantics here. It's on par with asking "how should statistics.mode() cope with multiple modes?". Should the introduction of statistics.mode() have been delayed pending a thorough review of use-cases, or is it okay to make it do what most people want, and then be prepared to revisit its edge-case handling? (For those who don't know, mode() was changed in 3.8 to return the first mode encountered, in contrast to previous behaviour where it would raise an exception.)
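The mode() comparison is easy to check interactively. This sketch assumes Python 3.8 or later for the "first mode encountered" behaviour; on earlier versions the same call raises statistics.StatisticsError, which the except clause reports instead.

```python
import statistics

data = [1, 1, 2, 2, 3]  # two values tie for most common

try:
    result = statistics.mode(data)
    print("mode:", result)
except statistics.StatisticsError as exc:
    # Behaviour before 3.8: multimodal data is an error.
    result = None
    print("no unique mode:", exc)
```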
For the use-case of stripping a single file extension out of a set of such extensions, while leaving all others, there's an obvious solution:
if fname.endswith(('.jpg', '.png', '.gif')): basename = os.path.splitext(fname)[0] else: # Any other extension stays with the base. # (Presumably to be handled separately?) basename = fname
Sure, but I've often wanted to do something like "strip off a prefix of http:// or https://", or something else that doesn't have a semantic that's known to the stdlib. Also, this is still fairly verbose, and a lot of people are going to reach for a regex, just because it can be done in one line of code.
I posted links to prior art. Unless I missed something, not one of those languages or libraries supports multiple affixes in the one call.
And they don't support multiple affixes in startswith/endswith either, but we're very happy to have that in Python. The parallel is strong. You ask if it has a prefix, you remove the prefix. You ask if it has multiple prefixes, you remove any one of those prefixes. We don't have to worry about edge cases that are unlikely to come up in real-world code, just as long as the semantics ARE defined somewhere. ChrisA
On Mon, Apr 01, 2019 at 02:29:44PM +1100, Chris Angelico wrote:
The multiple affix case has exactly two forms:
1) Tearing multiple affixes off (eg stripping "asdf.jpg.png" down to just "asdf"), which most people are saying "no, don't do that, it doesn't make sense and isn't needed"
Perhaps I've missed something obvious (it's been a long thread, and I'm badly distracted with hardware issues that are causing me some considerable grief), but I haven't seen anyone say "don't do that". But I have seen David Mertz say that this was the best behaviour: [quote] fname = 'silly.jpg.png.gif.png.jpg.gif.jpg' I'm honestly not sure what behavior would be useful most often for this oddball case. For the suffixes, I think "remove them all" is probably the best [end quote] I'd also like to point out that this is not an oddball case. There are two popular platforms where file extensions are advisory not mandatory (Linux and Mac), but even on Windows it is possible to get files with multiple, meaningful, extensions (foo.tar.gz for example) as well as periods used in place of spaces (a.funny.cat.video.mp4).
2) Removing one of several options, which implies that one option is a strict subpiece of another (eg stripping off "test" and "st")
I take it you're only referring to the problematic cases, because there's the third option, where none of the affixes to be removed clash: spam.cut_suffix(("ed", "ing")) But that's pretty uninteresting and a simple loop or repeated call to the method will work fine: spam.cut_suffix("ed").cut_suffix("ing") just as we do with replace: spam.replace(",", "").replace(" ", "") If you only have a few affixes to work with, this is fine. If you have a lot, you may want a helper function, but that's okay.
If anyone is advocating for #1, I would agree with saying YAGNI.
David Mertz did.
But #2 is an extremely unlikely edge case, and whatever semantics are chosen for it, *normal* usage will not be affected.
Not just unlikely, but "extremely" unlikely? Presumably you didn't just pluck that statement out of thin air, but have based it on an objective and statistically representative review of existing code and projections of future uses of these new methods. How could I possibly argue with that? Except to say that I think it is recklessly irresponsible for people engaged in language design to dismiss edge cases which will cause users real bugs and real pain so easily. We're not designing for our personal toolbox, we're designing for hundreds of thousands of other people with widely varying needs. It might be rare for you, but for somebody it will be happening ten times a day. And for somebody else, it will only happen once a year, but when it does, their code won't raise an exception it will just silently do the wrong thing. This is why replace does not take a set of multiple targets to replace. The user, who knows their own use-case and what behaviour they want, can write their own multiple-replace function, and we don't have to guess what they want. The point I am making is not that we must not ever support multiple affixes, but that we shouldn't rush that decision. Let's pick the low-hanging fruit, and get some real-world experience with the function before deciding how to handle the multiple affix case. [...]
Or all the behaviours actually do the same thing anyway.
In this thread, I keep hearing this message: "My own personal use-case will never be affected by clashing affixes, so I don't care what behaviour we build into the language, so long as we pick something RIGHT NOW and don't give the people actually affected time to use the method and decide what works best in practice for them." Like for the str.replace method, the final answer might be "there is no best behaviour and we should refuse to choose". Why are we rushing to permanently enshrine one specific behaviour into the builtins before any of the users of the feature have a chance to use it and decide for themselves which suits them best? Now is better than never. Although never is often better than *right* now. Somebody (I won't name names, but they know who they are) wrote to me off-list some time ago and accused me of being arrogant and thinking I know more than everyone else. Well perhaps I am, but I'm not so arrogant as to think that I can choose the right behaviour for clashing affixes for other people when my own use-cases don't have clashing affixes. [...]
Sure, but I've often wanted to do something like "strip off a prefix of http:// or https://", or something else that doesn't have a semantic that's known to the stdlib.
I presume there's a reason you aren't using urllib.parse and you just need a string without the leading scheme. If you're doing further parsing, the stdlib has the right batteries for that. (Aside: perhaps urllib.parse.ParseResult should get an attribute to return the URL minus the scheme? That seems like it would be useful.)
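As a rough illustration of the stdlib route, urllib.parse.urlsplit already separates the scheme from the rest. The URL below is a made-up example, and blanking the scheme with _replace is just one possible workaround for "the URL minus the scheme", not an existing dedicated attribute.

```python
from urllib.parse import urlsplit

url = "https://example.com/path?q=1"  # made-up example URL
parts = urlsplit(url)

print(parts.scheme)  # https
print(parts.netloc)  # example.com
print(parts.path)    # /path

# One possible workaround: blank the scheme and re-join the pieces.
no_scheme = parts._replace(scheme="").geturl()
print(no_scheme)
```

The re-joined result keeps the leading "//" of the network location, so it is not quite the same string that a plain prefix removal would give.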
Also, this is still fairly verbose, and a lot of people are going to reach for a regex, just because it can be done in one line of code.
Okay, they will use a regex. Is this a problem? We're not planning on banning regexes, are we? If they're happy using regexes, and don't care that it will be perhaps 3 times slower, let them.
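For comparison, the regex one-liner people are likely to reach for might look like the sketch below. The function name strip_http_scheme is hypothetical, and no timing claim is made here.

```python
import re

# Anchored at the start so only a leading scheme is removed.
_SCHEME = re.compile(r"^https?://")


def strip_http_scheme(url):
    """Remove a leading http:// or https://, if present."""
    return _SCHEME.sub("", url, count=1)


print(strip_http_scheme("https://example.com/a"))    # example.com/a
print(strip_http_scheme("http://example.com/a"))     # example.com/a
print(strip_http_scheme("ftp://example.com/http://"))  # unchanged
```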
I posted links to prior art. Unless I missed something, not one of those languages or libraries supports multiple affixes in the one call.
And they don't support multiple affixes in startswith/endswith either, but we're very happy to have that in Python.
But not until we had a couple of releases of experience with them: https://docs.python.org/2.7/library/stdtypes.html#str.endswith And .replace still only takes a single target to be replaced. [...]
We don't have to worry about edge cases that are unlikely to come up in real-world code,
And you are making that pronouncement on the basis of what? Your gut feeling? Perhaps you're thinking too narrowly. Here's a partial list of English prefixes that somebody doing text processing might want to remove to get at the root word: a an ante anti auto circum co com con contra contro de dis en ex extra hyper il im in ir inter intra intro macro micro mono non omni post pre pro sub sym syn tele un uni up I count fourteen clashes: a: an ante anti an: ante anti co: com con contra contro ex: extra in: inter intra intro un: uni (That's over a third of this admittedly incomplete list of prefixes.) I can think of at least one English suffix pair that clash: -ify, -fy. How about other languages? How comfortable are you to say that nobody doing text processing in German or Hindi will need to deal with clashing affixes? -- Steven
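Clashes of this kind can also be found mechanically. The sketch below uses the prefix list exactly as given above; the function name find_clashes is hypothetical. A mechanical check may also turn up pairs that are easy to miss when counting by hand, such as con/contra.

```python
PREFIXES = """a an ante anti auto circum co com con contra contro de dis en ex
extra hyper il im in ir inter intra intro macro micro mono non omni post pre
pro sub sym syn tele un uni up""".split()


def find_clashes(prefixes):
    """Map each prefix to the longer listed prefixes that begin with it."""
    clashes = {}
    for short in prefixes:
        longer = [p for p in prefixes if p != short and p.startswith(short)]
        if longer:
            clashes[short] = longer
    return clashes


for short, longer in find_clashes(PREFIXES).items():
    print(f"{short}: {' '.join(longer)}")
```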
On Tue, Apr 2, 2019 at 11:53 AM Steven D'Aprano <steve@pearwood.info> wrote:
The point I am making is not that we must not ever support multiple affixes, but that we shouldn't rush that decision. Let's pick the low-hanging fruit, and get some real-world experience with the function before deciding how to handle the multiple affix case.
I still haven't seen anyone actually give a good reason for not going with "first wins", other than a paranoia that we don't have any real-world use-cases. And there are PLENTY of real-world use-cases where any semantics will have the same effect, and only a few where it would be at all important (and in all of those, "first wins" has been the correct semantic). By saying "let's add the method, but not give it all the power yet", you just create more version problems. "Oh, so I can use cutprefix back as far as 3.8, but if I use more than one prefix, now I have to say that this requires 3.9." Why not just give it the full power straight away? Are you actually expecting to find enough use-cases where "longest wins" or some other definition will be better? You can debate whether it's "extremely unlikely" to matter or it's "reasonably common" or whatever, but unless it ever matters AND has to be something other than first-match-wins, there's no reason not to lock in those semantics. ChrisA
On Mon, Apr 1, 2019, 8:54 PM Steven D'Aprano <steve@pearwood.info> wrote:
The point I am making is not that we must not ever support multiple affixes, but that we shouldn't rush that decision. Let's pick the low-hanging fruit, and get some real-world experience with the function before deciding how to handle the multiple affix case.
There are exactly two methods of strings that deal specifically with affixes currently. Startswith and endswith. Both of those allow specifying multiple affixes. That's pretty strong real-world experience, and breaking the symmetry for no reason is merely confusing. Especially since the consistency would be obviously as commonly useful. Now look, the sky won't fall if a single-affix-only method is added. For that matter, it won't if nothing is added. In fact, the single affix version makes it a little bit easier to write a custom function handling multiple affixes. And the sky won't fall if the remove-just-one semantics are used rather than remove-from-class. But adding methods with sneakily helpful capabilities often helps users greatly. A lot of folks in this thread didn't even know about passing a tuple to str.startswith() a few days ago. I'm pretty sure that capability was added by Raymond, who has an amazingly good sense of what little tricks can prove really powerful. Apologies to a different developer if it wasn't him, but congrats and thanks to you if so.
Somebody (I won't name names, but they know who they are) wrote to me off-list some time ago and accused me of being arrogant and thinking I know more than everyone else. Well perhaps I am, but I'm not so arrogant as to think that I can choose the right behaviour for clashing affixes for other people when my own use-cases don't have clashing affixes.
That could be me... Unless it's someone else :-). I think my intent was a bit different than you characterize, but I'm very guilty of presuming too much also. So mea culpa.
Sure, but I've often wanted to do something like "strip off a prefix of http:// or https://", or something else that doesn't have a semantic that's known to the stdlib.
I presume there's a reason you aren't using urllib.parse and you just need a string without the leading scheme. If you're doing further parsing, the stdlib has the right batteries for that.
I know there are lots of specialized string manipulations in the STDLIB. Yeah, I could use os.path.splitext, and os.path.split, and urllib.parse.something, and lots of other things I rarely use. A lot of us like to manipulate strings in generically stringy ways.
But not until we had a couple of releases of experience with them: https://docs.python.org/2.7/library/stdtypes.html#str.endswith
Ok. Fair point. I used Python 2.4 without the multiple affix option.
Here's a partial list of English prefixes that somebody doing text processing might want to remove to get at the root word:
a an ante anti auto circum co com con contra contro de dis en ex extra hyper il im in ir inter intra intro macro micro mono non omni post pre pro sub sym syn tele un uni up
I count fourteen clashes:
a: an ante anti an: ante anti co: com con contra contro ex: extra in: inter intra intro un: uni
This seems like a good argument for remove-all-from-class. :-) stem = word.lstrip(prefix_tup) But then we really need 'word.porter_stemmer()' as a built-in method.
On 4/1/19 9:34 PM, David Mertz wrote:
On Mon, Apr 1, 2019, 8:54 PM Steven D'Aprano <steve@pearwood.info> wrote:
The point I am making is not that we must not ever support multiple affixes, but that we shouldn't rush that decision. Let's pick the low-hanging fruit, and get some real-world experience with the function before deciding how to handle the multiple affix case.
There are exactly two methods of strings that deal specifically with affixes currently. Startswith and endswith. Both of those allow specifying multiple affixes. That's pretty strong real-world experience, and breaking the symmetry for no reason is merely confusing. Especially since the consistency would be obviously as commonly useful.
My imagination is failing me: for multiple affixes (affices?), what is a use case for removing one, but not having the function return which one? In other words, shouldn't a function that removes multiple affixes also return which one(s) were removed? I think I'm agreeing with Steven: take the low hanging fruit now, and worry about complexification later (because I'm not sure that the existing API is good when removing multiple affixes). Stemming is hard, because a lot of words begin/end with common affixes, but that string of letters isn't always an affix. For example, removing common prefixes from "relay" leaves "lay," but that's not the root; similarly with "relax" and "area." If my algorithm is "look for the word in a list of known words, if it's there then great, but if it's not then remove one affix and try again," then I don't want to remove all the affixes at once. When removing extensions from filenames, all of my use cases involve removing one at a time and acting on the one that was removed. For example, decompressing foo.tar.gz into foo.tar, and then untarring foo.tar into foo. I suppose I can imagine removing tar.gz and then decompressing and untarring in one step, but again, then I have to know which suffixes were removed. Or maybe I could process foo.tar.gz and want to end up with foo.norm (bonus points for recognizing the XKCD reference), but my personal preference would still be to produce foo.tar.gz.norm by default and let the user specify the ultimate filename if they want something else. So I've seen someone (likely David Mertz?) ask for something like filename.strip_suffix(('.png', '.jpg')). What is the context? Is it strictly a filename processing program? Do you subsequently have to determine the suffix(es) at hand?
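A helper that reports which suffix it removed could look roughly like this. The name split_suffix and its (stem, removed) return shape are assumptions made for illustration; they mirror the "remove one at a time and act on it" workflow described above.

```python
def split_suffix(s, suffixes):
    """Return (stem, removed); removed is "" when no suffix matched."""
    for suf in suffixes:
        if suf and s.endswith(suf):
            return s[:-len(suf)], suf
    return s, ""


name = "foo.tar.gz"
while True:
    name, removed = split_suffix(name, (".gz", ".tar"))
    if not removed:
        break
    # A real program would decompress or untar here, depending on `removed`.
    print(f"handled {removed!r}, now have {name!r}")
```

Returning the removed suffix lets the caller dispatch on it without a second endswith() test.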
On Mon, Apr 1, 2019 at 10:11 PM Dan Sommers <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
So I've seen someone (likely David Mertz?) ask for something like filename.strip_suffix(('.png', '.jpg')). What is the context? Is it strictly a filename processing program? Do you subsequently have to determine the suffix(es) at hand?
Yes, I've sometimes wanted something like "the basename of all the graphic files in that directory." But here's another example from my actual current job: I do machine-learning/data science for a living. As part of that, I generate a bunch of models that try to make predictions from the same dataset. So I name those models like:

    dataset1.KNN_distance_n10.gz
    dataset1.KNN_distance_n10_poly2_scaled.xz
    dataset2.KNN_manhattan_n6.zip
    dataset2.KNN_distance_n10_poly2_scaled.xz
    dataset1.KNN_minkowski_n5.gz
    dataset1.LinSVC_Poly3_Scaled.gz
    dataset2.LogReg.bz2
    dataset2.LogReg_Poly.gz
    dataset1.NuSVC_poly2_scaled.gz

I would like to answer the question "What types of models have I tried against the datasets?" Obviously, I *can* answer this question. But it would be pleasant to answer it like this:

    styles = {model.lstrip(('dataset1', 'dataset2'))
                   .rstrip(('gz', 'xz', 'zip', 'bz2'))
              for model in models}

That's something very close to code I actually have in production now.

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
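The wished-for multi-affix lstrip/rstrip does not exist on str, so the set comprehension above can only be emulated. A minimal sketch follows, assuming hypothetical helpers cut_any_prefix and cut_any_suffix that remove the first listed affix that matches, at most once. The dots are included in the affixes here (an assumption) so that the result is the bare model name.

```python
def cut_any_prefix(s, prefixes):
    # Hypothetical stand-in: remove the first listed prefix that matches.
    for prefix in prefixes:
        if s.startswith(prefix):
            return s[len(prefix):]
    return s


def cut_any_suffix(s, suffixes):
    # Hypothetical stand-in: remove the first listed suffix that matches.
    for suffix in suffixes:
        if suffix and s.endswith(suffix):
            return s[:len(s) - len(suffix)]
    return s


# A few of the file names from the message above.
models = [
    "dataset1.KNN_distance_n10.gz",
    "dataset2.KNN_distance_n10_poly2_scaled.xz",
    "dataset1.KNN_distance_n10_poly2_scaled.xz",
    "dataset2.LogReg.bz2",
]

styles = {
    cut_any_suffix(cut_any_prefix(m, ("dataset1.", "dataset2.")),
                   (".gz", ".xz", ".zip", ".bz2"))
    for m in models
}
```

Because none of these affixes is a prefix or suffix of another, the order in which they are listed does not affect the result here.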
On Mon, Apr 01, 2019 at 09:34:21PM -0400, David Mertz wrote:
On Mon, Apr 1, 2019, 8:54 PM Steven D'Aprano <steve@pearwood.info> wrote:
The point I am making is not that we must not ever support multiple affixes, but that we shouldn't rush that decision. Let's pick the low-hanging fruit, and get some real-world experience with the function before deciding how to handle the multiple affix case.
There are exactly two methods of strings that deal specifically with affixes currently. Startswith and endswith. Both of those allow specifying multiple affixes.
When testing for the existence of a prefix (or suffix), there are no choices that need to be made for the multiple prefix case. If spam.startswith("contra"), then it also starts with "co", and we don't have to decide whether to delete six characters or two. If spam starts with one of ("de", "ex", "in", "mono"), then it doesn't matter what order we specify the tests, it will return True regardless. If you write a pure Python implementation of multiprefix startswith, there's one *obviously correct* version:

    def multi_startswith(astring, prefixes):
        return any(astring.startswith(prefix) for prefix in prefixes)

because it literally doesn't matter which of the prefixes triggered the match. I could randomize the order of the prefixes, and nothing would change. But if you delete the prefix, it matters a lot which prefix triggers the match.
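To make the asymmetry concrete, here is a small sketch. The function cut_first_match and its "first listed prefix wins" rule are assumptions chosen for illustration; the thread deliberately leaves that rule undecided.

```python
def multi_startswith(astring, prefixes):
    # Testing: the order of the prefixes cannot change the answer.
    return any(astring.startswith(prefix) for prefix in prefixes)


def cut_first_match(astring, prefixes):
    # Cutting (hypothetical rule): the first listed prefix that matches wins.
    for prefix in prefixes:
        if astring.startswith(prefix):
            return astring[len(prefix):]
    return astring


word = "contract"

test_a = multi_startswith(word, ("co", "contra"))
test_b = multi_startswith(word, ("contra", "co"))

cut_a = cut_first_match(word, ("co", "contra"))   # removes two characters
cut_b = cut_first_match(word, ("contra", "co"))   # removes six characters
```

Both tests agree, but the two cuts differ ('ntract' versus 'ct'), so any multi-prefix cutting API has to pick and document a precedence rule.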
That's pretty strong real-world experience
But not as strong as str.replace, which is much older than starts/endswith and still refuses to guess what the user expects to do with multiple substrings. -- Steven
On Mon, Apr 1, 2019 at 8:54 PM Steven D'Aprano <steve@pearwood.info> wrote:
I can think of at least one English suffix pair that clash: -ify, -fy. How about other languages? How comfortable are you to say that nobody doing text processing in German or Hindi will need to deal with clashing affixes?
Here are the 30 most common suffixes in a large list of Dutch words. For similar answers for other languages, see https://gist.github.com/DavidMertz/1a4aac0e889097d7bf80d8d41a3a644d. Note that there is absolutely nothing morphological here, simply dumb string literals: % head -30 suffix-frequency-nl.txt ('en', 55338) ('er', 14387) ('de', 12541) ('den', 11427) ('ten', 9402) ('te', 8263) ('ng', 7502) ('es', 7398) ('st', 7102) ('ing', 6949) ('gen', 6836) ('rs', 6592) ('ers', 5581) ('ren', 4842) ('el', 4602) ('ngen', 4451) ('rde', 4255) ('ken', 4203) ('re', 3870) ('je', 3868) ('len', 3784) ('ste', 3680) ('ie', 3658) ('nd', 3635) ('erde', 3620) ('rden', 3593) ('jes', 3307) ('eren', 3193) ('id', 3123) ('rd', 3083) -- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
In your case you should probably use [model.split(".")[1] for model in models]. strip_prefix should not be used for file extensions; methods for handling files already exist.
On Apr 2 2019, at 5:43 am, David Mertz <mertz@gnosis.cx> wrote:
On Mon, Apr 1, 2019 at 8:54 PM Steven D'Aprano <steve@pearwood.info> wrote:
I can think of at least one English suffix pair that clash: -ify, -fy. How about other languages? How comfortable are you to say that nobody doing text processing in German or Hindi will need to deal with clashing affixes?
Here are the 30 most common suffixes in a large list of Dutch words. For similar answers for other languages, see https://gist.github.com/DavidMertz/1a4aac0e889097d7bf80d8d41a3a644d. Note that there is absolutely nothing morphological here, simply dumb string literals: % head -30 suffix-frequency-nl.txt ('en', 55338) ('er', 14387) ('de', 12541) ('den', 11427) ('ten', 9402) ('te', 8263) ('ng', 7502) ('es', 7398) ('st', 7102) ('ing', 6949) ('gen', 6836) ('rs', 6592) ('ers', 5581) ('ren', 4842) ('el', 4602) ('ngen', 4451) ('rde', 4255) ('ken', 4203) ('re', 3870) ('je', 3868) ('len', 3784) ('ste', 3680) ('ie', 3658) ('nd', 3635) ('erde', 3620) ('rden', 3593) ('jes', 3307) ('eren', 3193) ('id', 3123) ('rd', 3083)
_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/
On 02/04/2019 01:52, Steven D'Aprano wrote:
Here's a partial list of English prefixes that somebody doing text processing might want to remove to get at the root word:
a an ante anti auto circum co com con contra contro de dis en ex extra hyper il im in ir inter intra intro macro micro mono non omni post pre pro sub sym syn tele un uni up
I count fourteen clashes:
a: an ante anti an: ante anti co: com con contra contro ex: extra in: inter intra intro un: uni
(That's over a third of this admittedly incomplete list of prefixes.)
I can think of at least one English suffix pair that clash: -ify, -fy.
You're beginning to persuade me that cut/trim methods/functions aren't a good idea :-) So far we have two slightly dubious use-cases. 1. Stripping file extensions. Personally I find that treating filenames like filenames (i.e. using os.path or (nowadays) pathlib) results in me thinking more appropriately about what I'm doing. 2. Stripping prefixes and suffixes to get to root words. Python has been used for natural language work for over a decade, and I don't think I've heard any great call from linguists for the functionality. English isn't a girl who puts out like that on a first date :-) There are too many common exception cases for such a straightforward approach not to cause confusion. 3. My most common use case (not very common at that) is for stripping annoying prompts off text-based APIs. I'm happy using .startswith() and string slicing for that, though your point about the repeated use of the string to be stripped off (or worse, hard-coding its length) is well made. I am beginning to worry slightly that actually there are usually more appropriate things to do than simply cutting off affixes, and that in providing these particular batteries we might be encouraging poor practise. -- Rhodri James *-* Kynesim Ltd
On Tue, 2 Apr 2019 at 12:07, Rhodri James <rhodri@kynesim.co.uk> wrote:
So far we have two slightly dubious use-cases.
1. Stripping file extensions. Personally I find that treating filenames like filenames (i.e. using os.path or (nowadays) pathlib) results in me thinking more appropriately about what I'm doing.
I'd go further and say that filename manipulation is a great example of a place where generic string functions should definitely *not* be used.
2. Stripping prefixes and suffixes to get to root words. Python has been used for natural language work for over a decade, and I don't think I've heard any great call from linguists for the functionality. English isn't a girl who puts out like that on a first date :-) There are too many common exception cases for such a straightforward approach not to cause confusion.
Agreed, using prefix/suffix stripping on natural language is at best a "quick hack". For robust usage, one of the natural language processing packages from PyPI is likely a far better fit. But "quick hacks" using the stdlib are not an unrealistic use case, so I don't think we should completely discount this. It's certainly not *compelling*, though.
3. My most common use case (not very common at that) is for stripping annoying prompts off text-based APIs. I'm happy using .startswith() and string slicing for that, though your point about the repeated use of the string to be stripped off (or worse, hard-coding its length) is well made.
I am beginning to worry slightly that actually there are usually more appropriate things to do than simply cutting off affixes, and that in providing these particular batteries we might be encouraging poor practise.
It would be really helpful if someone could go through the various use cases presented in this thread and classify them - filename manipulation, natural language uses, and "other". We could then focus on the "other" category to get a better feel for what use cases might act as a good argument for the feature. To me, it's starting to feel like a proposal that looks deceptively valuable because it's a "natural", or "obvious", addition to make, and there's a weight of people thinking of cases where they "might find it useful", but the reality is that many of those cases are not actually as good a fit for the feature as it seems at first glance. It would help the people in favour of the proposal to make their case if they could dispel that impression by giving a clearer summary of the expected use cases... Paul
On 2 Apr 2019, at 13:23, Paul Moore <p.f.moore@gmail.com> wrote:
It would be really helpful if someone could go through the various use cases presented in this thread and classify them - filename manipulation, natural language uses, and "other". We could then focus on the "other" category to get a better feel for what use cases might act as a good argument for the feature. To me, it's starting to feel like a proposal that looks deceptively valuable because it's a "natural", or "obvious", addition to make, and there's a weight of people thinking of cases where they "might find it useful", but the reality is that many of those cases are not actually as good a fit for the feature as it seems at first glance. It would help the people in favour of the proposal to make their case if they could dispel that impression by giving a clearer summary of the expected use cases...
I found two instances of strip_prefix in the code base I work on: stripping "origin/" from git branch names, and "Author:" to get the author from log output, again from git. A good place to look for examples is this: https://github.com/search?utf8=✓&q=%22strip_prefix%28%22+extension%3Apy+language%3APython+language%3APython&type=Code&ref=advsearch&l=Python&l=Python

A pattern that one sees quickly is that there are lots and lots of functions that strip a specific and hardcoded prefix. There's a lot of path manipulation too. And of course, there's an enormous amount of copy paste (jsfunctions.py is everywhere!). Some examples from the search above:

- Removing "file:" prefix: https://github.com/merijn/dotfiles/blob/43c736c73c5eda413dc7b4615bb679bd43a18d1a/dotfiles/hg-data/hooks/bitbucket.py#L16
- This is a strange one, which seems to strip different things? https://github.com/imperodesign/paas-tools/blob/649372762a18acefed0a24a970b93eb494529df9/deis/prd/controller/registry/tests.py#L99
- Removing "master.": https://github.com/mithro/chromium-build/blob/98d83e124dc08510756906171922a22ba27b87fa/scripts/tools/dump_master_cfg.py#L67
- Also not path: https://github.com/BlissRoms-x86/platform_external_swiftshader/blob/01c0db17f511badb921efc53981849cdacb82793/third_party/subzero/bloat/bloat.py#L212
- Removing "Re:" from email subject lines: https://github.com/emersion/python-emailthreads/blob/0a56af7fd6de16105c27b7c149eeb0282e95e587/emailthreads/util.py#L21
- Removing "MAILER_": https://github.com/vitalk/flask-mailer/blob/c724643f13e51d2e57546164e3e4abf9eb5d8097/flask_mailer/util.py#L30

I'm giving up now, because I got tired :)

/ Anders
Anders Hovmöller writes:
Removing "file:" prefix: https://github.com/merijn/dotfiles/blob/43c736c73c5eda413dc7b4615bb679bd43a1... <https://github.com/merijn/dotfiles/blob/43c736c73c5eda413dc7b4615bb679bd43a18d1a/dotfiles/hg-data/hooks/bitbucket.py#L16>
This is interesting, because it shows the (so far standard) one-liner: word[len(prefix):] if word.startswith(prefix) else word can be improved (?!) to word[len(prefix) if word.startswith(prefix) else 0:] I don't know if this is more readable, but I think it's less so. Note that version 1 doesn't copy word if it doesn't start with prefix, while version 2 does. In many applications I can think of the results would be accumulated in a set, and version 1 equality tests will also be faster in the frequent case that the word doesn't start with the prefix. So that's the one I'd go with, as I can't think of any applications where multiple copies of the same string would be useful. Steve
On 2019-04-02 19:10, Stephen J. Turnbull wrote:
Anders Hovmöller writes:
Removing "file:" prefix: https://github.com/merijn/dotfiles/blob/43c736c73c5eda413dc7b4615bb679bd43a1... <https://github.com/merijn/dotfiles/blob/43c736c73c5eda413dc7b4615bb679bd43a18d1a/dotfiles/hg-data/hooks/bitbucket.py#L16>
This is interesting, because it shows the (so far standard) one-liner:
word[len(prefix):] if word.startswith(prefix) else word
can be improved (?!) to
word[len(prefix) if word.startswith(prefix) else 0:]
It could be 'improved' more to: word[word.startswith(prefix) and len(prefix) : ]
I don't know if this is more readable, but I think it's less so.
Note that version 1 doesn't copy word if it doesn't start with prefix, while version 2 does. In many applications I can think of the results would be accumulated in a set, and version 1 equality tests will also be faster in the frequent case that the word doesn't start with the prefix. So that's the one I'd go with, as I can't think of any applications where multiple copies of the same string would be useful.
_Neither_ version copies if the word doesn't start with the prefix. If you won't believe me, test them! :-)
On Tue, Apr 02, 2019 at 07:28:01PM +0100, MRAB wrote: [...]
word[len(prefix) if word.startswith(prefix) else 0:]
It could be 'improved' more to:
word[word.startswith(prefix) and len(prefix) : ] [...] _Neither_ version copies if the word doesn't start with the prefix. If you won't believe me, test them! :-)
That slicing doesn't make a copy of the string is an implementation-dependent optimization, not a language guarantee. It's an obvious optimization to make (and in my testing, it does work all the way back to CPython 1.5) but if you want to write implementation-independent code, you shouldn't rely on it. By the letter of the language spec, an interpreter may make a copy of a string when doing a full slice string[0:len(string)]. -- Steven
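The three one-liners under discussion can be compared directly. This is only a sketch with made-up sample values: all three always agree in value, but only the first is guaranteed to hand back the original object when the prefix is absent, since it returns `word` itself without slicing.

```python
word = "spam and eggs"
prefix = "ham"  # deliberately not a prefix of word

# Version 1: returns `word` itself when there is no match.
v1 = word[len(prefix):] if word.startswith(prefix) else word

# Version 2: always slices; with no match this is word[0:].
v2 = word[len(prefix) if word.startswith(prefix) else 0:]

# Version 3: `False and ...` evaluates to False, and word[False:] is word[0:].
v3 = word[word.startswith(prefix) and len(prefix):]

same_value = (v1 == v2 == v3 == word)

# Guaranteed by the language: v1 is the same object as word.
v1_is_word = v1 is word

# Not guaranteed: whether a full slice returns the same object is an
# implementation detail (CPython happens to do so).
v2_is_word = v2 is word
```

Code that relies on `is` to detect "nothing was removed" should therefore use the first form, or compare lengths instead.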
MRAB writes:
On 2019-04-02 19:10, Stephen J. Turnbull wrote:
word[len(prefix) if word.startswith(prefix) else 0:]
It could be 'improved' more to:
word[word.startswith(prefix) and len(prefix) : ]
Except that it would be asymmetric with suffix. That probably doesn't matter given the sequence[:-0] bug. BTW thank you for pointing out that bug (and not quoting the code where I deliberately explicitly introduced the null suffix! ;-) This works: word[:-len(suffix) or len(word)] if word.endswith(suffix) else word Do tutorials mention this pitfall with computed indices (that -0 is treated as "beginning of sequence")? (I should check myself, but can't spend time this week and so probably won't. :-( )
prefix. So that's the one I'd go with, as I can't think of any applications where multiple copies of the same string would be useful.
_Neither_ version copies if the word doesn't start with the prefix. If you won't believe me, test them! :-)
Oh, I believe you. It just means somebody long ago thought more deeply about the need for copying immutable objects than I ever have.
On 2019-04-03 09:38, Stephen J. Turnbull wrote:
MRAB writes:
On 2019-04-02 19:10, Stephen J. Turnbull wrote:
word[len(prefix) if word.startswith(prefix) else 0:]
It could be 'improved' more to:
word[word.startswith(prefix) and len(prefix) : ]
Except that it would be asymmetric with suffix. That probably doesn't matter given the sequence[:-0] bug.
BTW thank you for pointing out that bug (and not quoting the code where I deliberately explicitly introduced the null suffix! ;-) This works:
word[:-len(suffix) or len(word)] if word.endswith(suffix) else word
I would've written it as: word[: len(word) - len(suffix)] if word.endswith(suffix) else word
Do tutorials mention this pitfall with computed indices (that -0 is treated as "beginning of sequence")? (I should check myself, but can't spend time this week and so probably won't. :-( )
prefix. So that's the one I'd go with, as I can't think of any applications where multiple copies of the same string would be useful.
_Neither_ version copies if the word doesn't start with the prefix. If you won't believe me, test them! :-)
Oh, I believe you. It just means somebody long ago thought more deeply about the need for copying immutable objects than I ever have.
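The empty-suffix pitfall mentioned in this exchange is easy to reproduce. In the sketch below, the three function names are labels invented for this comparison; the expressions inside them are the ones quoted in the messages above.

```python
def cut_naive(word, suffix):
    # Pitfall: for suffix == "", -len(suffix) is 0, so this is word[:0].
    return word[:-len(suffix)] if word.endswith(suffix) else word


def cut_turnbull(word, suffix):
    # `-0 or len(word)` falls back to len(word), keeping the whole string.
    return word[:-len(suffix) or len(word)] if word.endswith(suffix) else word


def cut_mrab(word, suffix):
    # A non-negative stop index avoids the special case entirely.
    return word[:len(word) - len(suffix)] if word.endswith(suffix) else word


empty_results = (
    cut_naive("example", ""),     # the whole word is lost
    cut_turnbull("example", ""),  # word is preserved
    cut_mrab("example", ""),      # word is preserved
)
```

Every string ends with the empty string, so the endswith() guard does not protect the naive version. Computing the stop index as len(word) - len(suffix) needs no extra branch.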
Use "without" as the action, picking up on "with" as in startswith, endswith:

    new_string = a_string.withoutprefix( prefix )
    new_string = a_string.withoutsuffix( suffix )

And since we have "replace", "remove" would also seem obvious.

    new_string = a_string.removeprefix( prefix )
    new_string = a_string.removesuffix( suffix )

I know that some commented that remove sounds like it's in place. But then so does replace. Would "replacesuffix" and "replaceprefix" work? I'd default the "replacement" to the empty string "".

    new_string = a_string.replaceprefix( old_prefix, replacement )
    new_string = a_string.replacesuffix( old_suffix, replacement )

Barry
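The replaceprefix/replacesuffix idea with a default empty replacement can be sketched as plain functions. These are hypothetical free functions written for illustration, not str methods, and the behaviour for an empty old affix (the replacement is simply prepended or appended) is an assumption.

```python
def replaceprefix(s, old_prefix, replacement=""):
    # Swap a leading old_prefix for replacement; otherwise return s unchanged.
    if s.startswith(old_prefix):
        return replacement + s[len(old_prefix):]
    return s


def replacesuffix(s, old_suffix, replacement=""):
    # len(s) - len(old_suffix) avoids the s[:-0] pitfall for an empty suffix.
    if s.endswith(old_suffix):
        return s[:len(s) - len(old_suffix)] + replacement
    return s


address = replaceprefix("mailto:maria@gmail.com", "mailto:")
moved = replacesuffix("maria@gmail.com", "gmail.com", "example.org")
untouched = replaceprefix("maria@gmail.com", "mailto:")
```

For reference, Python 3.9 later gained str.removeprefix and str.removesuffix (PEP 616), which correspond to the empty-replacement case.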
On 02Apr2019 12:23, Paul Moore <p.f.moore@gmail.com> wrote:
On Tue, 2 Apr 2019 at 12:07, Rhodri James <rhodri@kynesim.co.uk> wrote:
So far we have two slightly dubious use-cases.
1. Stripping file extensions. Personally I find that treating filenames like filenames (i.e. using os.path or (nowadays) pathlib) results in me thinking more appropriately about what I'm doing.
I'd go further and say that filename manipulation is a great example of a place where generic string functions should definitely *not* be used.
Filename manipulation on a path _component_ is generally pretty reliable (yes one can break things by, say, inserting os.sep). I do a fair bit of filename fiddling using string functions, and these fall into 3 categories off the top of my head: - file extensions, and here I do use splitext() - trimming extensions (only barely a second case), and it turns out the only case I could easily find using the endswith/[:-offset] incantation would probably go just as well with splitext() - normalising pathnames; as an example, for the home media library I routinely downcase filenames, convert whitespace into a dash, separate fields with "--" (eg episode designator vs title) and convert _ into a colon (hello Mac Finder and UI file save dialogues, a holdover compatibility mode from OS9) None of these seem to benefit directly from having a cutprefix/cutsuffix method. But splitext aside, I'm generally fiddling a pathname component (and usually a basename), and in that domain the general string functions are very handy and well used. So I think "filename" (basename) fiddling with str methods is actually pretty reasonable. It is _pathname_ fiddling that is hazardous, because the path separators often need to be treated specially.
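For the extension cases mentioned here, the standard library already splits names without any affix arithmetic. A short sketch follows; the sample path is made up, and PurePosixPath is used so the result does not depend on the platform running it.

```python
import os.path
from pathlib import PurePosixPath

name = "foo.tar.gz"

# os.path.splitext splits off only the last extension.
root, ext = os.path.splitext(name)

# pathlib exposes the same information as attributes.
p = PurePosixPath("/media/library/foo.tar.gz")
last_suffix = p.suffix                     # last extension only
all_suffixes = p.suffixes                  # every extension, in order
stem = p.stem                              # final component minus last suffix
without_ext = str(p.with_suffix(""))       # path with last suffix dropped
```

Both approaches remove one extension at a time, which matches the foo.tar.gz to foo.tar to foo workflow described earlier in the thread.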
2. Stripping prefixes and suffixes to get to root words. Python has been used for natural language work for over a decade, and I don't think I've heard any great call from linguists for the functionality. English isn't a girl who puts out like that on a first date :-) There are too many common exception cases for such a straightforward approach not to cause confusion.
Agreed, using prefix/suffix stripping on natural language is at best a "quick hack".
Yeah. I was looking at the prefix list from a related article and seeing "intra" and thinking "intractable". Hacky indeed. _Unless_ the word has already been qualified as suitable for the action. And once it is, a cutprefix method would indeed be handy.
3. My most common use case (not very common at that) is for stripping annoying prompts off text-based APIs. I'm happy using .startswith() and string slicing for that, though your point about the repeated use of the string to be stripped off (or worse, hard-coding its length) is well made.
In some ways the verbosity and bugproneness is my personal use case for cutprefix/cutsuffix (however spelt):

- repeating the string is wordy and requires human eyeballing whenever I read it (to check for correctness); the same applies whenever I write such a piece of code
- personally I'm quite prone to off-by-one errors when hand writing variations on this
- a well named method is more readable and expresses intent better (the same argument holds for a standalone function, though a method is a bit better)
- the anecdotally not uncommon misuse of .strip() where .cutsuffix() would be correct

I confess being a little surprised at how few examples which could use cutsuffix I found in my own code, where I had expected it to be common. I find several bits like this:

    # parsing text which may have \r\n line endings
    if line.endswith('\r'):
        line = line[:-1]

    # parsing a UNIX network interface listing from ifconfig,
    # which varies platform to platform
    if ifname.endswith(':'):
        ifname = ifname[:-1]

Here I DO NOT want rstrip() because I want to strip only one character, rather than as many as there are. So: the optional trailing marker in some input. But doing this for single character markers is much easier to get right than the broader case with longer suffixes, so I think this is not a very strong case.

Fiddling the domain suffix on an email address:

    if not addr.endswith(old_domain):
        raise ValueError('addr does not end in old_domain')
    addr2 = addr[:-len(old_domain)] + new_domain

which would be a good fit, _except_ for the sanity check. However, that sanity check is just one of a few preceding the change, so in fact this is a good fit.

I have a few classes which annotate their instances with some magic attributes. Here's a snippet from a class' __getattr__ for a db schema:

    if attr.endswith('_table'):
        # *_table ==> table "*"
        nickname = attr[:-6]
        if nickname in self.table_by_nickname:

There's a little suite of "match attribute suffix, trim and do something specific with what's left" if statements. However, they are almost all of the form above, so rewriting it like this:

    if attr.endswith('_table'):
        # *_table ==> table "*"
        nickname = attr.cutsuffix('_table')
        if nickname in self.table_by_nickname:

is a small improvement. Every magic number (the "6" above) is an opportunity for bugs.
I am beginning to worry slightly that actually there are usually more appropriate things to do than simply cutting off affixes, and that in providing these particular batteries we might be encouraging poor practise.
It would be really helpful if someone could go through the various use cases presented in this thread and classify them - filename manipulation, natural language uses, and "other".
Surprisingly for me, the big subjective win is avoiding misuse of lstrip/rstrip by having obvious better named alternatives for affix trimming.

Short summary: in my own code I find opportunities for an affix trim method less common than I had expected. But I still like the "might find it useful" argument.

I think I find "might find it useful" more compelling than many do. Let me explain.

I think a _well_ _defined_ battery is worth including in the kit (str methods) because:

- the operation is simple and well defined: people won't be confused by its purpose, and when they want it there is a reliable debugged method sitting there ready for use
- variations on this get written _all the time_, and writing those variations using the method is more readable and more reliable
- the existing .strip battery is misused for this purpose by accident

I have in the past found myself arguing for adding little tools like this in agile teams, and getting a lot of resistance. The resistance tended to take these forms:

- YAGNI. While the tiny battery _can_ be written longhand, every time that happens makes for needlessly verbose code, is an opportunity for stupid bugs, and makes code whose purpose must be _deduced_ rather than doing what it says on the tin
- not in this ticket: this leads to a starvation issue - the battery never goes in with any ticket, and a ticket just for the battery never gets chosen for a sprint
- we've already got this other battery; subtext "not needed" or "we don't want 2 ways to do this", my subtext "does it worse, or does something which only _looks_ like this purpose". Classic example from the codebase I was in at the time was SQL parameter insertion. Eventually I said "... this" and wrote the battery anyway.

My position on cut*affix is that (a) it is easy to implement (b) it can thus be debugged once (c) it makes code clearer when used (d) it reduces the likelihood of .strip() misuse.

Cheers, Cameron Simpson <cs@cskk.id.au>
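The .strip() misuse referred to above comes from lstrip/rstrip treating their argument as a set of characters rather than as a substring. A minimal sketch follows, with cutprefix and cutsuffix written as hypothetical free functions (the thread proposes them as str methods) and sample values made up for illustration.

```python
def cutprefix(s, prefix):
    # Remove prefix once, if present; an empty prefix leaves s unchanged.
    return s[len(prefix):] if prefix and s.startswith(prefix) else s


def cutsuffix(s, suffix):
    # Remove suffix once, if present; an empty suffix leaves s unchanged.
    return s[:-len(suffix)] if suffix and s.endswith(suffix) else s


s = "mailto:maria@gmail.com"
misused = s.lstrip("mailto:")        # strips any of the characters m,a,i,l,t,o,:
intended = cutprefix(s, "mailto:")   # strips the literal prefix once

line = "data\r\r"
all_stripped = line.rstrip("\r")     # removes every trailing '\r'
one_stripped = cutsuffix(line, "\r") # removes exactly one

nickname = cutsuffix("users_table", "_table")  # no magic number 6 needed
```

The lstrip call eats into the address itself because "m" and "a" are in the character set, which is exactly the class of bug a dedicated affix method avoids.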
On Wed, Apr 03, 2019 at 09:58:07AM +1100, Cameron Simpson wrote: [...]
Yeah. I was looking at the prefix list from a related article and seeing "intra" and thinking "intractable". Hacky indeed.
That example supports my position that we ought to be cautious about allowing multiple prefixes. The correct prefix in that case is in- not intra-. Deciding which prefix ought to take precedence requires specific domain knowledge, not a simple rule like "first|last|shortest|longest wins".
_Unless_ the word has already been qualified as suitable for the action. And once it is, a cutprefix method would indeed be handy.
Which is precisely the point. Of course stemming words in full generality is hard. It requires the nuclear reactor of something like NLTK, and even that sometimes gets it wrong. But this is not a proposal for a natural language stemmer, it is a proposal for simple battery which could be used any time you want to cut a known prefix or suffix. [...]
- the anecdotally not uncommon misuse of .strip() where .cutsuffix() would be correct
Anecdotal would be "I knew a guy who made this error", but the evidence presented is objectively verifiable posts on the bug tracker, mailing lists and especially stackoverflow showing that people need to cut affixes and misuse strip for that purpose.
I confess being a little surprised at how few examples which could use cutsuffix I found in my own code, where I had expected it to be common.
I don't expect it to be very common, just common enough to be a repeated source of pain. It's probably more common, and less specialised, than partition and zfill, but less common than startswith/endswith. [...]
if ifname.endswith(':'): ifname = ifname[:-1]
Here I DO NOT want rstrip() because I want to strip only one character, rather than as many as there are. So: the optional trailing marker in some input. But doing this for single character markers is much easier to get right than the broader case with longer suffixes, so I think this is not a very strong case.
Imagine that these proposed methods had been added in Python 2.2. Would you be even a tiny bit tempted to write that code above, or would you use the string method? Now imagine it's five years from now, and you're using Python 3.11, and you came across code somebody (possibly even you!) wrote: ifname = ifname.cutsuffix(':') Would you say "Damn, I wish that method had never been added!" and replace it with the earlier code above? Those two questions are not so much aimed at you, Cameron, personally, they're more generic questions for any reader. -- Steven
On 03Apr2019 14:54, Steven D'Aprano <steve@pearwood.info> wrote:
Now imagine it's five years from now, and you're using Python 3.11, and you came across code somebody (possibly even you!) wrote:
ifname = ifname.cutsuffix(':')
Would you say "Damn, I wish that method had never been added!" and replace it with the earlier code above?
Just a late followup to this thread. The other month I found myself doing the endswith/s=s[:-n] shuffle yet again, and wrote a pair of cutprefix and cutsuffix functions. They're available in my "cs.lex" PyPI module if anyone wants to use them. Their signature is:

    prefix = cutsuffix(original_string, suffix)
    if prefix is original_string:
        # suffix not present
        ...
    else:
        # suffix present, proceed using prefix

and the converse for cutprefix. Cheers, Cameron Simpson <cs@cskk.id.au>
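The calling convention described above, where an identity check tells the caller whether anything was removed, can be sketched like this. It is an illustration of the idea only and is not taken from the cs.lex source; treating an empty suffix as "not present" is an assumption.

```python
def cutsuffix(s, suffix):
    """Return s without suffix, or s itself (same object) if absent.

    Illustrative sketch; an empty suffix is treated as not present.
    """
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s  # the very same object, so callers may test with `is`


def describe(original_string, suffix):
    prefix = cutsuffix(original_string, suffix)
    if prefix is original_string:
        return ("absent", original_string)
    return ("present", prefix)


with_marker = describe("eth0:", ":")
without_marker = describe("eth0", ":")
```

Because the function returns its argument directly when nothing matches, the `is` test does not depend on any interpreter optimization, and a matched non-empty suffix always yields a shorter, distinct string.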
Thanks. I was wondering what happened to that idea. I’d like to see it revived, it seems a perfectly reasonable addition to the string object to me. And now you’ve written a prototype, there’s a straightforward proposal. -CHB On Wed, Mar 4, 2020 at 4:54 PM Cameron Simpson <cs@cskk.id.au> wrote:
On 03Apr2019 14:54, Steven D'Aprano <steve@pearwood.info> wrote:
Now imagine it's five years from now, and you're using Python 3.11, and you came across code somebody (possibly even you!) wrote:
ifname = ifname.cutsuffix(':')
Would you say "Damn, I wish that method had never been added!" and replace it with the earlier code above?
Just a late followup to this thread.
The other month I found myself doing the endswith/s=s[:-n] shuffle yet again, and wrote a pair of cutprefix and cutsuffix functions. They're available in my "cs.lex" PyPI module if anyone wants to use them. Their signature is:
prefix = cutsuffix(original_string, suffix) if prefix is original_string: # suffix not present ... else: # suffix present, proceed using prefix
and the converse for cutprefix.
Cheers, Cameron Simpson <cs@cskk.id.au>
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
On Tue, 2 Apr 2019 at 23:58, Cameron Simpson <cs@cskk.id.au> wrote:
I think I find "might find it useful" more compelling than many do. Let me explain.
I think a _well_ _defined_ battery is worth including in the kit (str methods) because:
- the operation is simple and well defined: people won't be confused by its purpose, and when they want it there is a reliable debugged method sitting there ready for use
- variations on this get written _all the time_, and writing those variations using the method is more readable and more reliable
- the existing .strip battery is misused for this purpose by accident
I have in the past found myself arguing for adding little tools like this in agile teams, and getting a lot of resistance. The resistance tended to take these forms:
- YAGNI. While the tiny battery _can_ be written longhand, every time that happens makes for needlessly verbose code, is an opportunity for stupid bugs, and makes code whose purpose must be _deduced_ rather than doing what it says on the tin
- not in this ticket: this leads to a starvation issue - the battery never goes in with any ticket, and a ticket just for the battery never gets chosen for a sprint
- we've already got this other battery; subtext "not needed" or "we don't want 2 ways to do this", my subtext "does it worse, or does something which only _looks_ like this purpose". Classic example from the codebase I was in at the time was SQL parameter insertion. Eventually I said "... this" and wrote the battery anyway.
These are very good arguments, and they aren't something I'd really thought about - they make a very good case (in general) for being sympathetic to proposals for small features that "might be useful", while also offering a couple of good tests for such proposals. "Simple and well defined" in particular strikes me as important (and it's often the one that gets lost when the bikeshedding about end cases starts ;-))
My position on cut*affix is that (a) it is easy to implement (b) it can thus be debugged once (c) it makes code clearer when used (d) it reduces the likelihood of .strip() misuse.
IMO, cut*fix at this point is mainly waiting on someone to actually put a feature request on bpo, and an implementation PR on github. At that point, whether it gets implemented will boil down to whether one of the core devs likes it enough to merge. I doubt more discussion here is going to make much difference, and the proposal isn't significant enough to warrant a PEP. Paul
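The .strip() misuse mentioned above is easy to reproduce. The following short sketch reuses the mailto example from the start of the thread; the variable names are only illustrative.

```python
s = "mailto:maria@gmail.com"

# lstrip treats its argument as a set of characters, not as a prefix:
# it keeps removing leading characters while they appear in "mailto:".
stripped = s.lstrip("mailto:")
print(stripped)  # ria@gmail.com

# Removing the literal prefix needs an explicit startswith check.
prefix = "mailto:"
cut = s[len(prefix):] if s.startswith(prefix) else s
print(cut)  # maria@gmail.com
```

The first result loses the "ma" of "maria" because both letters are in the character set passed to lstrip.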
Rhodri James writes: Steven d'Aprano writes:
(That's over a third of this admittedly incomplete list of prefixes.)
I can think of at least one English suffix pair that clash: -ify, -fy.
And worse: is "tries" the third person present tense of "try" or is it the plural of "trie"? Pure lexical manipulation can't tell you.
You're beginning to persuade me that cut/trim methods/functions aren't a good idea :-)
I don't think I would go there yet (well, I started there, but...).
So far we have two slightly dubious use-cases.
1. Stripping file extensions. Personally I find that treating filenames like filenames (i.e. using os.path or (nowadays) pathlib) results in me thinking more appropriately about what I'm doing.
Very much agree.
2. Stripping prefixes and suffixes to get to root words.
    for suffix in english_suffixes:
        root = word.cutsuffix(suffix)
        if lookup_in_dictionary(root):
            do_something_appropriate_with_each_root_found()

is surely more flexible and accurate than a hard-coded slice, and significantly more readable than

    for suffix in english_suffixes:
        root = word[:-len(suffix)] if word.endswith(suffix) else word
        if lookup_in_dictionary(root):
            do_something_appropriate_with_each_root_found()

I think enough so that I might use a local def for cutsuffix if the method doesn't exist. So my feeling is that the use case for "or"-ing multiple suffixes is a lot weaker than it is for .endswith, but .cutsuffix itself is plausible. That said, I wouldn't add it if it were up to me. Among other things, for this root-extracting application

    def extract_root(word, prefix, suffix):
        word = word[len(prefix):] if word.startswith(prefix) else word
        word = word[:-len(suffix)] if word.endswith(suffix) else word
        # perhaps try further transforms like tri -> try here?
        return word

and a double loop

    for prefix in english_prefixes:      # includes ''
        for suffix in english_suffixes:  # includes ''
            root = extract_root(word, prefix, suffix)
            if lookup_in_dictionary(root):
                yield root

(probably recursive, as well) seems most elegant.
3. My most common use case (not very common at that) is for stripping annoying prompts off text-based APIs. I'm happy using .startswith() and string slicing for that, though your point about the repeated use of the string to be stripped off (or worse, hard-coding its length) is well made.
I don't understand this use case, specifically the opposition to hard-coding the length. Although hard-coding the length wouldn't occur to me in many cases, since I'd use

    # remove my bash prompt
    prompt_re = re.compile(r'^[^\u0000-\u001f\u007f]+ \d\d:\d\d\$ ')
    lines = [prompt_re.sub('', line) for line in lines]

if I understand the task correctly. Similarly, there's a lot of regexp-removable junk in MTA logs, timestamps and DNS lookups for example, that can't be handled with cutprefix.
I am beginning to worry slightly that actually there are usually more appropriate things to do than simply cutting off affixes, and that in providing these particular batteries we might be encouraging poor practise.
I don't think that's a worry, at least if restricted to the single-affix form, because simply cutting off affixes is surely part of most such algorithms. The harder part is remembering that you probably have to deal with multiplicities and further transformations, but that can't be incentivized by refusing to implement .cutsuffix. It's an independent consideration. Steve
On 02/04/2019 18:55, Stephen J. Turnbull wrote:
= Me 3. My most common use case (not very common at that) is for stripping annoying prompts off text-based APIs. I'm happy using .startswith() and string slicing for that, though your point about the repeated use of the string to be stripped off (or worse, hard-coding its length) is well made.
I don't understand this use case, specifically the opposition to hard-coding the length. Although hard-coding the length wouldn't occur to me in many cases, since I'd use
    # remove my bash prompt
    prompt_re = re.compile(r'^[^\u0000-\u001f\u007f]+ \d\d:\d\d\$ ')
    lines = [prompt_re.sub('', line) for line in lines]
For me it's more often like

    input = get_line_from_UART()
    if input.startswith("INFO>"):
        input = input[5:]
    do_something_useful(input)

which is error-prone when you cut and paste for a different prompt elsewhere and forget to change the slice to match. -- Rhodri James *-* Kynesim Ltd
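The cut-and-paste hazard described above can be sketched as follows. The "WARNING>" prompt and the cut_prefix helper are hypothetical and exist only for this illustration.

```python
def cut_prefix(line, prefix):
    """Hypothetical helper: drop prefix once if present, else return line."""
    return line[len(prefix):] if line.startswith(prefix) else line


line = "WARNING>battery low"

# Pasted from the "INFO>" branch; the slice length 5 was never updated.
if line.startswith("WARNING>"):
    wrong = line[5:]

# The helper derives the length from the prefix itself.
right = cut_prefix(line, "WARNING>")

print(repr(wrong))
print(repr(right))
```

Because the prefix appears only once in the helper call, there is no second number that has to be kept in sync with it.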
On 4/2/2019 2:02 PM, Rhodri James wrote:
On 02/04/2019 18:55, Stephen J. Turnbull wrote:
= Me 3. My most common use case (not very common at that) is for stripping annoying prompts off text-based APIs. I'm happy using .startswith() and string slicing for that, though your point about the repeated use of the string to be stripped off (or worse, hard-coding its length) is well made.
I don't understand this use case, specifically the opposition to hard-coding the length. Although hard-coding the length wouldn't occur to me in many cases, since I'd use
    # remove my bash prompt
    prompt_re = re.compile(r'^[^\u0000-\u001f\u007f]+ \d\d:\d\d\$ ')
    lines = [prompt_re.sub('', line) for line in lines]
For me it's more often like
    input = get_line_from_UART()
    if input.startswith("INFO>"):
        input = input[5:]
    do_something_useful(input)
which is error-prone when you cut and paste for a different prompt elsewhere and forget to change the slice to match.
I originally saw this, and I thought "Yeah, me, too!". But then I realized I rarely want to do this. I almost always want to know if the string began with the prefix. I'd normally use something like this:

--------------------------
for line in ["INFO>rest-of-line", "not-INFO>more-text", "text", "INFO>", ""]:
    start, sep, rest = line.partition("INFO>")
    if not start and sep:
        print(f"control line {rest!r}")
    else:
        print(f"data line {line!r}")

output:
control line 'rest-of-line'
data line 'not-INFO>more-text'
data line 'text'
control line ''
data line ''
--------------------------

Breaking it out as a function gives how I'd need to call this, if we made it a function (or method on str):

--------------------------
def str_has_prefix(s, prefix):
    '''returns (True, rest-of-string) or (False, s)'''
    start, sep, rest = s.partition(prefix)
    if not start and sep:
        return True, rest
    else:
        return False, s

for line in ["INFO>rest-of-line", "not-INFO>more-text", "text", "INFO>", ""]:
    has_prefix, line = str_has_prefix(line, "INFO>")
    if has_prefix:
        print(f"control line {line!r}")
    else:
        print(f"data line {line!r}")
--------------------------

Now I'll admit it's not super-efficient to create the start, sep, and rest sub-strings all the time, and maybe the test "not start and sep" isn't so obvious at first glance, but for my work this is good enough. It's not super-important how the function (or method) is implemented, I'm more concerned about the interface. If it was done in C, it obviously wouldn't call .partition().

So while I was originally +1 on this proposal, now I'm not so sure, given how I normally need to check if the string starts with a prefix and get the rest of the string if it does start with the prefix.

On the other hand, just this weekend I was helping (again) with someone who misunderstood str.strip() on the bug tracker: https://bugs.python.org/issue36480, so I know .strip() and friends confuse people.
But I don't think we can use that fact to say that we need .lcut()/.rcut(). It's just that as it's being proposed here, I think lcut/rcut (or whatever names) just doesn't have a useful interface, for me. I don't think I've ever wanted to remove a prefix/suffix if it existed, else use the whole string, and not know which case occurred. Eric PS: I really tried to find a way to use := in this example so I could put the assignment inside the 'if' statement, but as I think Tim Peters pointed out, without C's comma operator, you can't.
On Tue, Apr 2, 2019 at 5:43 PM Eric V. Smith <eric@trueblade.com> wrote:
PS: I really tried to find a way to use := in this example so I could put the assignment inside the 'if' statement, but as I think Tim Peters pointed out, without C's comma operator, you can't.
Conceivably cut_prefix could return None if not found. Then you could write something like:

    if (stripped := cut_prefix(line, "INFO>")) is not None:
        print(f"control line {stripped!r}")
    else:
        print(f"data line {line!r}")

You could even drop "is not None" in many circumstances, if you know the cut string will be non-empty. That's actually pretty readable:

    if stripped := cut_prefix(line, "INFO>"):
        print(f"control line {stripped!r}")
    else:
        print(f"data line {line!r}")
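A runnable sketch of this None-returning variant, assuming Python 3.8 or later for the := operator. The body of cut_prefix is a guess at what the suggestion implies, not an agreed design.

```python
def cut_prefix(s, prefix):
    """Hypothetical: return the text after prefix, or None if prefix is absent."""
    if s.startswith(prefix):
        return s[len(prefix):]
    return None


results = []
for line in ["INFO>rest-of-line", "not-INFO>more-text", "INFO>", ""]:
    if (rest := cut_prefix(line, "INFO>")) is not None:
        results.append(("control", rest))
    else:
        results.append(("data", line))

# Dropping "is not None" misclassifies the bare "INFO>" line, because
# its remainder is the empty string, which is falsy.
truthy_only = "control" if cut_prefix("INFO>", "INFO>") else "data"

print(results)
print(truthy_only)
```

The last two lines show why the shorter form is only safe when the remainder is known to be non-empty.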
On 2019-04-03 03:06, Stephan Hoyer wrote:
On Tue, Apr 2, 2019 at 5:43 PM Eric V. Smith <eric@trueblade.com> wrote:
PS: I really tried to find a way to use := in this example so I could put the assignment inside the 'if' statement, but as I think Tim Peters pointed out, without C's comma operator, you can't.
Conceivably cut_prefix could return None if not found. Then you could write something like:
    if (stripped := cut_prefix(line, "INFO>")) is not None:
        print(f"control line {stripped!r}")
    else:
        print(f"data line {line!r}")
You could even drop "is not None" in many circumstances, if you know the cut string will be non-empty. That's actually pretty readable:
    if stripped := cut_prefix(line, "INFO>"):
        print(f"control line {stripped!r}")
    else:
        print(f"data line {line!r}")
-1

Sometimes you just want to remove it if present, otherwise leave the string as-is. I wouldn't want to have to write:

    line = line.lcut("INFO>") or line
On 2019-04-02 18:55, Stephen J. Turnbull wrote:
Rhodri James writes:
Steven d'Aprano writes:
(That's over a third of this admittedly incomplete list of prefixes.)
I can think of at least one English suffix pair that clash: -ify, -fy.
And worse: is "tries" the third person present tense of "try" or is it the plural of "trie"? Pure lexical manipulation can't tell you.
You're beginning to persuade me that cut/trim methods/functions aren't a good idea :-)
I don't think I would go there yet (well, I started there, but...).
So far we have two slightly dubious use-cases.
1. Stripping file extensions. Personally I find that treating filenames like filenames (i.e. using os.path or (nowadays) pathlib) results in me thinking more appropriately about what I'm doing.
Very much agree.
2. Stripping prefixes and suffixes to get to root words.
    for suffix in english_suffixes:
        root = word.cutsuffix(suffix)
        if lookup_in_dictionary(root):
            do_something_appropriate_with_each_root_found()
is surely more flexible and accurate than a hard-coded slice, and significantly more readable than
    for suffix in english_suffixes:
        root = word[:-len(suffix)] if word.endswith(suffix) else word
        if lookup_in_dictionary(root):
            do_something_appropriate_with_each_root_found()
[snip] The code above contains a subtle bug. If suffix == '', then word.endswith(suffix) == True, and word[:-len(suffix)] == word[:-0] == ''. Each time I see someone do that, I see more evidence in support of adding the method.
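The empty-suffix bug described above can be reproduced in a few lines. The guarded expression at the end is one possible fix, shown as an assumption rather than as what any proposed method would do.

```python
word = "tries"
suffix = ""

# endswith("") is always True, and -len("") is -0, which is plain 0,
# so the slice becomes word[:0] and the whole word is lost.
buggy = word[:-len(suffix)] if word.endswith(suffix) else word

# Guarding on a non-empty suffix keeps the word intact.
safe = word[:-len(suffix)] if suffix and word.endswith(suffix) else word

print(repr(buggy))
print(repr(safe))
```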
On Wed, Apr 3, 2019 at 5:34 AM MRAB <python@mrabarnett.plus.com> wrote:
The code above contains a subtle bug. If suffix == '', then word.endswith(suffix) == True, and word[:-len(suffix)] == word[:-0] == ''.
Each time I see someone do that, I see more evidence in support of adding the method.
Either that, or it's evidence that negative indexing is only part of the story, and we need a real way to express "zero from the end" other than negative zero. For instance, word[:<0] might mean "zero from the end", and word[:<1] would be "one from the end". As a syntactic element rather than an arithmetic one, it would be safe against accidentally slicing from the front instead of the back. But that's an idea for another day. ChrisA
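Until such syntax exists, "n from the end" can be written with an explicit length so that zero behaves as expected. This is a minimal sketch; drop_last is a hypothetical helper name.

```python
def drop_last(s, n):
    """Hypothetical helper: drop the last n characters of s, for any n >= 0."""
    # Computing the end index from len(s) avoids the -0 == 0 trap, and
    # max() stops a too-large n from wrapping around to a negative index.
    return s[:max(len(s) - n, 0)]


print(repr("word"[:-0]))            # the trap: slices from the front
print(repr(drop_last("word", 0)))   # zero from the end
print(repr(drop_last("word", 1)))   # one from the end
```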
I think the point Chris made about statistics.mode is important enough to start a new subthread about API design, and the lessons learned. On Mon, Apr 01, 2019 at 02:29:44PM +1100, Chris Angelico wrote:
We're basically debating collision semantics here. It's on par with asking "how should statistics.mode() cope with multiple modes?". Should the introduction of statistics.mode() have been delayed pending a thorough review of use-cases, or is it okay to make it do what most people want, and then be prepared to revisit its edge-case handling?
(For those who don't know, mode() was changed in 3.8 to return the first mode encountered, in contrast to previous behaviour where it would raise an exception.)
For those who are unaware, I was responsible for choosing the semantics of statistics.mode. My choice was to treat mode() as it is taught in secondary schools here in Australia, namely that if there are two or more equally common values, there is no mode.

Statistically, there is no one right answer to how to treat multiple modes. Sometimes you treat them as true multiple modes, sometimes you say there is no mode, and sometimes you treat the fact that there are multiple modes as an artifact of the sample and pick one or another as the actual mode. There's no particular statistical reason to choose the first over the second or the third. So following the Zen, I refused to guess, and raised an exception. (I toyed with returning None instead, but decided against it for reasons that don't matter here.)

This seemed like a good decision up front, and I don't remember there being any objections to that behaviour when the PEP was discussed both here and on Python-Dev. But once we had a few years of real-world practice, it turns out that:

(1) Raising an exception was an annoying choice that meant that every use of mode() outside of the interactive interpreter needed to be wrapped in a try...except block, making it painful to use.

(2) There is at least one good use-case for returning the first mode, even though statistically there's no reason to prefer it over any other.

Importantly, that use-case was something that neither I, nor anyone involved in the original PEP debate for this, had thought of. It took a few years of actual use in the wild before anyone came up with an important, definitive use-case -- and it turns out to be completely unrelated to the statistical use of mode! Raymond Hettinger persuaded me that this non-statistics use-case was important enough for mode to pick a behaviour which has no statistical justification. (Also, many other statistics packages do the same thing, so even if we're wrong, we're no worse than everyone else.)
Had I ignored the Zen and, in the face of ambiguity, *guessed* which mode to return, I could have guessed wrongly and returned one of these:

- the largest mode
- or the smallest
- the last seen mode
- the mode closest to the mean
- or median, or some other measure of central tendency
- or some sort of special "multi-mode" object (perhaps a list).

I would have missed a real use-case that I never imagined existed, as well as a good opportunity for optimization. Raymond's new version of mode is faster as well as more useful. Because I *refused to guess* and raised an exception:

(1) mode was harder to use than it should have been;

(2) but we were able to change its behaviour without a lengthy and annoying deprecation period, or introducing a "new_mode" function.

Knowing what I know *now*, if I were designing mode() from scratch I'd go with Raymond's design. If it is statistically unjustified, it's justified by other reasons, and if it is wrong, it's not so wrong as to be useless, and it's wrong in a way that many other statistics libraries are also wrong. So we're in good company. But I didn't know that *then*, and I never would have guessed that there was a non-statistical use for mode.

Lesson number one: Just because you have thought about your function for five minutes, or five months, doesn't mean you have thought of all the real-world uses.

Lesson number two: A lot of the Zen is intended as a joke, but the koan about refusing to guess is very good advice. When possible, be conservative, take your time to make a decision, and base it on real-world experience, not gut feelings about what is "obviously" correct. In language design even more than personal code, You Ain't Gonna Need It (Yet) applies.

Lesson number three: Sometimes, to not make a decision is itself a decision. In the case of mode, I had to deal with multiple modes *somehow*, I couldn't just ignore it.
Fortunately I chose to raise an exception, which made it possible to change my mind later without a lengthy deprecation period. But that in turn made the function more annoying and difficult to use in practice. But in the case of the proposed str.cut_prefix and cut_suffix methods, we can avoid the decision of what to do with multiple affixes by just not supporting them! We don't have to make a decision to raise an exception, or return X (for whatever semantics of X we choose). There's no need to choose *anything* about the multiple affix case until we have more real-world experience to make a judgement. Lesson number four: Python is nearly 30 years old, and the str.replace() method still refuses to guess how to deal with the case of multiple target strings. That doesn't make replace useless. -- Steven
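The mode() behaviour discussed above can be observed directly. This sketch assumes Python 3.8 or later, where mode() returns the first mode encountered and statistics.multimode() is available; on older versions the same call raises statistics.StatisticsError for this data. The sample data is invented for the illustration.

```python
import statistics

# "red" and "blue" both occur twice, so this sample has two modes.
data = ["red", "blue", "blue", "red", "green"]

# Python 3.8+: the first mode encountered wins, no exception is raised.
first = statistics.mode(data)

# multimode() reports every mode, in first-encountered order.
every = statistics.multimode(data)

print(first)
print(every)
```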
On 1 Apr 2019, at 02:23, David Mertz <mertz@gnosis.cx> wrote:
On Sun, Mar 31, 2019, 8:11 PM Steven D'Aprano <steve@pearwood.info> wrote: Regarding later proposals to add support for multiple affixes, to recursively delete the affix repeatedly, and to take an additional argument to limit how many affixes will be removed: YAGNI.
That's simply not true, and I think it's clearly illustrated by the example I gave a few times. Not just conceivably, but FREQUENTLY I write code to accomplish the effect of the suggested:
basename = fname.rstrip(('.jpg', '.gif', '.png'))
I probably do this MORE OFTEN than removing a single suffix.
Doing this with a for loop and without_suffix is fine though. Without without_suffix it's suddenly error prone. With a without_suffix that takes a tuple it's unclear what happens without reading the code. I think a single string argument is a great sweet spot: avoid the most error prone part and keep the loop in user code. / Anders
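The loop-in-user-code approach described above might look like the sketch below. Both without_suffix and strip_image_extension are hypothetical names, and stopping after the first match is a policy chosen by this example, which is exactly the kind of decision a single-string method leaves to the caller. Note that str.rstrip accepts only a string of characters (or None), so the tuple call quoted earlier is a proposed spelling, not something that runs today.

```python
def without_suffix(s, suffix):
    """Hypothetical single-suffix helper: drop suffix once if present."""
    if suffix and s.endswith(suffix):
        return s[:-len(suffix)]
    return s


def strip_image_extension(fname, suffixes=(".jpg", ".gif", ".png")):
    """Remove at most one of the given suffixes, the first that matches."""
    for suffix in suffixes:
        if fname.endswith(suffix):
            return without_suffix(fname, suffix)
    return fname


print(strip_image_extension("cat.jpg"))
print(strip_image_extension("cat.png.jpg"))
print(strip_image_extension("notes.txt"))
```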
participants (28)
- Alex Grigoryev
- Alex Hall
- Anders Hovmöller
- Andrew Barnert
- Barry Scott
- Brandt Bucher
- Cameron Simpson
- Chris Angelico
- Christopher Barker
- Dan Sommers
- David Mertz
- Eric V. Smith
- Ethan Furman
- Greg Ewing
- Guido van Rossum
- Jonathan Fine
- Kirill Balunov
- Mikhail V
- MRAB
- Paul Moore
- Random832
- Rhodri James
- Rob Cliffe
- Robert Vanden Eynde
- Stephan Hoyer
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy