Incremental step on road to improving situation around iterable strings
From reading many of the suggestions in this list (including some of my own, of course), there seem to be a lot of us in the Python coder community who find the current behavior of strings being iterable collections of single-character strings to be somewhat problematic. At the same time, trying to change that would obviously have vast consequences, so it is not something that can simply be changed, even at a major version boundary.
I propose that we start by making a small, safe step in the right direction so we can maybe reach the bigger goal in a far-future release (e.g. 5.x or 6.x). The step I propose is to add a `chars` method that returns a sequence (a read-only view) of single-character strings that behaves exactly as `str` currently does when treated as a collection. Developers would be encouraged to use that for iterating over the characters in a string instead of iterating over the string directly. In maybe Python 4 or 5, directly using a `str` instance as a collection could become a warning.
BTW, while adding `chars`, it might also be nice to have `ords`, which would be a view of the string's character sequence as `ord` integer values.
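To make the proposal concrete, here is a minimal sketch of what the two views might look like. `str` has no `chars` or `ords` method today, so the class names `CharsView` and `OrdsView` and the standalone helper functions are assumptions made purely for illustration.

```python
from collections.abc import Sequence


class CharsView(Sequence):
    """Hypothetical read-only view of a str as length-1 strings."""

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing a view yields another view over the sliced text.
            return CharsView(self._text[index])
        return self._text[index]


class OrdsView(Sequence):
    """Hypothetical read-only view of a str as integer code points."""

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return OrdsView(self._text[index])
        return ord(self._text[index])


# Standalone helpers standing in for the proposed str.chars() / str.ords().
def chars(text):
    return CharsView(text)


def ords(text):
    return OrdsView(text)


print(list(chars("abc")))  # ['a', 'b', 'c']
print(list(ords("abc")))   # [97, 98, 99]
```

Because both classes derive from `collections.abc.Sequence`, iteration, `in`, `index` and `count` come for free from `__len__` and `__getitem__`.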
On Sun, Feb 23, 2020 at 11:26 AM Steve Jorgensen <stevej@stevej.name> wrote:
From reading many of the suggestions in this list (including some of my own, of course), there seem to be a lot of us in the Python coder community who find the current behavior of strings being iterable collections of single-character strings to be somewhat problematic. At the same time, trying to change that would obviously have vast consequences, so it is not something that can simply be changed, even at a major version boundary.
I propose that we start by making a small, safe step in the right direction so we can maybe reach the bigger goal in a far future release (e.g. 5.x or 6.x).
This is assuming that (a) the current behaviour is actually wrong, and (b) it should eventually be changed. Do any of the core devs agree with those two assertions? ChrisA
I don't know if core devs believe it is wrong, but many of the suggestions for change here seem to express that opinion. I try to stay away from "wrong" & "right", but one of the reasons I dislike the current behavior is that it's an issue when trying to traverse a nested graph of collections. Currently, it is necessary to specifically test whether a node is an instance of `str` and stop drilling down in that case. One could say that's a minor issue, but it's a minor issue that affects a lot of development efforts and has to be worked around separately by each developer who encounters it. Another issue is one of consistency. A string is not a sequence of length-1 strings. The length-1 strings created by iterating are more like slices than members.
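The workaround described above usually looks something like the following sketch. The function name `leaves` and the sample data are illustrative only; the `isinstance` check is the part every developer ends up rewriting.

```python
from collections.abc import Iterable


def leaves(node):
    """Yield the non-collection leaves of an arbitrarily nested structure."""
    # Without this str/bytes check, a length-1 string would yield itself
    # forever, because iterating "a" produces "a" again.
    if isinstance(node, (str, bytes)) or not isinstance(node, Iterable):
        yield node
        return
    for child in node:
        yield from leaves(child)


print(list(leaves([1, ["ab", (2, "c")], "def"])))
# [1, 'ab', 2, 'c', 'def']
```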
On Sat, Feb 22, 2020, 5:29 PM Steve Jorgensen <stevej@stevej.name> wrote:
... Currently, it is necessary to specifically test whether a node is an instance of `str` and stop drilling down in that case. One could say that's a minor issue, but it's a minor issue that affects a lot of development efforts and has to be worked around separately by each developer who encounters it. ...
Can you show us before and after code so we can gauge how or if this improves things? I'm a bit skeptical that it does anything other than replace an instance check with a different check. In Java, one of the annoyances (to me) is that strings aren't iterable or subscriptable. I would fix that if I could, which is the opposite of what you propose for Python. --- Bruce
On Sun, Feb 23, 2020 at 01:28:35AM -0000, Steve Jorgensen wrote:
Another issue is one of consistency. A string is not a sequence of length-1 strings.
Isn't it? It sure looks like a sequence of length-1 (sub)strings to me. Of course, strings can also be viewed as a sequence of paragraphs, lines, words, or any other sort of substring you like. Hypertalk allowed you to read strings in arbitrarily nested "chunks":
    get the last character of the first word of the third line of text
with similar syntax for writing and iteration. One might argue that there's nothing particularly useful about characters, and treating strings as sequences of words, lines etc could be more useful. But I don't think we can argue that strings aren't sequences of substrings, including characters.
The length-1 strings created by iterating are more like slices than members.
They certainly aren't members (attributes). You can't access a particular substring using dot notation:
    # Give me the first char in the string
    c = string.0  # Syntax error.
So substrings aren't attributes/members. What would you name them? The same applies to lists and other sequences as well.
-- Steven
It's always been in the back of my mind that directly iterable strings are very often more trouble than they are worth. But I also agree that changing it would probably be extremely disruptive (and also might be more trouble than it's worth). I've only been at it for about 3 years, but because of iterable strings I always seem to regret not using a function like this in many contexts:
    def iter_nostr(iterable):
        if isinstance(iterable, str):
            raise TypeError(f"iterable cannot be a str")
        yield from iter(iterable)
The nature of my coding work is a lot of parsing of different kinds of text files and so I've had to write a lot of functions that are meant to take in an iterable of strings. So the biggest context that comes to mind is to guard against future me mistakenly sending a string into functions that are intended to work with iterables of strings. I always seem to find a way to do this, with all kinds of head scratching results depending on the situation. I learned early on that if I don't include a guard like this, I'm going to pay for it later.
--- Ricky.
"I've never met a Kentucky man who wasn't either thinking about going home or actually going home." - Happy Chandler
On Sat, Feb 22, 2020 at 11:09 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Feb 23, 2020 at 01:28:35AM -0000, Steve Jorgensen wrote:
Another issue is one of consistency. A string is not a sequence of length-1 strings.
Isn't it? It sure looks like a sequence of length-1 (sub)strings to me.
Of course, strings can also be viewed as a sequence of paragraphs, lines, words, or any other sort of substring you like. Hypertalk allowed you to read strings in arbitrarily nested "chunks":
get the last character of the first word of the third line of text
with similar syntax for writing and iteration. One might argue that there's nothing particularly useful about characters, and treating strings as sequences of words, lines etc could be more useful. But I don't think we can argue that strings aren't sequences of substrings, including characters.
The length-1 strings created by iterating are more like slices than members.
They certainly aren't members (attributes). You can't access a particular substring using dot notation:
    # Give me the first char in the string
    c = string.0  # Syntax error.
So substrings aren't attributes/members. What would you name them? The same applies to lists and other sequences as well.
-- Steven _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/XFG4AL... Code of Conduct: http://python.org/psf/codeofconduct/
On Sun, Feb 23, 2020 at 12:10:24AM -0500, Ricky Teachey wrote:
I've only been at it for about 3 years, but because of iterable strings I always seem to regret not using a function like this in many contexts:
    def iter_nostr(iterable):
        if isinstance(iterable, str):
            raise TypeError(f"iterable cannot be a str")
Why is that an f-string? It's a static message. That's kind of like writing `x += eval('1')`.
yield from iter(iterable)
That would be better written as `return iter(iterable)`, and possibly more efficient too (I think).
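Beyond efficiency, the two spellings differ in when the guard fires, which matters for a function whose whole purpose is to catch a mistake early. The sketch below contrasts them; the names `iter_nostr_lazy` and `iter_nostr_eager` are made up for this comparison.

```python
def iter_nostr_lazy(iterable):
    # Generator function: the body does not run until the first next() call.
    if isinstance(iterable, str):
        raise TypeError("iterable cannot be a str")
    yield from iter(iterable)


def iter_nostr_eager(iterable):
    # Plain function: the check runs immediately, at the call site.
    if isinstance(iterable, str):
        raise TypeError("iterable cannot be a str")
    return iter(iterable)


lazy = iter_nostr_lazy("oops")  # no error yet
try:
    next(lazy)
except TypeError as exc:
    print("lazy version failed on first use:", exc)

try:
    iter_nostr_eager("oops")
except TypeError as exc:
    print("eager version failed at the call:", exc)
```

With the `return iter(iterable)` form, the traceback points at the line that passed the string rather than at wherever the iterator is first consumed.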
The nature of my coding work is a lot of parsing of different kinds of text files and so I've had to write a lot of functions that are meant to take in an iterable of strings. So the biggest context that comes to mind is to guard against future me mistakenly sending a string into functions that are intended to work with iterables of strings.
Have you considered writing the functions to accept either a single string or an iterable of strings? Assuming they all take a single iter-of-strings argument as first argument, you could do that with a simple decorator:
    import functools
    def decorate(func):
        @functools.wraps(func)
        def inner(iter_of_strings, *args, **kw):
            # Wrap a lone string in a tuple so it is treated as one item.
            if isinstance(iter_of_strings, str):
                iter_of_strings = (iter_of_strings,)
            return func(iter_of_strings, *args, **kw)
        return inner
-- Steven
raise TypeError(f"iterable cannot be a str")
Why is that an f-string? It's a static message. That's kind of like writing `x += eval('1')`.
Well I had cut it down from another copy/pasted version with a variable in it just for this conversation.
yield from iter(iterable)
That would be better written as `return iter(iterable)`, and possibly more efficient too (I think).
Thanks.
The nature of my coding work is a lot of parsing of different kinds of text files and so I've had to write a lot of functions that are meant to take in an iterable of strings. So the biggest context that comes to mind is to guard against future me mistakenly sending a string into functions that are intended to work with iterables of strings.
Have you considered writing the functions to accept either a single string or an iterable of strings?
Well these are parsing FILES - not only iterables of strings but iterables of LINES - but there are certainly times when that would be useful.
Assuming they all take a single iter-of-strings argument as first argument, you could do that with a simple decorator:
    def decorate(func):
        @functools.wraps(func)
        def inner(iter_of_strings, *args, **kw):
            if isinstance(iter_of_strings, str):
                iter_of_strings = (iter_of_strings,)
            return func(iter_of_strings, *args, **kw)
        return inner
The decorator idea is much more idiomatic Python. Thanks for the suggestion. However, in these kinds of situations, if I ever were accidentally sending in a string, it is much more likely I meant to split it on lines and need to do this:
    if isinstance(iter_of_strings, str):
        iter_of_strings = iter_of_strings.split("\n")
But I've also found that if I try to predict what I meant to do, instead of just reporting that I did something that doesn't seem to make sense, I'm making a mistake... Better just to raise the exception and remind myself that it is not expecting a string...
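The kind of silent failure being guarded against can be reproduced in a few lines. This is an illustrative sketch only; `count_nonblank` is a made-up stand-in for a function that expects an iterable of lines.

```python
def count_nonblank(lines):
    """Count entries that contain something other than whitespace."""
    return sum(1 for line in lines if line.strip())


text = "alpha\n\nbeta\n"

# Intended use: an iterable of lines.
print(count_nonblank(text.splitlines()))  # 2

# Accidental use: the string itself. No error is raised; the function
# quietly counts non-whitespace characters instead of lines.
print(count_nonblank(text))  # 9
```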
I think that the "strings are an iterable of strings", i.e. an iterable of iterables on to infinity... is the last remaining common dynamic type issue with Python.
However, I'd like to see the "solution" be a character type, rather than making strings not iterable, so iterating a string would yield chars, and chars would not be strings themselves, and not be iterable (or a sequence at all).
This would be analogous to other iterables -- they can contain iterables, but if you keep iterating (or indexing), eventually you get to a "scalar", non-iterable value.
This works well for, e.g., numpy, where each index or iteration reduces the rank of the array until you get a scalar.
And we don't seem to have constant problems with functions that expect an iterable of, say, numbers getting passed a single number and thinking it's an iterable of numbers.
But, as stated by the OP, it'll be a long path to get there, so I doubt it's worth it.
In any case, there should be some consensus on what the long term goal is before we start proposing ways to get there.
-CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
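The "iterable of iterables" point can be seen directly at the interpreter. The short sketch below uses only current, documented behaviour: indexing a string never reaches a non-string value, whereas indexing a nested list eventually reaches a scalar.

```python
s = "hello"
first = s[0]

# Indexing a length-1 string gives back an equal length-1 string, forever.
print(first == first[0] == first[0][0])  # True
print(type(first) is type(s))            # True

# A nested list bottoms out at a non-iterable value after two steps.
nested = [[1, 2], [3]]
print(type(nested[0][0]))                # <class 'int'>
```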
Christopher Barker wrote:
I think that the "strings are an iterable of strings", i.e. an iterable of iterables onto infinity... is the last remaining common dynamic type issue with Python. However, I'd like to see the "solution" be a character type, rather than making strings not iterable, so iterating a string would yield chars, and chars would not be strings themselves, and not be iterable (or a sequence at all). This would be analogous to other iterables -- they can contain iterables, but if you keep iterating (or indexing), eventually you get to a "scalar", non iterable value.
I get what you're saying, and I don't categorically disagree, but… In many ways, a string is more useful to treat as a scalar than a collection, so drilling down into collections and ending up iterating individual characters as the leaves is often 1 step too far. I think either making strings not (directly) iterable or making them iterables of chars (that are not strings) would be a step in the right direction. Of those 2 ideas, I slightly prefer the option making strings effectively scalar.
On 23/02/2020 18:33, Steve Jorgensen wrote:
In many ways, a string is more useful to treat as a scalar than a collection, so drilling down into collections and ending up iterating individual characters as the leaves is often 1 step too far.
I think the key word here should be "sometimes". A lot of the time I do treat strings as scalars (or more often, string.split() as a sequence of scalars), but sometimes a string absolutely is a sequence of characters, and I want to be able to treat it as such. I have to admit that most of the time that I end up in trouble because of unexpectedly iterating through a string, it's my own stupid fault. Almost always I have been doing something oversimplistic or wrong-headed, usually returning a string from a function when I meant to return a list of (one) string(s) instead.
-- Rhodri James *-* Kynesim Ltd
On Mon, Feb 24, 2020, 9:27 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
On 23/02/2020 18:33, Steve Jorgensen wrote:
In many ways, a string is more useful to treat as a scalar than a collection, so drilling down into collections and ending up iterating individual characters as the leaves is often 1 step too far.
I think the key word here should be "sometimes". A lot of the time I do treat strings as scalars (or more often, string.split() as a sequence of scalars), but sometimes a string absolutely is a sequence of characters, and I want to be able to treat it as such.
Both of these needs are straightforward to address with my suggested AtomicString.
    def descend(obj, ...):
        if something:
            descend(AtomicString(s) for s in obj.split())
        elif otherthing:
            descend(AtomicString(c) for c in obj)
        else:
            # non-string stuff, e.g. lists
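The definition of `AtomicString` is not shown in this part of the thread, so the following is only a guess at its shape: a `str` subclass whose `__iter__` raises `TypeError`. The runnable `descend` below is likewise a simplified stand-in for the pseudocode above, not the author's implementation.

```python
class AtomicString(str):
    """Hypothetical str subclass that refuses to be iterated."""

    def __iter__(self):
        raise TypeError("AtomicString is not iterable")


def descend(obj):
    """Collect leaves, treating anything that cannot be iterated as a leaf."""
    try:
        children = iter(obj)
    except TypeError:
        return [obj]
    found = []
    for child in children:
        found.extend(descend(child))
    return found


data = [AtomicString("spam"), [AtomicString("eggs"), 42]]
print(descend(data))  # ['spam', 'eggs', 42]

# Splitting into words or characters first gives the other two behaviours.
print(descend(AtomicString(w) for w in "a bc".split()))  # ['a', 'bc']
print(descend(AtomicString(c) for c in "abc"))           # ['a', 'b', 'c']
```

One caveat of this sketch: `isinstance(AtomicString("x"), collections.abc.Iterable)` is still true, so the walker has to probe with `iter()` rather than an ABC check.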
On 2/23/20 1:49 AM, Christopher Barker wrote:
I think that the "strings are an iterable of strings", i.e. an iterable of iterables onto infinity... is the last remaining common dynamic type issue with Python.
However, I'd like to see the "solution" be a character type, rather than making strings not iterable, so iterating a string would yield chars, and chars would not be strings themselves, and not be iterable (or a sequence at all).
This would be analogous to other iterables -- they can contain iterables, but if you keep iterating (or indexing), eventually you get to a "scalar", non iterable value.
This works well, for, e.g. numpy, where each index or iteration reduces the rank of the array until you get a scalar.
And we don't seem to have constant problems with functions that expect an iterable of, say, numbers getting passed a single number and thinking it's an iterable of numbers.
But, as stated by the OP, it'll be a long path to get there, so I doubt it's worth it.
In any case, there should be some consensus on what the long term goal is before we start proposing ways to get there.
-CHB
I would agree with this. In my mind, fundamentally a 'string' is a sequence of characters, not strings, so as you iterate over a string, you shouldn't get another string, but a single character type (which Python currently doesn't have). It would be totally shocking if someone suggested that iterating a list or a tuple should return lists or tuples of 1 element, so why do strings do this?
Why does string act differently? I'm not sure, but I suspect it goes back to some decisions in the beginning of the language. Making strings non-iterable would be a major break in the language. Making the results of iterating over a string not be a string, but a character type that had most of the properties of a string, except it could hold only a single character and wasn't iterable, would break a lot less. Probably the only way to see would be to try implementing this on a branch of the compiler, and see how many existing libraries and open source projects break due to this.
-- Richard Damon
On Sun, Feb 23, 2020 at 01:46:53PM -0500, Richard Damon wrote:
I would agree with this. In my mind, fundamentally a 'string' is a sequence of characters, not strings,
If people are going to seriously propose this Character type, I think they need to be more concrete about the proposal and not just hand-wave it as "strings are sequences of characters". Presumably you would want `mystring[0]` to return a char, not a str, but there are plenty of other unspecified details.
- Should `mystring[0:1]` return a char or a length 1 str?
- Presumably "Z" remains a length-1 str for backward compatibility, so how do you create a char directly?
- Does `chr(n)` continue to return a str?
- Is the char type a subclass of str?
- Do we support mixed concatenation between str and char?
- If so, does concatenating the empty string to a char give a char or a length-1 string?
- Are chars indexable?
- Do they support len()?
If char is not a subclass of string, that's going to break code that expects `all(isinstance(c, str) for c in obj)` to be true when `obj` happens to be a string.
If char is a subclass, that means we can no longer deny that strings are sequences of strings, since chars are strings. It also means that it will break code that expects strings to be iterable,
I don't have a good intuition for how much code will break or simply stop working correctly if we changed string iteration to yield a new char type instead of length-1 strings.
Nor do I have a good intuition for whether this will *actually* help much code. It seems to me that there's a good chance that this could end up simply shifting isinstance tests for str in some contexts to isinstance tests for char in different contexts.
Without a detailed proposal, I don't think we can judge how plausible this change might be.
so as you iterate over a string, you shouldn't get another string, but a single character type (which Python currently doesn't have). It would be totally shocking if someone suggested that iterating a list or a tuple should return lists or tuples of 1 element, so why do strings do this?
Would it be so shocking though? If lists are *linked lists* of nodes, instead of arrays, then:
- iterating over linked lists of nodes gives you nodes;
- and a single node is still a list;
which is not terribly different from the situation with strings.
But in any case, lists are not limited to only containing other lists. That's not the case for strings. You couldn't get a dict or None contained in a string. Strings can only contain substrings, which include length-1 strings.
-- Steven
On Tue, Mar 3, 2020 at 10:13 AM Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Feb 23, 2020 at 01:46:53PM -0500, Richard Damon wrote:
I would agree with this. In my mind, fundamentally a 'string' is a sequence of characters, not strings,
If people are going to seriously propose this Character type, I think they need to be more concrete about the proposal and not just hand-wave it as "strings are sequences of characters".
Presumably you would want `mystring[0]` to return a char, not a str, but there are plenty of other unspecified details.
- Should `mystring[0:1]` return a char or a length 1 str?
I'm not seriously proposing it, and I am in fact against the proposal quite strongly, but ISTM the only sane way to do things is to mirror the Py3 bytes object. Just as mybytes[0] returns an int, not a bytes, this should return a char. And that can then be the pattern for anything else that's similar.
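The bytes precedent referred to here is existing Python 3 behaviour and can be checked directly. The snippet below just places it next to the current `str` behaviour for comparison.

```python
data = b"abc"
print(data[0])     # 97       indexing gives an int, not bytes
print(data[0:1])   # b'a'     slicing gives bytes
print(list(data))  # [97, 98, 99]

text = "abc"
print(text[0])     # a        indexing gives a str
print(text[0:1])   # a        slicing also gives a str
print(type(text[0]) is type(text[0:1]))  # True
```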
- Presumably "Z" remains a length-1 str for backward compatibility, so how do you create a char directly?
There would probably need to be an alternative literal form. In C, "Z" is a string, and 'Z' is a char; in Python, a more logical way to do it would probably be a prefix like c"Z" - or perhaps just "Z"[0] and have done with it.
- Does `chr(n)` continue to return a str?
Logically it should return a char, and in fact would probably want to be the type, just as str/int/float etc are.
- Is the char type a subclass of str?
That way lies madness. I suggest not.
- Do we support mixed concatenation between str and char?
For the sake of backward compatibility, probably yes. But that's a weak opinion and could easily be swayed.
- If so, does concatenating the empty string to a char give a char or a length-1 string?
A length 1 string (or, per above, TypeError).
- Are chars indexable?
- Do they support len()?
No. A character is a single entity, just as an integer is. (NOTE: This discussion has been talking about "characters", but I think logically they have to be single Unicode codepoints. Thus the "length" of a character is not a meaningful quantity.)
If char is not a subclass of string, that's going to break code that expects that `all(isinstance(c, str) for c in obj)` to be true when `obj` happens to be a string.
Backward compatibility WOULD be broken by this proposal (which is part of why I'm so against it). This is one of those eggs that has to be broken to make this omelette.
If char is a subclass, that means we can no longer deny that strings are sequences of strings, since chars are strings. It also means that it will break code that expects strings to be iterable,
And that's why I say this way lies madness.
I don't have a good intuition for how much code will break or simply stop working correctly if we changed string iteration to yield a new char type instead of length-1 strings.
Nor do I have a good intuition for whether this will *actually* help much code. It seems to me that there's a good chance that this could end up simply shifting isinstance tests for str in some contexts to isinstance tests for char in different contexts.
Agreed. ChrisA
On Mar 2, 2020, at 15:13, Steven D'Aprano <steve@pearwood.info> wrote:
On Sun, Feb 23, 2020 at 01:46:53PM -0500, Richard Damon wrote:
I would agree with this. In my mind, fundamentally a 'string' is a sequence of characters, not strings,
If people are going to seriously propose this Character type, I think they need to be more concrete about the proposal and not just hand-wave it as "strings are sequences of characters".
I actually wrote a half-proposal on this several years back (plus a proof-of-concept implementation that adds the chr type, but doesn’t change str to use it or interact with it), but decided the backward compatibility problems were too big to go forward. I can dig it up if anyone’s interested, but I can summarize it and answer your questions from memory.
The type is called chr, it represents a single Unicode code point, and it can be constructed from a chr, a length-1 str, or an int. It has an __int__ method but not an __index__. The repr is chr("A"); the str is just A. It is not Iterable; this is the whole point, after all: that recursing on iter(element) hits a base case.
Note that a chr cannot represent an extended grapheme cluster; that would have to be represented as a str, or as some new type that’s also a Sequence[chr]. But since Python strings don’t act like sequences of EGCs today, that’s not a new problem. It could be used to hold code units (so bytes could also become a sequence of chr instead of int), but that seemed too confusing (if chr(196) could be either a UTF-8 lead byte or the single character 'Ä' depending on how you got it… that feels like reopening the door to the same problems Python 3 eliminated).
It has some of the same methods as str, like lower, but not those that are container-ish like find. (What about encode? translate? I can’t remember.) Meanwhile, all the container-ish methods on str (including __contains__) can now take a chr, but can also still take a str (in which case they still do substring rather than containment tests). This is a little weird, but that weirdness is already in str today (x in y does not imply any(a==x for a in y) today; it would become true if x is chr but still not when x is str), and convenient enough that you wouldn’t want to get rid of it.
It would be really nice to be able to construct a str from any Iterable[chr], but of course that can’t work. So, how do you go back to a str once you’ve converted into a different Iterable of chr? You need a new alternate constructor classmethod like str.fromchars, I think.
IIRC, you can add chr to each other and to str, getting a str. You can also multiply them, getting a str (even chr("A")*1 is a str, not a chr). Again, it’s a little weird for non-sequence types to have sequence-like concat/repeat, but I don’t think it looks confusing in actual examples, and again, it’s convenient enough that I think it’s worth it. I considered adding and subtracting chr+int (which does the same as chr(ord(self)+other), which makes it easier to translate a lot of C code), but without chr%int that feels incomplete, while with chr%int it feels confusing (it would be completely different from str%, and also, having % be arithmetic but * be repeat on the same type just seems very wrong).
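A rough sketch of a type with the properties summarized above may make the trade-offs easier to see. It is not the proof-of-concept mentioned in the message; the class is named `Char` here (an assumption, to avoid shadowing the builtin `chr`), and whether a `Char` should compare equal to a length-1 `str` is left out because the message treats that as an open problem.

```python
class Char:
    """Hypothetical single-code-point type; not part of Python."""

    __slots__ = ("_cp",)

    def __init__(self, value):
        if isinstance(value, Char):
            self._cp = value._cp
        elif isinstance(value, str):
            if len(value) != 1:
                raise ValueError("expected a length-1 str")
            self._cp = ord(value)
        elif isinstance(value, int):
            chr(value)  # raises ValueError outside the Unicode range
            self._cp = value
        else:
            raise TypeError("expected Char, length-1 str, or int")

    def __int__(self):  # note: deliberately no __index__
        return self._cp

    def __str__(self):
        return chr(self._cp)

    def __repr__(self):
        return f"chr({str(self)!r})"

    def __eq__(self, other):
        # Only Char-to-Char equality is defined in this sketch.
        if isinstance(other, Char):
            return self._cp == other._cp
        return NotImplemented

    def __hash__(self):
        return hash(("Char", self._cp))

    # Concatenation and repetition always produce a str, never a Char.
    def __add__(self, other):
        if isinstance(other, (Char, str)):
            return str(self) + str(other)
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, str):
            return other + str(self)
        return NotImplemented

    def __mul__(self, count):
        if isinstance(count, int):
            return str(self) * count
        return NotImplemented

    __rmul__ = __mul__


a = Char("A")
print(repr(a), str(a), int(a))   # chr('A') A 65
print(a + Char("B"), "x" + a)    # AB xA
print(type(a * 1).__name__)      # str
```

Since the class defines neither `__iter__`, `__getitem__` nor `__len__`, calling `iter(a)` or `len(a)` raises `TypeError`, which is the base case a recursive walker would rely on.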
Presumably you would want `mystring[0]` to return a char, not a str, but there are plenty of other unspecified details.
- Should `mystring[0:1]` return a char or a length 1 str?
A str. That’s how all sequences work—slicing a sequence (even a len=1 slice) never returns an element; it always returns a sequence of the same type (or, for some third party types, a view that duck types as the same type).
- Presumably "Z" remains a length-1 str for backward compatibility, so how do you create a char directly?
chr("Z") or, if you really want to, "Z"[0] would also work with two fewer keystrokes.
- Does `chr(n)` continue to return a str?
No, because it’s the constructor of the new type. This is a backward compatibility problem, of course. Which could be solved by naming the type something else, like char, and making chr still return a str, but IIRC, this backward compatibility problem is subsumed in the larger one below, so there’s no point fixing just this one. (Also, any new name you come up with probably already appears in lots of existing code, which would add confusion; reusing chr doesn’t have that problem.)
- Is the char type a subclass of str?
No. It doesn’t support much of the str API, including all of the ABCs, so that would badly break substitutability. Also, it would defeat the purpose of having a separate type.
- Do we support mixed concatenation between str and char?
Yes. See above.
- If so, does concatenating the empty string to a char give a char or a length-1 string?
str+chr, chr+str, and chr+chr all return str, always. Just like chr*int returns str even when the int is 1.
- Are chars indexable?
No.
- Do they support len()?
No.
If char is not a subclass of string, that's going to break code that expects that `all(isinstance(c, str) for c in obj)` to be true when `obj` happens to be a string.
Yes. This is the biggest backward compatibility problem, and the one that made me abandon the proposal before sharing it. A str is an Iterable of str today, and there is code that expects this. Not that often directly, but indirectly it comes up all the time.
A lot of code just duck-types the elements in ways that would continue to work. But the big problem is the same one you face with any attempt to duck type str: zillions of C extension functions, both stdlib and third party, will only take a str, which you can call the PyUnicode API on. IIRC, Nick Coghlan tried creating an “encoded_str” type that is-a str and also is-a bytes (and knows its encoding), and wrote up all the problems he ran into, and most of them applied here as well. The most obvious one is str.join (and fixing that to take an Iterable[str|chr] would solve 90% of the problems with student exercises), but it’s just one of a huge number of similar functions (e.g., IIRC, _pyio.TextIOWrapper.write works, but the C version doesn’t, although I think there was a bug report for that one, so maybe it’s not true anymore), and there’s no way to fix them all.
There’s another issue with whether "Z"==chr("Z") or not. If not, a lot of student exercise code and quick&dirty scripting breaks: ''.join({'a': 'n', 'b': 'o', …}.get(c, c) for c in a) won’t work anymore (although similar code using str.translate, or dict(zip("abc…", "nop…")) does still work). But if so… I can’t remember the problem (something to do with a cache somewhere; probably?), but there was one.
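Two of the behaviours this discussion leans on can be checked in today's Python: substring containment that differs from element containment, and the dictionary-lookup idiom that depends on iteration yielding `str`. The two-entry table and the input `"abba!"` are illustrative stand-ins for the elided mapping in the message.

```python
haystack = "abc"
needle = "ab"

# `in` on a str is a substring test, not an element test.
print(needle in haystack)                    # True
print(any(ch == needle for ch in haystack))  # False
print(needle in list(haystack))              # False

# The "student exercise" idiom: each c is a str, so it works both as a
# dict key and as an argument to str.join.
table = {"a": "n", "b": "o"}
print("".join(table.get(c, c) for c in "abba!"))  # noon!
```

If iteration produced a separate character type that did not compare (and hash) equal to the matching length-1 `str`, the `table.get(c, c)` lookup would always miss and `join` would receive non-`str` items.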
If char is a subclass, that means we can no longer deny that strings are sequences of strings, since chars are strings. It also means that it will break code that expects strings to be iterable,
Right, but it’s already so wrong that you don’t even have to get this far to rule it out. It’s the non-subclass answer that’s attractive (but ultimately, I think, still doesn’t pan out).
I don't have a good intuition for how much code will break or simply stop working correctly if we changed string iteration to yield a new char type instead of length-1 strings.
A lot more than I expected. :)
Nor do I have a good intuition for whether this will *actually* help much code.
In cases where you really do want to just flatten/recurse/whatever infinitely, it solves the problem perfectly. But the reality is that often you want to treat strings as atoms, not iterables of characters, so you’d still need the same switching code as today—and I’m not sure debugging problems related to chr not being usable as a str is any easier than debugging the RecursionError. For example, imagine code to dump all the “leaf values” in a JSON document. If you just blindly recurse into everything Iterable, you get a RecursionError and slap yourself and fix it. If you instead successfully write a file full of single characters, you won’t discover the problem any faster… It still might be a worthwhile tradeoff in a new Python-like language; I’m really not sure. I think I’d actually want to have a chr type, but have str not be Iterable, and instead have a property that was (in fact, separate properties for iterating code units, EGCs, and UTF-8 bytes, all as different types).
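As a rough illustration of the "switching code" mentioned above, here is one way the JSON leaf-value example might look today. The function name `leaves` and the sample document are made up for this sketch; the point is only that strings have to be checked for before the generic Iterable branch.

```python
import json
from collections.abc import Iterable, Mapping


def leaves(node):
    """Yield every leaf value of a decoded JSON document."""
    if isinstance(node, (str, bytes)):
        # Treat strings as atoms; without this branch the Iterable
        # branch below would recurse into 1-character strings forever.
        yield node
    elif isinstance(node, Mapping):
        for value in node.values():
            yield from leaves(value)
    elif isinstance(node, Iterable):
        for item in node:
            yield from leaves(item)
    else:
        yield node


document = json.loads('{"a": ["xy", 1, {"b": null}], "c": "z"}')
leaf_values = list(leaves(document))
```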
so as you iterate over a string, you shouldn't get another string, but a single character type (which Python currently doesn't have). It would be totally shocking if someone suggested that iterating a list or a tuple should return lists or tuples of 1 element, so why do strings do this?
Would it be so shocking though?
If lists are *linked lists* of nodes, instead of arrays, then:
- iterating over linked lists of nodes gives you nodes;
- and a single node is still a list;
Well, usually you’d want the iterator to yield just the node.value, not the node itself, at least if you’re thinking Lisp/ML/Haskell style with cons lists. And if you’re not thinking cons lists, often the list is a handle object (e.g., if you want to mutably insert an element at the start, or if you want a double-linked list, etc.), as in C++, not a node. But there are cases where you’re dealing with complex objects that are internally linked up (e.g., I used to deal with a ton of C code that dealt with internally linked database/filesystem-style extents lists, and the handle is not the list object but an implicit thing somewhere else, and the list itself is just the node), and for cases like that, your point is valid.
The main reason for not having both characters and strings is reducing complexity. Why try to add this now for no apparent net benefit? I think the situation with bytes (iteration returning integers instead of bytes) has shown that this is not a very user-friendly or intuitive approach:
>>> b = bytes((1,2,3,4))
>>> b
b'\x01\x02\x03\x04'
>>> b[:2]
b'\x01\x02'
>>> b[:1]
b'\x01'
>>> b[0]
1  # yeah, right :-)
-- Marc-Andre Lemburg, eGenix.com (Mar 03 2020)
On Mar 3, 2020, at 01:09, M.-A. Lemburg <mal@egenix.com> wrote:
The main reason for not having both characters and strings is reducing complexity. Why try to add this now for no apparent net benefit?
I don’t think the benefit is worth the (as far as I can tell insurmountable) backward compatibility cost, but you can’t argue that there is no benefit.

An object whose first element is itself is a valid idea, but it’s a pathological case; you have to write something like `lst=[]; lst.append(lst)` to get one. So code like this is fine:

    from collections.abc import Iterable

    def flatten(xs):
        for x in xs:
            if isinstance(x, Iterable):
                yield from flatten(x)
            else:
                yield x

… in that it only infinitely recurses if you go out of your way to give it an infinitely recursive value.

… except that every string is an infinitely recursive value, so all you have to do is give it 'A'. Which is not just weird in theory; it breaks perfectly sensible code like flatten. And it’s why we have to have idioms like endswith taking a str|Tuple[str] rather than any Iterable: forcing people to write s.endswith(tuple(suffixes)) when suffixes is a set is the only reasonable way to avoid confusion when suffixes is an arbitrary iterable.

And, because it comes up all the time, and many other languages don’t have this problem, it has to be explained to new students and people coming from other languages, and painfully remembered or relearned by people who usually work in Java or whatever but occasionally have to do Python.

Of course regular Python developers have this drummed into their heads, and usually remember to check for str and handle it specially, and we’ve all learned to deal with the tuple-special idiom, and so on. But that doesn’t mean it’s an ideal design, just that we’ve all gotten used to it.
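For readers who want to see the failure concretely, the following sketch runs the flatten function from the message above on a nested list and on a one-character string. The helper `hits_recursion_limit` and the sample values are made up for this illustration.

```python
from collections.abc import Iterable


def flatten(xs):
    for x in xs:
        if isinstance(x, Iterable):
            yield from flatten(x)
        else:
            yield x


def hits_recursion_limit(value):
    """Return True if flattening `value` ends in a RecursionError."""
    try:
        list(flatten(value))
    except RecursionError:
        return True
    return False


nested = [1, [2, [3, 4]], (5,)]   # flattens normally
suffixes = {".py", ".pyi"}        # a set, so endswith() needs tuple(suffixes)
is_python_file = "module.py".endswith(tuple(suffixes))
```

Here `hits_recursion_limit("A")` is true because `"A"[0]` is again `"A"`, so the recursion never reaches a non-iterable leaf. Passing the set directly to `endswith` raises `TypeError`, which is the tuple-only idiom the message refers to.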
I think the situation with bytes (iteration returning integers instead of bytes) has shown that this is not a very user-friendly or intuitive approach:
Well, it shows that using integers is confusing. In fact, it’s even worse than C, where char is an integral type but at least not the same type as int. (A char ranges from 0 to 255; its default output and input in functions like printf, and C++ streams, is as a character rather than as a number; there are a bunch of character-related functions that take char but not int, although using them with an int is usually just a warning rather than an error; etc.) That doesn’t mean a new type would be confusing:
>>> b = bytes((1,2,3,4))
>>> b
b'\x01\x02\x03\x04'
>>> b[:2]
b'\x01\x02'
>>> b[:1]
b'\x01'
>>> b[0]
byte(b'\x01')
In fact, it would make bytes consistent with other sequences of byte:
>>> s = list(b)
>>> s[:1]
[byte(b'\x01')]
>>> s[0]
byte(b'\x01')
… without adding any new inconsistencies:
assert tuple(b[:2]) == tuple(s[:2])
assert b[0] == s[0]
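For comparison, here is what the same session looks like in Python 3 as it exists today, where the elements are plain `int` values rather than the hypothetical `byte` type. The two consistency checks hold in both cases; what would change is only the element type.

```python
b = bytes((1, 2, 3, 4))
s = list(b)

first = b[0]        # an int today, not a length-1 bytes object
prefix = b[:1]      # slicing still gives bytes

# The same two checks from the message, using today's int elements.
slices_agree = tuple(b[:2]) == tuple(s[:2])
elements_agree = b[0] == s[0]
```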
The downside, of course, is having one more builtin type. But that’s not an instant disqualifier; it’s a cost to trade off with the benefits. I think if it weren’t for backward compatibility, chr might turn out to be useful enough to qualify (byte I’m much less confident of—it comes up less often, and also once you start bikeshedding the interface there’s a lot more vagueness in the concept), or at least worth having a PEP to explain why it’s rejected. (But of course “if not for backward compatibility” isn’t realistic.)
On Sun, Feb 23, 2020 at 1:15 PM Ethan Furman <ethan@stoneleaf.us> wrote:
On 02/22/2020 04:37 PM, Chris Angelico wrote:
Do any of the core devs agree with those two assertions?
If posts to -Ideas required core dev agreement this would be an empty list indeed.
Heh true. Still, there's not a lot of point discussing a minor first step if it's heading towards something that isn't likely ever to happen. ChrisA
On Sun, 23 Feb 2020 at 02:52, Chris Angelico <rosuav@gmail.com> wrote:
On Sun, Feb 23, 2020 at 1:15 PM Ethan Furman <ethan@stoneleaf.us> wrote:
On 02/22/2020 04:37 PM, Chris Angelico wrote:
Do any of the core devs agree with those two assertions?
If posts to -Ideas required core dev agreement this would be an empty list indeed.
Heh true. Still, there's not a lot of point discussing a minor first step if it's heading towards something that isn't likely ever to happen.
To me, the key point is that the proposal currently starts from the presumption that "a lot of people here find the current behaviour problematic" and leaps straight to suggesting a first step on the way to a solution. That's not how things work. Incremental changes are an implementation plan - you need to get agreement that there's a problem to be solved, and to the ultimate solution, before looking for agreement on step 1 of the implementation. What this proposal lacks is any argument intended to persuade the people for whom the current behaviour is *not* an issue, that there's a problem here that needs solving. I'm 100% happy to say that *if* we wanted to make strings non-iterable, starting by adding a `.chars()` method would be a reasonable idea. But I'm still a strong -1 on this proposal because I *don't* want to make strings non-iterable. It's possible I could be persuaded that non-iterable strings are worth changing to - so I'm fine with someone trying. But as it stands, this proposal is irrelevant to me because it's based on assumptions that I don't agree with. Framing the discussion as "here's an easy step on the way to the solution, let's do that", feels to me like precisely the sort of "thin end of the wedge" argument that is normally flagged as a *flaw* in a proposal ("yes, this looks good at first but won't it open the door to this bad consequence?") Using that sort of argument as a deliberate way of getting to a solution that many people aren't comfortable with seems at best mistaken, and at worst fairly manipulative. Whatever the OP's intentions were (and I assume they were good) framing the argument like this isn't helping their case, IMO. Paul
In order for this proposal to be seriously considered, I think it's necessary to cite many realistic examples where the current behavior is problematic enough to justify changing the current behavior, and that adding a str.chars() and eventually removing the ability to iterate over strings would provide a better fix than any existing solutions. Simply stating that much of the Python community considers the current behavior to be "somewhat problematic" based on previous suggestions is certainly not going to be adequately convincing, at least not to me.

But, even in the case that we are able to conclude that the current behavior causes significant issues that can't be addressed as well with existing solutions, I strongly suspect that this is going to be a case of a change that is far too fundamental to Python to be made at this point, even with a distant deprecation warning and many years of advance notice regarding the removal/change after that.

I could be wrong, but the level of potential breakage that would occur seems to be on the level of Python 2 => 3 bytes changes (if not worse in some ways); which I suspect very few people want to repeat. Not to mention that pretty much all of the textbooks, documentation, and tutorials that cover Python fundamentals or have examples that iterate over strings (which is very common) would have to be updated. The argument in favor of changing the current behavior at any point in time, regardless of how far away, would have to be incredibly convincing.

For the time being, I'm -1 on both adding a chars() and an eventual deprecation => removal of iterating over strings directly. At the present moment, I don't think it's been clearly established that the current behavior is even problematic or wrong in the first place, not to mention problematic enough to justify a massive breaking change.

On Sat, Feb 22, 2020 at 7:28 PM Steve Jorgensen <stevej@stevej.name> wrote:
From reading many of the suggestions in this list (including some of my own, of course) there seem to be a lot of us in the Python coder community who find the current behavior of strings being iterable collections of single-character strings to be somewhat problematic. At the same time, trying to change that would obviously have vast consequences, so not something that can simply be changed, even with at a major version boundary.
I propose that we start by making a small, safe step in the right direction so we can maybe reach the bigger goal in a far future release (e.g. 5.x or 6.x).
The step I propose is to add a `chars` method that returns a sequence (read-only view) of single-character strings that behaves exactly the same as `str` currently does when treated as a collection. Encourage developers to use that for iterating over characters in a string instead of iterating over the string directly. In maybe Python 4 or 5, directly using an `str` instance as a collection could become a warning.
BTW, while adding `chars`, it might also be nice to have `ords` which would be a view of the string's character sequence as `ord` integer values.
Kyle Stanley wrote:
In order for this proposal to be seriously considered, I think it's necessary to cite many realistic examples where the current behavior is problematic enough to justify changing the current behavior, and that adding a str.chars() and eventually removing the ability to iterate over strings would provide a better fix than any existing solutions. Simply stating that much of the Python community considers the current behavior to be "somewhat problematic" based on previous suggestions is certainly not going to be adequately convincing, at least not to me. But, even in the case that we are able to conclude that the current behavior causes significant issues that can't be addressed as well with existing solutions, I strongly suspect that this is going to be a case of a change that is far too fundamental to Python to be made at this point, even with a distant deprecation warning and many years of advance notice regarding the removal/change after that.
I agree. I can barely imagine what is wrong with Python's strings. Can you please provide any example? It is a common "pattern" in any language to walk along strings, letter by letter. Python's strings provide a powerful way of doing it --as a sequence which is a fundamental type in the language. No need to deal with indexes, terminators, lengths, boundaries, etc. I love Python because of this and hate C's and Java's strings. On the other hand, what about slices? Since slices operate on sequences and lists, if strings are no longer sequences, how would Python do slicing on strings according to your proposal? I think strings as immutable sequences is indeed a wise implementation decision in Python. Thank you.
jdveiga@gmail.com wrote:
Kyle Stanley wrote:
In order for this proposal to be seriously considered, I think it's necessary to cite many realistic examples where the current behavior is problematic enough to justify changing the current behavior, and that adding a str.chars() and eventually removing the ability to iterate over strings would provide a better fix than any existing solutions. Simply stating that much of the Python community considers the current behavior to be "somewhat problematic" based on previous suggestions is certainly not going to be adequately convincing, at least not to me. But, even in the case that we are able to conclude that the current behavior causes significant issues that can't be addressed as well with existing solutions, I strongly suspect that this is going to be a case of a change that is far too fundamental to Python to be made at this point, even with a distant deprecation warning and many years of advance notice regarding the removal/change after that.
I agree. I can barely imagine what is wrong with Python's strings. Can you please provide any example? It is a common "pattern" in any language to walk along strings, letter by letter. Python's strings provide a powerful way of doing it --as a sequence which is a fundamental type in the language. No need to deal with indexes, terminators, lengths, boundaries, etc. I love Python because of this and hate C's and Java's strings. On the other hand, what about slices? Since slices operate on sequences and lists, if strings are no longer sequences, how would Python do slicing on strings according to your proposal? I think strings as immutable sequences is indeed a wise implementation decision in Python. Thank you.
The only change I am proposing is that the iterability for characters in a string be moved from the string object itself to a view that is returned from a `chars()` method of the string. Eventually, direct iterability would be deprecated and then removed. I do not want indexing behavior to be moved, removed, or altered, and I am not suggesting that it would/should be.
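To make the idea of a "view returned from `chars()`" more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `str` has no `chars()` method, so a free function `chars` and a class `CharsView` stand in for it, and the real proposal might behave differently in details such as slicing.

```python
from collections.abc import Sequence


class CharsView(Sequence):
    """Hypothetical read-only view of a str as a sequence of 1-char strings."""

    __slots__ = ("_s",)

    def __init__(self, s):
        self._s = s

    def __len__(self):
        return len(self._s)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing a view gives another view (an assumption of this sketch).
            return CharsView(self._s[index])
        return self._s[index]

    def __repr__(self):
        return f"CharsView({self._s!r})"


def chars(s):
    """Stand-in for the proposed str.chars() method."""
    return CharsView(s)
```

One side effect worth noticing: because `Sequence` supplies an element-wise `__contains__`, `"ab" in chars("abc")` is false even though `"ab" in "abc"` is true.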
On Sun, Feb 23, 2020 at 08:51:55PM -0000, Steve Jorgensen wrote:
The only change I am proposing is that the iterability for characters in a string be moved from the string object itself to a view that is returned from a `chars()` method of the string. Eventually, direct iterability would be deprecated and then removed.
I do not want indexing behavior to be moved, removed, or altered, and I am not suggesting that it would/should be.
You can't have both of those behaviours at the same time. Well, technically you might be able to get that to work[1], but it would be awfully surprising.

Fundamentally, iteration is equivalent to repeated indexing. To break that invariant would make strings the mother of all special cases breaking the rules.

Python has not just the "iterator protocol" using `__iter__` and `__next__`, but also has an older sequence protocol used in Python 1.x which still exists to this day. This sequence protocol falls back on repeated indexing.

Conceptually, we should be able to reason that every object that supports indexing should be iterable, without adding a special case exception "...except for str".

[1] Setting `__iter__` to None, or having it raise TypeError, appears to work. Although perhaps it shouldn't?

--
Steven
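Both points in this message can be tried out directly. The sketch below uses two made-up classes: `Indexed` defines only `__getitem__` and is still iterable through the older sequence protocol, while `IndexedNotIterable` sets `__iter__` to None as in footnote [1] and keeps indexing but refuses iteration.

```python
class Indexed:
    """Supports indexing only; iteration falls back on repeated indexing."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, index):
        # IndexError from the underlying data ends the fallback iteration.
        return self._data[index]


class IndexedNotIterable(Indexed):
    """Same indexing, but iteration is explicitly switched off."""

    __iter__ = None


def is_iterable(obj):
    """Return True if iter(obj) succeeds."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True
```

With these definitions, `list(Indexed("abc"))` works without any `__iter__`, while `IndexedNotIterable("abc")[0]` works but `for c in IndexedNotIterable("abc")` raises `TypeError`.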
Steven D'Aprano wrote:
On Sun, Feb 23, 2020 at 08:51:55PM -0000, Steve Jorgensen wrote: Python has not just the "iterator protocol" using __iter__ and __next__, but also has an older sequence protocol used in Python 1.x which still exists to this day. This sequence protocol falls back on repeated indexing. Conceptually, we should be able to reason that every object that supports indexing should be iterable, without adding a special case exception "...except for str".
Strings already have an exception in this area. Usually `x in y` means `any(x == elem for elem in y)`. It makes the two meanings of `in` match, and to me (I don't know if this is true) it's the reason that iterating over dictionaries yields the keys, although personally I'd find it more convenient if it yielded the items. But `in` means something else for strings. It's not as strong a rule as the link between iteration and indexing, but it is a break in tradition.

Another somewhat related example: we usually accept that basically every object can be treated as a boolean, even more so if it has a `__len__`. But numpy and pandas break this 'rule' by raising an exception if you try to treat an array as a boolean, e.g:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In a sense these libraries decided that while unambiguous behaviour could be defined, the intention of the user would always be ambiguous. The same could be said for strings. Needing to iterate over a string is simply not common unless you're writing something like a parser. So even though the behaviour is well defined and documented, when someone tries to iterate over a string, statistically we can say there's a good chance that's not what they actually want. And in the face of ambiguity, refuse the temptation to guess.

I do think it would be a pity if strings broke the tradition of indexable implies iterable, but "A Foolish Consistency is the Hobgoblin of Little Minds". The benefits in helping users when debugging would outweigh the inconsistency and the minor inconvenience of adding a few characters. Users who are expecting iteration to work because indexing works will quickly get a helpful error message and fix their problem. At the risk of overusing classic Python sayings, Explicit is better than implicit.
However, we could get the benefit of making debugging easier without having to actually *break* any existing code if we just raised a warning whenever someone iterates over a string. It doesn't have to be a deprecation warning and we don't need to ever actually make strings non-iterable. I'm out of time, so I'll just quickly say that I prefer `.chars` as a property without the `()`. And jdveiga you asked what would be the advantage of all this after I made my previous post about it biting beginners, I'm not sure if you missed that or you were just typing yours when I made mine.
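The warn-but-keep-working idea can be approximated today with a subclass. This is only a sketch under loud assumptions: `WarnStr` is a made-up name, the warning text and category are invented, and `chars()` is written as a method returning a plain iterator rather than the view or property discussed in the thread.

```python
import warnings


class WarnStr(str):
    """Hypothetical str subclass that warns on direct iteration."""

    def __iter__(self):
        # Iteration still works; the warning only flags a possible mistake.
        warnings.warn(
            "iterating directly over a string; use .chars() if this is intended",
            stacklevel=2,
        )
        return super().__iter__()

    def chars(self):
        # Explicit spelling of "iterate over characters": no warning.
        return str.__iter__(self)
```

Indexing, slicing and every other `str` operation are untouched, so code that iterates on purpose keeps running and only sees the warning.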
Alex Hall wrote:
Strings already have an exception in this area. Usually x in y means any(x == elem for elem in y). It makes the two meanings of in match
Actually, `in` means the same in strings, in sequences, in lists, etc. But, in Python's view, a string is composed of sub-strings, the smallest one being a character. So `in` looks for substrings and characters in strings. Looks strange to you? It is coherent with the nature of strings in Python.
, and to me (I don't know if this is true) it's the reason that iterating over dictionaries yields the keys, although personally I'd find it more convenient if it yielded the items.
Oh, I think that dictionaries iterate over keys because you can get the associated item using the key (not vice versa). But probably this is not the real reason.
Another somewhat related example: we usually accept that basically every object can be treated as a boolean, even more so if it has a __len__. But numpy and pandas break this 'rule' by raising an exception if you try to treat an array as a boolean,
Well, I think that some objects (mainly those implementing __bool__ and some related magic methods) can be evaluated as boolean. Of course, __bool__ and __len__ can be overridden and objects can refuse to be evaluated as boolean as you pointed out. I do not see the analogy with the str class... Every object, every class can be redefined in Python...
In a sense these libraries decided that while unambiguous behaviour could be defined, the intention of the user would always be ambiguous. The same could be said for strings. Needing to iterate over a string is simply not common unless you're writing something like a parser. So even though the behaviour is well defined and documented, when someone tries to iterate over a string, statistically we can say there's a good chance that's not what they actually want.
Are you implying that developers are wrong when they iterate over strings? Seriously? Does it matter in any case? Strings must be defined in Python in some way. Immutable sequence strings was the choice. If they are sequences, they must behave as sequences. If they were foo objects, they must behave foo objects. The implementation, the syntax, and the semantics of strings are coherent in Python. If you want to change this, you must change the foundations of Python strings. We should not be wrong about this. Ultimately, it does matter how many people iterate on strings. That is not the question.
And in the face of ambiguity, refuse the temptation to guess. I do think it would be a pity if strings broke the tradition of indexable implies iterable, but "A Foolish Consistency is the Hobgoblin of Little Minds". The benefits in helping users when debugging would outweigh the inconsistency and the minor inconvenience of adding a few characters. Users who are expecting iteration to work because indexing works will quickly get a helpful error message and fix their problem. At the risk of overusing classic Python sayings, Explicit is better than implicit. However, we could get the benefit of making debugging easier without having to actually break any existing code if we just raised a warning whenever someone iterates over a string. It doesn't have to be a deprecation warning and we don't need to ever actually make strings non-iterable.
I do not agree at all. Every programming language makes a compromise. Languages are defined by what they do and what they do not. Python has chosen to be an immutable string sequence language and in my humble opinion it has been coherent with that choice. Other alternatives could have been chosen, of course. But they were not. It is not a question of right or wrong, better or worse. It is a question of being consistent. And, I should say, Python is consistent in this particular point.
I'm out of time, so I'll just quickly say that I prefer .chars as a property without the (). And jdveiga you asked what would be the advantage of all this after I made my previous post about it biting beginners, I'm not sure if you missed that or you were just typing yours when I made mine.
Yeah, I was typing and, yeah, I had answered you as soon as I saw your message. Sorry, but I do not agree with you once again ;-)
A library implemented in a confusing way is not an example of anything wrong with Python strings. (I myself have made this stupid mistake many times and I can blame neither Python nor sqlite for being careless.) In my humble opinion, your example does not prove that iterable strings are faulty. They can be tricky on some occasions, I admit it... but there are many tricks in all programming languages, especially for newbies (currently I am trying to learn Lisp... again).
In a sense we agree. Python strings are not wrong or faulty. I think both sides of this thread are making good points, but it's ultimately a very academic discussion. Strings blur the line between scalars and iterables. Them being iterable is a bit weird sometimes and can make some code messier but it's easy enough to deal with when you know what you're doing. That kind of thing is not a good enough reason to make any drastic changes. But as you say, they can be tricky, and that's a real problem worth paying serious attention to. I don't understand your dismissal that there are many tricks in all languages. Sure that's inevitable to a degree, but shouldn't we try to make things less tricky where we can? Python strives to be easy to use and easy to learn for beginners. Accidentally iterating over strings has probably caused many hours of frustration and confusion. It probably doesn't have that effect on anyone in this mailing list because we understand Python deeply, but we need to consider the beginner's perspective.
Actually, `in` means the same in strings, in sequences, in lists, etc.
No, it really doesn't. `x[start:end] in x` is generally only True for strings, not any other collection. Quoting from https://docs.python.org/3/reference/expressions.html#membership-test-operati... :

For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y).

For the string and bytes types, x in y is True if and only if *x* is a substring of *y*.

For user-defined classes which do not define __contains__() <https://docs.python.org/3/reference/datamodel.html#object.__contains__> but do define __iter__() <https://docs.python.org/3/reference/datamodel.html#object.__iter__>, x in y is True if some value z, for which the expression x is z or x == z is true, is produced while iterating over y.

Lastly, the old-style iteration protocol is tried: if a class defines __getitem__() <https://docs.python.org/3/reference/datamodel.html#object.__getitem__>, x in y is True if and only if there is a non-negative integer index *i* such that x is y[i] or x == y[i],

Strings and bytes clearly stick out as behaving differently from every built in container type and they deviate from the default implementation in terms of both __iter__ and __getitem__.

And that's fine! The behaviour is very useful. It would be sad if `c in string` was only true if `c` was a single character. My point is that sometimes the protocols and magic methods in Python aren't always in perfectly consistent harmony. Remember that I was responding to this:
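The difference described in the quoted documentation is easy to check. In this sketch, `generic_contains` is a made-up helper that spells out the `any(x is e or x == e for e in y)` rule so it can be compared with what `in` actually does for strings and bytes.

```python
def generic_contains(x, y):
    """Membership as defined for ordinary containers in the language reference."""
    return any(x is e or x == e for e in y)


text = "hello"
letters = list(text)
data = b"hello"

substring_in_str = text[1:3] in text             # substring search
sublist_in_list = letters[1:3] in letters        # element-wise comparison
subbytes_in_bytes = data[1:3] in data            # substring search again
generic_on_str = generic_contains(text[1:3], text)
```

Only the string and bytes cases treat a slice of the container as a member; applying the generic rule to the same string gives the list-like answer instead.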
Conceptually, we should be able to reason that every object that supports indexing should be iterable, without adding a special case exception "...except for str".
We already have a special case exactly like that and it's a good thing, so it wouldn't be outrageous to add another.
Are you implying that developers are wrong when they iterate over strings?
Roughly, though I think you might be hearing me wrong. There is lots of existing code that correctly and intentionally iterates over strings. And code that unintentionally does it probably doesn't live for long. But if you took a random sample of all the times that someone has written code that creates new behaviour which iterates over a string, most of them would be mistakes. And essentially the developer was 'wrong' in those instances. In my case, since I can't think of when I've needed to iterate over a string, I've probably been wrong at least 90% of the time.
Does it matter in any case?
Yes, because it wastes people's time and energy debugging.
Strings must be defined in Python in some way.
We can choose to define them differently.
The implementation, the syntax, and the semantics of strings are coherent in Python.
They are not entirely coherent, as I have explained, and they do not have to meet any particular standard of coherence.
Ultimately, it does [not] matter how many people iterate on strings. That is not the question.
It matters a lot, I don't know why you assert that.
And in the face of ambiguity, refuse the temptation to guess. I do think it would be a pity if strings broke the tradition of indexable implies iterable, but "A Foolish Consistency is the Hobgoblin of Little Minds". The benefits in helping users when debugging would outweigh the inconsistency and the minor inconvenience of adding a few characters. Users who are expecting iteration to work because indexing works will quickly get a helpful error message and fix their problem. At the risk of overusing classic Python sayings, Explicit is better than implicit. However, we could get the benefit of making debugging easier without having to actually break any existing code if we just raised a warning whenever someone iterates over a string. It doesn't have to be a deprecation warning and we don't need to ever actually make strings non-iterable.
I do not agree at all.
What do you not agree with? Do you think it's more than a minor inconvenience to add ".chars()" here and there? Do you think that the benefits to debugging would be minor? Do you think that the inconsistency would significantly hurt users? I haven't seen an argument for any of these and I don't know if anything else I said was debatable.
It is not a question of right or wrong, better or worse. It is a question of being consistent.
Why would that be the question? Why is consistency more important than "better or worse"? How can you make such a bold claim?
On Tue, Feb 25, 2020 at 3:21 AM Alex Hall <alex.mojaki@gmail.com> wrote:
It is not a question of right or wrong, better or worse. It is a question of being consistent.
Why would that be the question? Why is consistency more important than "better or worse"? How can you make such a bold claim?
Inconsistency leads to ridiculous situations where things change from working to nonworking when you make what should be an insignificant change. Consider:

// JavaScript
let obj = {
    attr: 1,
    method: function() {return this.attr;},
}
obj.method() // returns 1
[obj.method][0]() // returns undefined

// PHP, older versions - fortunately fixed
function get_array() {return array(1, 2, 3);}
$arr = get_array();
echo $arr[0]; // Fine
echo get_array()[0]; // Broken until PHP 5.something

# Python
nan = float("nan")
nan == nan # False
[nan] == [nan] # True

Go ahead, explain these differences to a newish programmer of each language. Explain why these behave differently.

Now would you like to explain why, with strings, you can say s[0], s[1], s[2] etc, but you can't iterate over it, either with a 'for' loop or with any other construct that iterates (for instance, you can't say a,b,c = s because that is done with iteration). Especially, explain why you can do this with literally every other subscriptable type in the core language (and most from third-party classes too), and it's only strings that are bizarre. Oh but they *aren't* bizarre in older versions, so there'll be code on the internet that works that way, just don't do it any more. Have fun explaining that.

Consistency *is* valuable.

ChrisA
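The Python example above comes from list comparison checking identity before equality. A small sketch, with variable names invented for illustration, makes that visible and also shows the `a, b, c = s` unpacking that relies on string iteration today.

```python
nan = float("nan")

nan_equals_itself = nan == nan                              # NaN never equals NaN
same_object_lists = [nan] == [nan]                          # same object in both lists
distinct_object_lists = [float("nan")] == [float("nan")]    # two different NaN objects

# Unpacking is done by iteration, so it works on a str today.
s = "xyz"
a, b, c = s
```

The first list comparison succeeds only because both lists hold the very same object; with two separately created NaN values the lists compare unequal.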
Of course consistency is valuable. I am asking how it is automatically more valuable than "better or worse", which isn't an idea that makes sense to me. Consistency isn't axiomatically valuable, it's valuable for reasons such as what you've explained which ultimately boil down to what's better or worse. Those reasons can be outweighed by other considerations. Consistency is not *infinitely* valuable.
Here's how I think the journey for a beginner discovering this stuff would go. First they run some code that iterates directly over a string. And doing so works, because I'm not advocating removing iterability at any point. But a warning with some message about iterating over a string shows up. It suggests using `.chars()`, which they do, and the warning goes away. That would probably be the end of it.
From the assumptions in this scenario, we're talking about a beginner - specifically one who might have trouble understanding the kinds of things we're discussing, and who has never iterated over a string before (which, if I am to be generous to your side, is supposedly a common activity). I don't think it's likely that they will have drawn the connection yet between indexing and iteration.
But let's assume they have, and they're curious why they got a warning. Or more likely, let's assume that they don't immediately figure out what to do from the warning message. The obvious next step is to search for the message online. This is a specific warning regarding iterating over strings, so the results will specifically be about this topic. Through Google's algorithm and Stack Overflow's votes, they will probably land on an explanation that is widely regarded as one of the best online, so probably better than what I'm about to offer. But here's what I would write:
"Iterating through a string has always meant iterating over its characters. But there aren't many situations where this is useful. In most cases iterating over a string directly is a sign that a mistake has been made, such as passing a string to a function instead of a list containing one string. This often leads to mysterious behaviour that can be hard to understand and fix. To help users find these mistakes more quickly, the warning was added in recent versions of Python, along with the chars() method to signal to both human readers and Python the intent to iterate over characters. This makes strings a gentle exception to the common convention that indexing and iteration have matching behaviour."
To be fair, I think many users will not initially understand the explanation. But they will move on with their lives. Eventually though, they will accidentally iterate over a string, and the warning will appear again, and this time it will probably be helpful to them. Then, whether or not they remember the explanation, or even if they didn't read it in the first place, they will probably start to understand why that warning exists.
And that's the 'worst case' scenario from the perspective of your comment. I think the most likely scenario is that they will see the warning in a helpful context before they try iterating over a string, so they will have some experience to help them understand before they get a chance to wonder about the inconsistency.
So in a nutshell:
1. I think that users being confused about this inconsistency is not going to harm them as much as you seem to be claiming.
2. I think that users will learn to understand the reasoning behind inconsistency from practical experience, more than any explanation.
By contrast, the examples you've given are definitely harmful and difficult to understand. I've been personally baffled by the first example <https://stackoverflow.com/questions/37230877/why-cant-i-store-a-reference-to-elem-show-and-call-it-later>. But the real problem with them is deeper than inconsistency. The problem is that they are surprising, confusing, and difficult to debug.
Which is exactly the kind of bug that the proposed warning is meant to defend against, and definitely not the kind of problem that will be caused by seeing said warning and contemplating inconsistencies in the data model.
On Tue, Feb 25, 2020 at 5:52 AM Alex Hall <alex.mojaki@gmail.com> wrote:
Of course consistency is valuable. I am asking how it is automatically more valuable than "better or worse", which isn't an idea that makes sense to me. Consistency isn't axiomatically valuable, it's valuable for reasons such as what you've explained which ultimately boil down to what's better or worse. Those reasons can be outweighed by other considerations. Consistency is not *infinitely* valuable.
Thing is, "better or worse" could mean an extremely marginal difference, whereas "consistent or inconsistent" is a fairly sharp distinction. So it's a question of HOW MUCH worse it's okay to be, to avoid being inconsistent. And actually it's okay to be quite a lot worse in a vacuum, if it's better for consistency.
But let's assume they have, and they're curious why they got a warning. Or more likely, let's assume that they don't immediately figure out what to do from the warning message. The obvious next step is to search for the message online. This is a specific warning regarding iterating over strings, so the results will specifically be about this topic. Through Google's algorithm and Stack Overflow's votes, they will probably land on an explanation that is widely regarded as one of the best online, so probably better than what I'm about to offer.
Remember that this proposal is not for a new feature, but for a CHANGE to existing behaviour. So there'll be lots of code out there that happily iterates over strings. Also remember that iterating does not only mean "for x in y", but can be seen in various other forms:
    a, b, c = "foo"
    list("foo")
    dict.fromkeys("spam", "ham")
Getting a warning from one of these lines of code will not necessarily explain the problem nor the correct solution.
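To make the point about implicit iteration concrete, the sketch below runs the three forms from the message plus two more (`str.join` and `set`), none of which contains a `for` statement. The variable names are illustrative only.

```python
s = "foo"

# Tuple unpacking pulls items out of the string by iterating over it.
a, b, c = s

# Constructors that accept any iterable iterate over the characters.
as_list = list(s)
as_set = set(s)

# dict.fromkeys turns each character into a key.
keys = dict.fromkeys("spam", "ham")

# str.join iterates over its argument, even when that argument is a string.
joined = "-".join(s)

print(a, b, c, as_list, sorted(as_set), keys, joined)
```

Any warning tied to string iteration would have to fire at each of these call sites, which is why the wording of the message matters.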
So in a nutshell:
1. I think that users being confused about this inconsistency is not going to harm them as much as you seem to be claiming. 2. I think that users will learn to understand the reasoning behind inconsistency from practical experience, more than any explanation.
Even when there is a wealth of code out there that will become broken by this change? It's creating inconsistency across data types within a single Python version AND creating temporal inconsistency across Python versions with a single operation.
By contrast, the examples you've given are definitely harmful and difficult to understand. I've been personally baffled by the first example. But the real problem with them is deeper than inconsistency. The problem is that they are surprising, confusing, and difficult to debug. Which is exactly the kind of bug that the proposed warning is meant to defend against, and definitely not the kind of problem that will be caused by seeing said warning and contemplating inconsistencies in the data model.
The JS one gets even more fun when you consider that arrow functions don't behave that way. Believe me, these kinds of inconsistencies DO crop up, and they DO impact my day-to-day life (not the PHP one fortunately, I left that abomination behind years ago - but the JS function issue comes up at least every few weeks). Don't underestimate the cost of this kind of flaw, where making an insignificant change suddenly breaks the code. ChrisA
On Mon, Feb 24, 2020 at 9:12 PM Chris Angelico <rosuav@gmail.com> wrote:
On Tue, Feb 25, 2020 at 5:52 AM Alex Hall <alex.mojaki@gmail.com> wrote:
Of course consistency is valuable. I am asking how it is automatically more valuable than "better or worse", which isn't an idea that makes sense to me. Consistency isn't axiomatically valuable, it's valuable for reasons such as what you've explained which ultimately boil down to what's better or worse. Those reasons can be outweighed by other considerations. Consistency is not *infinitely* valuable.
Thing is, "better or worse" could mean an extremely marginal difference, whereas "consistent or inconsistent" is a fairly sharp distinction. So it's a question of HOW MUCH worse it's okay to be, to avoid being inconsistent. And actually it's okay to be quite a lot worse in a vacuum, if it's better for consistency.
This response honestly seems to ignore most of the paragraph that it's responding to. It being a sharp distinction doesn't matter because consistency isn't axiomatically valuable. Its value has to be justified in context by real consequences, like with the JS example. Saying "it's okay to be quite a lot worse if it's better for consistency" is just blindly worshipping consistency. I don't know what "worse in a vacuum" means.
Also remember that iterating does not only mean "for x in y", but can be seen in various other forms:
    a, b, c = "foo"
    list("foo")
    dict.fromkeys("spam", "ham")
Getting a warning from one of these lines of code will not necessarily explain the problem nor the correct solution.
Which is why I suggested a comprehensive message like: "Strings are not iterable - you cannot loop over them or treat them as a collection. Perhaps you meant to use string.chars(), string.split(), or string.splitlines()?" We can reword this or expand it to include more cases, it would definitely require some careful thought. I think mentioning unpacking is important, particularly for when someone forgets to use .items() when iterating over a dict. And no, it wouldn't be perfect, but trying to figure out the problem when you've got an exact line of code and a specific googleable message that provides multiple possible causes is a much less harmful experience than trying to understand why your code is silently acting bonkers.
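As a rough illustration of what such a warning could look like in practice, here is a minimal sketch built on a `str` subclass. Everything in it is hypothetical: `WarnStr`, `StringIterationWarning` and the `chars()` method do not exist in Python, and the message text is simply copied from the suggestion above. Iteration still works; it only becomes noisy.

```python
import warnings


class StringIterationWarning(UserWarning):
    """Hypothetical warning category; not part of Python."""


class WarnStr(str):
    """Hypothetical str subclass that warns when iterated directly."""

    def __iter__(self):
        warnings.warn(
            "Strings are not iterable - you cannot loop over them or treat "
            "them as a collection. Perhaps you meant to use string.chars(), "
            "string.split(), or string.splitlines()?",
            StringIterationWarning,
            stacklevel=2,  # point at the caller, not at this method
        )
        return super().__iter__()

    def chars(self):
        # Explicit request for characters: iterate a plain str, no warning.
        return iter(str(self))


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    direct = list(WarnStr("abc"))            # warns
    explicit = list(WarnStr("abc").chars())  # silent

print(direct, explicit, len(caught))
```

Because `list()`, unpacking and `for` all go through `__iter__`, a single override covers every form of implicit iteration in this sketch.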
Even when there is a wealth of code out there that will become broken by this change?
Not broken, just noisy.
It's creating inconsistency across data types within a single Python version AND creating temporal inconsistency across Python versions with a single operation.
As I said before, this holds no weight unless you expand on some actual consequences of these inconsistencies.
On Mon, Feb 24, 2020 at 4:08 PM Alex Hall <alex.mojaki@gmail.com> wrote:
Even when there is a wealth of code out there that will become broken by this change?
Not broken, just noisy.
Noisy *IS* broken! In some ways it's one of the worst kinds of broken. Working with libraries that import other libraries and trigger warnings that have nothing to do with MY code is super annoying. I know this is a thorny question, since too-quiet means things don't change, and too noisy means users get warnings they are helpless to deal with.
-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
On Tue, Feb 25, 2020 at 8:16 AM David Mertz <mertz@gnosis.cx> wrote:
On Mon, Feb 24, 2020 at 4:08 PM Alex Hall <alex.mojaki@gmail.com> wrote:
Even when there is a wealth of code out there that will become broken by this change?
Not broken, just noisy.
Noisy *IS* broken!
In some ways it's one of the worst kinds of broken. Working with libraries that import other libraries and trigger warnings that have nothing to do with MY code is super annoying. I know this is a thorny question, since too-quiet means things don't change, and too noisy means users get warnings they are helpless to deal with.
Exactly. Also, there are two possible futures: either it eventually becomes an error (in which case it will be broken in a more direct sense), or it remains permanently a noise (in which case the correct thing to do will be to silence the silly meaningless warning and just go about your daily life). Take your pick - which one are you advocating? ChrisA
Just add the appropriate code to filter that category of warnings. I think you have the option of two lines of Python code or one environment variable.
On 2/24/2020 4:20 PM, Alex Hall wrote:
Just add the appropriate code to filter that category of warnings. I think you have the option of two lines of Python code or one environment variable.
I can assure you that many users will not know how to do either of those things. Eric
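For readers unfamiliar with the two options mentioned, the sketch below shows roughly what they look like with the standard `warnings` module. `StringIterationWarning` is a hypothetical stand-in for whatever category the proposed warning would use, and the environment-variable line uses a built-in category purely as an example.

```python
import warnings


class StringIterationWarning(UserWarning):
    """Hypothetical category standing in for the proposed warning."""


def demo():
    """Return the messages that survive an 'ignore' filter on one category."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Option 1: the "two lines of Python" are `import warnings` plus this
        # call, run before the noisy code executes.
        warnings.filterwarnings("ignore", category=StringIterationWarning)
        warnings.warn("noisy", StringIterationWarning)   # suppressed
        warnings.warn("still shown", UserWarning)        # not suppressed
    return [str(w.message) for w in caught]


# Option 2: an environment variable set outside the program, for example
#   PYTHONWARNINGS="ignore::DeprecationWarning" python app.py
print(demo())
```

The filter only silences the named category and its subclasses, so unrelated warnings keep appearing.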
On 24/02/2020 21:07, Alex Hall wrote:
This response honestly seems to ignore most of the paragraph that it's responding to. It being a sharp distinction doesn't matter because consistency isn't axiomatically valuable.
Actually I think it is. Or more precisely, I think inconsistency axiomatically has negative value. An inconsistency breaks your expectation of how a language works. Each inconsistency creates a special case that you simply have to learn in order to use the language. The more inconsistencies you have, the more of those exceptions you have to know, and the harder the language is to learn. Just consider how hard English is to learn as an adult, and notice just how much of the language is inconsistency after inconsistency.
-- Rhodri James *-* Kynesim Ltd
Is there a reason mypy could not assume that all AtomicStr methods that return strings actually return an AtomicStr, without impacting runtime behavior...? Maybe it's not possible and I'm just not familiar enough with the behavior of the type checkers.
I don't know, but I could see that being problematic if parts of a project expect strings to be iterable and some expect them to be atomic. If mypy assumes `isinstance(obj, Iterable)` returns false on `str` then it's not really helping in the case where `obj: Union[str, Iterable[str]]`. And while I don't really know much about mypy, I do know it understands stuff like `if isinstance`; it seems like it would take tremendous hackery to get it to understand that when `isinstance(obj, Iterable)` returns True, you still can't pass that object to a function that consumes an iterable without also checking `not isinstance(obj, (str, bytes))`. assert """
In practice this would be a very odd decision given that the definition of Iterable is "has an __iter__". And there are plenty of times people will find the resulting behavior surprising since str DOES have an __iter__ method and there are plenty of times you might want to iterate on sequences and strs in the same context.
""" in set_of_draw_backs On Tue, Feb 25, 2020 at 4:28 AM Rhodri James <rhodri@kynesim.co.uk> wrote:
On 24/02/2020 21:07, Alex Hall wrote:
This response honestly seems to ignore most of the paragraph that it's responding to. It being a sharp distinction doesn't matter because consistency isn't axiomatically valuable.
Actually I think it is. Or more precisely, I think inconsistency axiomatically has negative value. An inconsistency breaks your expectation of how a language works. Each inconsistency creates a special case that you simply have to learn in order to use the language. The more inconsistencies you have, the more of those exceptions you have to know, and the harder the language is to learn. Just consider how hard English is to learn as an adult, and notice just how much of the language is inconsistency after inconsistency.
-- Rhodri James *-* Kynesim Ltd
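On the `isinstance(obj, Iterable)` point raised in the mypy message above, the following sketch shows the runtime behaviour being described: a `str` passes the `Iterable` check, so code that accepts `Union[str, Iterable[str]]` has to rule strings out first. The function name `normalize` is made up for this example.

```python
from collections import abc
from typing import Iterable, List, Union


def normalize(obj: Union[str, Iterable[str]]) -> List[str]:
    """Return a list of strings whether given one string or many."""
    # A str passes the Iterable check below, so it must be handled first;
    # otherwise "abc" would become ["a", "b", "c"].
    if isinstance(obj, str):
        return [obj]
    if isinstance(obj, abc.Iterable):
        return list(obj)
    raise TypeError(f"expected str or iterable of str, got {type(obj).__name__}")


str_is_iterable = isinstance("abc", abc.Iterable)
print(str_is_iterable, normalize("abc"), normalize(["a", "b"]))
```

Swapping the order of the two checks would silently split the string into characters, which is the kind of mistake the thread is arguing about.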
On Mon, Feb 24, 2020 at 3:01 PM Alex Hall <alex.mojaki@gmail.com> wrote:
Snip
From the assumptions in this scenario, we're talking about a beginner - specifically one who might have trouble understanding the kinds of things we're discussing, and who has never iterated over a string before (which, if I am to be generous to your side, is supposedly a common activity).
Context: I recall a conversation on the edu-sig group (Python in education) about teaching beginners. I was arguing that, when teaching in a visual-based environment (think Karel the robot or the turtle module), when introducing loops, it would be useful to have something like "repeat n: move()" instead of "for irrelevant_name in range(n): move()". (This is something I have implemented in Reeborg's world and that has also been independently added in TigerJython)
Someone (I believe it was Laura Creighton - sorry, I cannot find the link right now) mentioned that the very first example of loops they used was something like the following:
    for letter in "some word":
        print(letter)
If I recall correctly, quite a few other people teaching beginners also mentioned that this was one of the first, if not the first example they used. In fact, I think I was in the minority in not using this type of iteration over strings as an early example of loops.
So, I would argue that iterating over strings with beginners is something much more common than what you appear to believe.
André Roberge
I'm currently creating my own FOSS platform/course for teaching Python to beginners: https://github.com/alexmojaki/python_init (sorry, it's very new and under construction so there's no docs yet). The content moves to for loops relatively quickly, and for quite a while the only thing they loop over is strings. It's a great way to make all sorts of fun exercises and build foundations. So I'm aware of its educational value, and I know this would be a pity in that sense.
I would simply update my course to start with:
    for letter in "some word".chars():
        print(letter)
and explain that ".chars() simply means that you want the characters of the string - don't worry about what the . and () mean for now" (although I'm against requiring () anyway).
When they inevitably forget to include .chars() somewhere, they will immediately get a warning telling them to put it back, and since they've seen chars() used before, the warning will be easy enough to understand. By the time they see indexing used on strings in my course, chars() will be an old friend.
The journey I've presented is one where the user hasn't already been shown iteration over a string and discovers it for themselves, because that is the scenario which seems most likely to cause confusion and thus it's the most favourable to Chris' argument and the most damaging to mine.
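Since `str` has no `chars` method today, the teaching example cannot be run as written. The sketch below approximates the idea with a standalone helper, assuming the "read-only view" from the original proposal can be modelled as a `collections.abc.Sequence`. `CharsView`, `chars` and `ords` are hypothetical names, and `ords` returns a plain list here, not a view.

```python
from collections.abc import Sequence


class CharsView(Sequence):
    """Hypothetical read-only view of a string's characters."""

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def __getitem__(self, index):
        result = self._text[index]
        # Slices give another view; integer indexes give a 1-character str.
        return CharsView(result) if isinstance(index, slice) else result

    def __repr__(self):
        return f"CharsView({self._text!r})"


def chars(text):
    """Stand-in for the proposed str.chars() method."""
    return CharsView(text)


def ords(text):
    """Stand-in for the proposed str.ords(); a list, not a view."""
    return [ord(ch) for ch in text]


for letter in chars("some word"):
    print(letter)
```

The `Sequence` base class supplies iteration, `in`, `reversed()`, `index()` and `count()` from the two methods defined here, so the view behaves like the string does when treated as a collection.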
On Mon, 24 Feb 2020 at 16:23, Alex Hall <alex.mojaki@gmail.com> wrote:
Roughly, though I think you might be hearing me wrong. There is lots of existing code that correctly and intentionally iterates over strings. And code that unintentionally does it probably doesn't live for long. But if you took a random sample of all the times that someone has written code that creates new behaviour which iterates over a string, most of them would be mistakes. And essentially the developer was 'wrong' in those instances. In my case, since I can't think of when I've needed to iterate over a string, I've probably been wrong at least 90% of the time.
Conversely, I can't remember a case where I've ever accidentally iterated over a string when I meant not to. I *can* remember many times when I've relied on strings being iterable.
Me quoting my experience is neither more nor less valuable than you quoting yours. However, mine aligns with the current behaviour of Python, and keeping things as they are so that my experience doesn't change has no backward compatibility implications. So I don't need to justify a preference that the language does what suits me best :-)
I don't think (I haven't checked the thread closely) that anyone is saying your experience/expectation is "wrong". But to be sufficient to result in a change in the language, you have to establish that your behaviour is significantly *better* than the status quo, and I don't think that you're doing that at the moment. And IMO, you're never likely to do so simply by quoting numbers of people who do or don't prefer the current behaviour - compelling arguments are typically around demonstrating how much code would be demonstrably better with the new behaviour, along with showing that code that is detrimentally affected has an easy workaround. Your `.chars()` proposal targets the latter question, but neither you, nor anyone else in past iterations of this discussion, have yet come up with anything persuasive for the former, that I'm aware of.
Paul
Paul Moore wrote:
[...] you have to establish that your behaviour is significantly better than the status quo, and I don't think that you're doing that at the moment. And IMO, you're never likely to do so simply by quoting numbers of people who do or don't prefer the current behaviour - compelling arguments are typically around demonstrating how much code would be demonstrably better with the new behaviour, along with showing that code that is detrimentally affected has an easy workaround. Your .chars() proposal targets the latter question, but neither you, nor anyone else in past iterations of this discussion, have yet come up with anything persuasive for the former, that I'm aware of. Paul
Yeah, I tiresomely agree (can I say that in English?) Once again... we need evidence: real workable surprising code! Please...
Conversely, I can't remember a case where I've ever accidentally iterated over a string when I meant not to.
Do you ever return a string from a function where you should have returned a list containing one string? Or similarly passed a string to a function? Forgotten to put a trailing comma in a singleton tuple? Forgotten to add .items() to `for key, value in kwargs:`?
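The mistakes listed above can be reproduced in a few lines. This is an illustrative sketch with made-up function names, showing what current Python does in each case: nothing fails at the point of the mistake, and the string is quietly split into characters.

```python
def get_names():
    # Mistake: a bare string where a one-element list was intended.
    return "alice"


# The caller loops over "names" and gets single characters instead.
names_seen = [name for name in get_names()]

# Mistake: missing trailing comma, so this is a str, not a 1-tuple.
not_a_tuple = ("alice")
items = list(not_a_tuple)


def show(**kwargs):
    pairs = []
    # Mistake: iterating the dict yields keys, and each key string is
    # then unpacked character by character into key and value.
    for key, value in kwargs:  # intended: kwargs.items()
        pairs.append((key, value))
    return pairs


two_char_keys = show(ab=1, cd=2)

try:
    show(abc=1)
    unpack_failed = False
except ValueError:
    # Keys that are not exactly two characters long fail to unpack.
    unpack_failed = True

print(names_seen, items, two_char_keys, unpack_failed)
```

The `kwargs` case is the most misleading of the three: with two-character keys it runs without error and returns pairs of letters.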
compelling arguments are typically around demonstrating how much code would be demonstrably better with the new behaviour
That represents a misunderstanding of my position. I think I'm an outlier among the advocates in this thread, but I do not believe that implementing any of the ideas in this proposal would significantly affect code that lives in the long term. Some code would become slightly better, some slightly worse. My concern surrounds the user experience when debugging code that accidentally iterates over a string. So it's impossible for me to show you code that becomes significantly better because that's not what I'm arguing about, and it's unfair to say that quoting people who have struggled with these bugs is not evidence for the problem. Similarly for jdveiga, I cannot give you "real workable surprising code" because I'm talking about code that isn't workable as a result of being surprising. I have given examples of real non-working surprising code, and I can give more, and if that's not indicative of a real problem then I'm very confused and would appreciate more explanation.
Alex Hall wrote:
I have given examples of real non-working surprising code, and I can give more, and if that's not indicative of a real problem then I'm very confused and would appreciate more explanation.
If you can provide real code showing strings misbehaving, I will be convinced. On the contrary, you have provided two examples --as far as I can remember-- of the expected and implemented behaviour of strings in Python. There is nothing wrong with the language implementation, just your desire that things work in a different manner -- but this does not suffice to change the foundations of any programming language: start your own language if you feel so disappointed; Guido did. I am really eager to be convinced. Please, show us a snippet that proves your point of view. If you cannot, accept that Python's string model is just a convention and that programming languages are purely conventional. Computation is not about Python, or Lisp, or Java; it is about algorithms.
On Mon, 24 Feb 2020 at 20:13, Alex Hall <alex.mojaki@gmail.com> wrote:
Conversely, I can't remember a case where I've ever accidentally iterated over a string when I meant not to.
Do you ever return a string from a function where you should have returned a list containing one string? Or similarly passed a string to a function? Forgotten to put a trailing comma in a singleton tuple? Forgotten to add .items() to `for key, value in kwargs:`?
Not that I remember - that's what I said, basically. No, I'm not perfect (far from it!) but I don't recall ever hitting this issue.
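For readers who have not run into these cases, the mistakes listed in the quoted question can be reproduced in a few lines. This is only an illustrative sketch: the names `count_tags`, `not_a_tuple`, `real_tuple`, `kwargs` and `pairs` are made up for the example and do not come from the thread.

```python
def count_tags(tags):
    """Count the items in an iterable of tag strings."""
    return sum(1 for _ in tags)


# Passing a list of one string behaves as intended.
one = count_tags(["python"])

# Passing the bare string silently iterates over its characters instead.
six = count_tags("python")

# A missing trailing comma turns a would-be singleton tuple into a plain string.
not_a_tuple = ("python")
real_tuple = ("python",)

# Forgetting .items() iterates over the keys, and each key (a string) is then
# unpacked character by character. This only "works" for two-character keys;
# other key lengths raise ValueError.
kwargs = {"ab": 1, "cd": 2}
pairs = [(key, value) for key, value in kwargs]
```

None of these lines raises an exception, which is the debugging experience being described: the error surfaces later, if at all, far from the line that caused it.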
compelling arguments are typically around demonstrating how much code would be demonstrably better with the new behaviour
That represents a misunderstanding of my position. I think I'm an outlier among the advocates in this thread, but I do not believe that implementing any of the ideas in this proposal would significantly affect code that lives in the long term. Some code would become slightly better, some slightly worse.
I beg to differ.
* Code that chooses to use `.chars()` would fail to work on versions of Python before whatever version implemented this (3.9? 3.10?). That makes it effectively unusable in libraries for years to come.
* If you make iterating over strings produce a warning before `.chars()` is available as an option for any code that would be affected, you're inflicting a warning on all of that code.
* A warning that will never become an error is (IMO) unacceptable. It's making it annoying to use a particular construct, but with no intention of ever doing anything beyond annoying people into doing what you want them to do.
* A warning that *will* become an error just delays the problem - let's assume we're discussing the point when it becomes an error.
As a maintainer of pip, which currently still supports Python 2.7, and which will support versions of Python earlier than 3.9 for years yet, I'd appreciate it if you would explain what pip should do about this proposed change. (Note: if you suggest just suppressing the warning, I'll counter by asking you why we'd ever remove the code to suppress the warning, and in that case what's the point of it?)
And pip is an application, so easier. What about the `packaging` library? What should that do? In that case, modifying global state (the warning config) when the library is imported is generally considered bad form, so how do we protect our users from this warning being triggered by our code? Again, we won't be able to use `.chars()` for years.
Boilerplate like

    import sys

    if sys.version_info >= (3, 9):
        def chars(s):
            return s.chars()
    else:
        def chars(s):
            return s

would be an option, but that's a lot of clutter for every project to add for something that *isn't a problem* - remember, long-running, well-maintained libraries with a broad user base will likely have already flushed out any bugs that might result from accidentally iterating over strings. And these days, projects often use mypy which will catch such errors as well. So this is literally useless boilerplate for them.
My concern surrounds the user experience when debugging code that accidentally iterates over a string. So it's impossible for me to show you code that becomes significantly better because that's not what I'm arguing about, and it's unfair to say that quoting people who have struggled with these bugs is not evidence for the problem.
OK. That's a fair point. But why can't we find other approaches? Type checking with mypy would catch returning a string when it should be a list of strings. Same with all of your other examples above. How was your experience suggesting mypy for this type of problem? I suspect that, as you are talking about beginners, you didn't inflict anything that advanced on them - is there anything that could be done to make mypy more useful in a beginner context?
I would like to reiterate a point that I think is very important and many people seem to be brushing aside. We don't have to *break* existing code. We can get a lot of value, at least in terms of aiding debugging, just by adding a warning.
Years of experience maintaining libraries and applications have convinced me that warnings can cause as much "breakage" as any other change. Just saying "you can suppress them" doesn't make the problem go away. And warnings that are suppressed by default are basically pointless, as people just ignore them. That's not to say that warnings are useless - just that introducing *new* warnings needs to be treated just as seriously as any other change. Paul
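To make the point about global warning state concrete, here is a rough sketch of what locally scoped suppression looks like with the standard `warnings` module. `StringIterationWarning`, `iterate_chars` and `library_function` are hypothetical stand-ins; Python defines no such warning, and nothing in the thread specifies one.

```python
import warnings


class StringIterationWarning(DeprecationWarning):
    """Hypothetical category; Python defines no such warning."""


def iterate_chars(s):
    """Stand-in for library code that iterates over a string on purpose."""
    warnings.warn("iterating over str", StringIterationWarning, stacklevel=2)
    return list(s)


def library_function(s):
    # Suppress the warning only for the duration of this call, rather than
    # changing the filters globally at import time.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", StringIterationWarning)
        return iterate_chars(s)


# Record what a caller would see: the locally suppressed call is silent,
# the unsuppressed call still warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    quiet = library_function("ab")
    noisy = iterate_chars("ab")
```

Even in this form every affected call site needs its own wrapper, and `catch_warnings` temporarily changes module-level state, so it is not a free fix; that is arguably the same clutter the message objects to.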
I agree, a warning that is never converted to an error indicates that this is more about style than behavior (and in that sense it is use case specific). It would also be annoying for people that intentionally iterate over strings and find this a useful feature.
So this sounds more like the job for a linter or, because it's dealing with types, a type checker. So what about the compromise that for example mypy added a flag to treat strings as atomic, i.e. then it would flag usage of strings where an iterable or a sequence is expected. Would that solve the problem?
So I do not have extensive experience with mypy but I don't see how it would help. The entire issue is that `str` is an instance of `Iterable[str]`, so how is mypy going to catch my error of passing a single string instead of an iterable of strings to a function?
However the ability to distinguish between `str` and `Iterable[str]` is important. I often write functions that operate on a scalar or an iterable of scalars and I always need to special case `str`. Now you might argue that my life would be simplified if I required everything just be an iterable and you would be right. However, it would still leave me with strange errors when passing single strings and would decrease usability. Consider the behavior of `__slots__`: I for one think it's great that `__slots__ = "foo"` does what it does. I think it would be bad from a usability standpoint to require a trailing comma, even if it would simplify the life of the interpreter.
I like the idea of an `AtomicString` but it doesn't really help unless IO and string literals return `AtomicString` instead of `str` (which is just as breaking as changing `str`).
I agree that noisy is broken and that a warning that never becomes an error is probably a bad idea.
While I am firmly in the camp that this is a problem I am not sure if it's a problem that should be fixed. Any solution will break code, there is no way around it. However, I think one possible solution (that has its own set of drawbacks) would be to simply not register `str` as a subclass of `Iterable`. This would allow for the following pattern:
```
if isinstance(obj, Iterable):
    ...  # iterable case
else:
    ...  # scalar case
```
Where currently the following is required:
```
if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
    ...  # iterable case
else:
    ...  # scalar case
```
Yes this would break code, but it would break a lot less code than actually changing the behavior of `str`.
So I do not have extensive experience with mypy but I don't see how it would help. The entire issue that is that `str` is an instance of `Iterable[str]` so how is mypy going to catch my error of passing a single string instead of an iterable of strings to a function?
...
I like the idea of an `AtomicString` but it doesn't really help unless IO and string literals return `AtomicString` instead of `str` (which is just as breaking as changing `str`).
Is there a reason mypy could not assume that all AtomicStr methods that return strings actually return an AtomicStr, without impacting runtime behavior...? Maybe it's not possible and I'm just not familiar enough with the behavior of the type checkers.
I agree that noisy is broken and that a warning that never becomes an error is probably a bad idea.
While I am firmly in the camp that this is a problem I am not sure if it's a problem that should be fixed. Any solution will break code, there is no way around it. However, I think one possible solution (that has its own set of drawbacks) would be to simply not register `str` as a subclass of `Iterable`. This would allow for the following pattern:
```
if isinstance(obj, Iterable):
    ...  # iterable case
else:
    ...  # scalar case
```
Where currently the following is required:
```
if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
    ...  # iterable case
else:
    ...  # scalar case
```
Yes this would break code, but it would break a lot less code than actually changing the behavior of `str`.
In practice this would be a very odd decision given that the definition of Iterable is "has an __iter__". And there are plenty of times people will find the resulting behavior surprising since str DOES have an __iter__ method and there are plenty of times you might want to iterate on sequences and strs in the same context. But on the surface at least this seems like it could be much easier to implement than all the AtomicStr stuff. Of course you are correct it will break statically typed code, but that really seems like a much easier pill to swallow than breaking runtime code. I think it's an idea worth exploring.
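As a small illustration of the current situation, the sketch below shows that `str` passes the runtime `Iterable` check today (because it defines `__iter__`), and how a scalar-or-iterable helper ends up special-casing it. The helper name `as_list` is invented for this example.

```python
from collections.abc import Iterable


def as_list(obj):
    """Normalize a scalar or an iterable of scalars to a list."""
    # str and bytes have to be excluded by hand, because both pass the
    # Iterable check.
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        return list(obj)
    return [obj]


str_is_iterable = isinstance("abc", Iterable)
str_has_iter = hasattr(str, "__iter__")
```

With the guard in place, `as_list("abc")` gives a one-element list rather than a list of characters; dropping the `(str, bytes)` test would change that silently.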
On Sun, Feb 23, 2020 at 10:17:03PM -0000, Alex Hall wrote:
Steven D'Aprano wrote:
Conceptually, we should be able to reason that every object that supports indexing should be iterable, without adding a special case exception "...except for str".
Strings already have an exception in this area. Usually `x in y` means `any(x == elem for elem in y)`.
I don't think that there is anything specific to the `in` operator that demands that it is implemented that way. It is certainly a reasonable default implementation (I think Ruby offers it as a method on a base class) but the `in` operator conceptually provides a containment test which might be far more general than the above:
- number in interval
- location in region
- subset in set
- subtree in tree
- offence in criminal_record
- monster in room
just off the top of my head.
In the case of strings, if the argument is a length-1 string, then your implementation works. If it isn't, a generalisation of it works: rather than iterate over substrings of length 1, iterate over substrings of the same length as the argument. So conceptually we have the string `in` operator being equivalent to:

    any(arg == sub for sub in thestring)

except that iteration is not necessarily in length-1 chunks. (Note that for efficiency reasons, string containment testing may not actually be implemented in that way.)
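The generalisation described above can be written out as a short sketch. The function name `contains` is made up here, and this is only a conceptual model of the behaviour, not how CPython implements substring search.

```python
def contains(thestring, arg):
    """Substring test written as iteration over same-length chunks."""
    n = len(arg)
    # Every substring of thestring that has the same length as arg.
    chunks = (thestring[i:i + n] for i in range(len(thestring) - n + 1))
    return any(arg == sub for sub in chunks)


cases = [("spam", "a"), ("spam", "pa"), ("spam", "ma"),
         ("spam", ""), ("", "x"), ("ab", "abc")]
agrees_with_in = all(contains(h, needle) == (needle in h) for h, needle in cases)
```

For a length-1 argument the chunks are exactly the characters, so this reduces to the `any(x == elem for elem in y)` form quoted earlier.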
Another somewhat related example: we usually accept that basically every object can be treated as a boolean, even more so if it has a `__len__`. But numpy and pandas break this 'rule' by raising an exception if you try to treat an array as a boolean.
Yes, they do, and I think they are wrong to have done so. But they had their reasons, and it's hard to know whether any other alternative would have been better.
-- Steven
Steve Jorgensen wrote:
The only change I am proposing is that the iterability for characters in a string be moved from the string object itself to a view that is returned from a chars() method of the string. Eventually, direct iterability would be deprecated and then removed.
That sounds completely unnecessary to me. Can you provide any example of the superiority of your proposal (compared to the current implementation)? Sorry, but I cannot see your point. Why do we need to convert a sequence into another sequence or sequence-like object? Where is the advantage? Or are you proposing that strings will not be sequences in Python any longer? Why? How? Are they going to be .... what? From my point of view, there is nothing wrong in modelling strings as sequences (or as lists like Lisp, or arrays like C). It has advantages and caveats... like any alternative. However, I think that strings as immutable sequences of characters is a nice and efficient solution in the Python context (like null-terminated arrays are in C). Really... you need to provide an example of what is wrong with strings in Python now and how you propose to solve that.
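For reference, the view being questioned could be approximated today with a small wrapper. This is a rough sketch under stated assumptions: `CharsView` and the free function `chars` are hypothetical names, the proposal itself only says the view should behave as `str` currently does when treated as a collection, and returning a view from slicing is a choice made for this example.

```python
from collections.abc import Sequence


class CharsView(Sequence):
    """Hypothetical read-only view over the characters of a string."""

    def __init__(self, s):
        self._s = s

    def __len__(self):
        return len(self._s)

    def __getitem__(self, index):
        # Slices stay views; integer indexes return one-character strings.
        if isinstance(index, slice):
            return CharsView(self._s[index])
        return self._s[index]


def chars(s):
    """Free-function stand-in for the proposed `str.chars()` method."""
    return CharsView(s)


view = chars("abc")
as_list = list(view)
```

`Sequence` supplies iteration, `in`, `reversed`, `index` and `count` from `__len__` and `__getitem__`, so the wrapper itself stays small. Note that the items are still ordinary one-character `str` objects.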
On Feb 23, 2020, at 12:52, Steve Jorgensen <stevej@stevej.name> wrote:
The only change I am proposing is that the iterability for characters in a string be moved from the string object itself to a view that is returned from a `chars()` method of the string. Eventually, direct iterability would be deprecated and then removed.
I do not want indexing behavior to be moved, removed, or altered, and I am not suggesting that it would/should be.
That would be very weird. Something that acts like a sequence in every way (indexing, slicing, Sequence methods like count, other methods that return indices, etc.) except that it isn’t Iterable doesn’t feel like Python. Python even lets you iterate over “old-style semi-sequences”, things which define __getitem__ to work with a contiguous sequence starting from 0 until they raise IndexError.
I think if you want to move iteration to chars, you’d want to move sequence behavior there too.
Also, I think you’d still want the chars view to iterate a new char type rather than strs or chars views; otherwise you still have the infinite regress problem: it only shows up when you decide to explicitly recurse into str (iterate anything that’s iterable, and iterate chars() on anything that’s a str), but it’s just as bad as the current state when you do; there’s still no way to say “recursively iterate strings, but only down to characters, not infinitely”.
I’m not sure I like the idea in any variation for Python, but a few more points in favor of it:
The chars view would open the door for additional views on strings. See Swift, making you state explicitly whether you want to iterate UTF-8 code units (bytes), UTF-32 code points, or extended grapheme clusters, instead of just picking one and that’s what you get (and the other two require constructing some separate object that copies stuff from the string). After all, a string is an iterable of all of those things; the fact that it happens to be stored as an array of Latin-1 bytes, UCS2 code units, or UTF-32 code points, with a cache of UTF-8 bytes, doesn’t force us to treat it as an iterable of UTF-32 code points; only legacy reasons do. And having a strutf8view could mean that in many apps, all bytes objects are binary data rather than encoded text, which makes the bytes type more semantically meaningful in those apps.
It could also make bridging libraries to languages where strings aren’t iterable more reasonable. For example, IIRC, pyobjc NSString objects today have methods to iterate strs so they can ducktype as strings; if strings weren’t Iterable, they could be much closer to a trivial pure bridge to the ObjC type.
Finally, a bunch of unicodedata functions and so on that only make sense on single characters have to take str today and raise a ValueError or something if passed multiple characters. (There are even some Unicode functions that only make sense on single EGCs, but I think Python doesn’t provide any of them.) Passing a char object, you’d know statically that it makes sense; passing a str object, you don’t.
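The infinite regress mentioned above can be seen in a few lines. This is a minimal sketch; `flatten` is a made-up helper, and the length check is a stand-in for the distinct char type that Python does not have.

```python
from collections.abc import Iterable


def flatten(obj):
    """Recursively yield the leaves of arbitrarily nested iterables."""
    # A one-character string iterates to itself, so without this guard the
    # recursion below would never bottom out.
    if isinstance(obj, str) and len(obj) == 1:
        yield obj
    elif isinstance(obj, Iterable):
        for item in obj:
            yield from flatten(item)
    else:
        yield obj


# The root of the regress: a character is a string that contains itself.
char_iterates_to_itself = list("a") == ["a"]
char_indexes_to_itself = "a"[0][0][0] == "a"

# With the guard, strings are recursed into "down to characters" only.
flat = list(flatten(["ab", ["c", 1]]))
```

Removing the first branch makes `flatten("a")` recurse until `RecursionError`, which is the behaviour a separate char type would avoid by construction.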
On Sun, 2020-02-23 at 20:32 +0000, jdveiga@gmail.com wrote:
Kyle Stanley wrote:
In order for this proposal to be seriously considered, I think it's necessary to cite many realistic examples where the current behavior is <snip> existing solutions, I strongly suspect that this is going to be the case of change that is far too fundamental to Python to be changed at this point, even with a distant deprecation warning and many years of advanced notice regarding the removal/change after that.
I agree. I can barely imagine what is wrong with Python's strings. Can you please provide any example?
The main issue is generic code. Sequences like lists and tuples are often iterated over: You work with the elements inside. However, in the same code, often strings are actually not desired to be handled as sequences. I.e. some code uses strings as sequences of characters, but most code probably uses the meaning that a string is a word, or sentence: the individual character has no useful meaning on its own.
It is a common "pattern" in many languages to walk along strings, letter by letter. Python's strings provide a powerful way of doing it --as a sequence, which is a fundamental type in the language. No need to deal with indexes, terminators, lengths, boundaries, etc. I love Python because of this and hate C's and Java's strings.
On the other hand, what about slices? Since slices operate on sequences and lists, if strings are no longer sequences, how would Python do slicing on strings according to your proposal?
I doubt anyone wants to touch slicing/indexing syntax, although that could lead to a mismatch in what we are used to. I.e. the string is not iterable?:

    for i in range(len(string)):
        character = string[i]

might work where:

    for character in string:
        pass

would tell you to use `for character in string.chars():` instead. If you make it a property rather than a function, you could think about also forcing `string.chars[0]` in the first case.
Coming from NumPy, there is a subtle spectrum:
* strings are primarily scalars for which sequence behaviour is well defined and often convenient
* lists/tuples are typical collections/sequences
* NumPy arrays are more element focused than most collections (operators modify the elements not the container).
We also have some code that has to check for strings explicitly in NumPy (it is not a big issue though).
I am almost starting to wonder if a flag/ABC for each of the three cases could be useful:
* isinstance(string, ScalarCollection) -> True
* isinstance(list_, TypicalCollection) -> True
* isinstance(NumPy_array, ElementwiseCollection) -> True
to distinguish the spectrum. But, I doubt it myself for now. It is not duck-typing and in general what you want can be context dependent. Plus, likely strings are the main ScalarCollection in practice anyway. Polynomials come to mind, but there are only so many of those likely to mix.
Best,
Sebastian
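The flag/ABC idea floated above can be tried out with virtual subclass registration. This is only a sketch under assumptions: `ScalarCollection` and `TypicalCollection` are the names from the message, but no such ABCs exist in Python; registering `bytes` and `tuple` is a choice made for this example; `ElementwiseCollection` is left out so the sketch does not depend on NumPy; and `walk` is a made-up helper.

```python
from abc import ABC


class ScalarCollection(ABC):
    """Hypothetical marker: sequence-like, but usually treated as one value."""


class TypicalCollection(ABC):
    """Hypothetical marker: containers whose elements are the point."""


# Virtual subclass registration; the registered types are not modified.
ScalarCollection.register(str)
ScalarCollection.register(bytes)
TypicalCollection.register(list)
TypicalCollection.register(tuple)


def walk(obj):
    """Yield leaves, descending only into TypicalCollection instances."""
    if isinstance(obj, TypicalCollection):
        for item in obj:
            yield from walk(item)
    else:
        yield obj


leaves = list(walk(["ab", ("cd", [1])]))
```

As the message itself notes, this is not duck typing: a type nobody registered is simply treated as a leaf, which may or may not be what a given caller wants.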
I think strings as immutable strings is indeed a wise implementation decision on Python.
Thank you. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/PETKSM... Code of Conduct: http://python.org/psf/codeofconduct/
Sebastian Berg wrote:
However, in the same code, often strings are actually not desired to be handled as sequences. I.e. some code uses strings as sequences of characters, but most code probably uses the meaning that a string is a word, or sentence: the individual character has no useful meaning on its own.
Humm... It is not a matter of how often. It about the underlying type of strings and the derived behaviour. In Python, they are immutable sequences of characters. It does matter if you use them as characters, as words, as bibles... If you have a list of integers, does it behave differently if it contains a few countries' GDP, the songs in each Metallica's LPs, or the first two hundred primes? No, it does not. It is a list, a sequence, and you can deal with that list in the same ways as you can deal with a string sequence. Because they are both collections. This is fine in most cases and I like to know when it is not.
If you make it a property rather than a function, you could think about also forcing string.chars[0] in the first case.
And once again... what is the advantage of strings.chars[0] over strings[0]? Maybe I am dumb, but I cannot see the difference.
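
For concreteness, a minimal sketch of the kind of read-only view under discussion. `CharsView` and `OrdsView` are hypothetical wrapper classes written for this illustration; `str` has no `chars` or `ords` attribute today, and the proposal does not specify an implementation.

```python
from collections.abc import Sequence


class CharsView(Sequence):
    """Hypothetical read-only view of a string as single-character strings."""

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return CharsView(self._text[index])
        return self._text[index]


class OrdsView(Sequence):
    """Hypothetical read-only view of a string as ord() integers."""

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return OrdsView(self._text[index])
        return ord(self._text[index])


for character in CharsView("abc"):
    print(character)
print(list(OrdsView("abc")))
```

With such a view, `CharsView(s)[0]` and `s[0]` return the same value; the difference is only that the intent to treat the string as characters is spelled out.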
Coming from NumPy, there is a subtle spectrum:
* strings are primarily scalars for which sequence behaviour is well defined and often convenient
* lists/tuples are typical collections/sequences
* NumPy arrays are more element focused than most collections (operators modify the elements not the container).
Nice point. However, we are talking about Python, not NumPy. A nice solution for NumPy is not necessarily a good one for Python. You can list lots of languages and libraries which do things differently than Python but... it does not matter! The question is different: is it right that Python strings are immutable sequences or not? Should we change this or not? Sorry, but to date, I do not see any (Python) example that proves that Python strings need a change like this. We can discuss the nature and essence of strings endlessly but, in my opinion, we need to focus on real code, on real flaws. Thanks for your comments.
On 2/23/20 5:01 PM, Sebastian Berg wrote:
The main issue is generic code. Sequences like lists and tuples are often iterated over: You work with the elements inside.
However, in the same code, often strings are actually not desired to be handled as sequences. I.e. some code uses strings as sequences of characters, but most code probably uses the meaning that a string is a word, or sentence: the individual character has no useful meaning on its own.
And in this case I don't see the string test as that special. In fact, in most cases I would think the structure is normally built from a single type of sequence (like always a list), and anything that isn't a list is a terminal (or maybe some other type of structure that would be iterated differently), so you wouldn't actually be checking for strings, but for lists (and iterating further on them).

-- Richard Damon
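
A minimal sketch of that approach, assuming the nested structure is built only from lists; the function name `leaves` is hypothetical.

```python
def leaves(node):
    """Yield terminal values, descending only into lists."""
    if isinstance(node, list):
        for child in node:
            yield from leaves(child)
    else:
        # Anything that is not a list (strings, tuples, numbers) is a terminal.
        yield node


print(list(leaves(["ab", ["cd", ("e", "f")], 3])))
```

Because the test is for the container type rather than for `str`, strings need no special case here.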
On 24/02/20 9:32 am, jdveiga@gmail.com wrote:
It is a common "pattern" in any language to walk along strings, letter by letter.
Maybe in some other languages, but I very rarely find myself doing that in Python. There is almost always some higher level way of doing what I want.
if strings are no longer sequences, how would Python do slicing on strings
It's quite possible for a type to support slicing but not indexing. -- Greg
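
As an illustration of that point, a minimal sketch of a type that accepts slices but rejects single-item indexing. `SliceOnlyText` is a hypothetical class invented for this example, not an existing or proposed API.

```python
class SliceOnlyText:
    """Hypothetical text wrapper: slicing works, integer indexing does not."""

    def __init__(self, text):
        self._text = text

    def __getitem__(self, key):
        if isinstance(key, slice):
            return SliceOnlyText(self._text[key])
        raise TypeError("SliceOnlyText supports slicing, not single-item indexing")

    def __str__(self):
        return self._text


word = SliceOnlyText("hello")
print(str(word[1:3]))  # slicing returns another SliceOnlyText
```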
Greg Ewing wrote:
On 24/02/20 9:32 am, jdveiga@gmail.com wrote:
It is a common "pattern" in any language to walk along strings, letter by letter.
Maybe in some other languages, but I very rarely find myself doing that in Python. There is almost always some higher level way of doing what I want.
I am not talking about how frequently walking along strings is done. I am just saying that most languages provide a "pattern" (say, a way) of looping over strings (via indexing, via iterators, etc.). So, most languages, in some manner, "use" strings as if they were sequences of characters. Like Python does. Since Python has a sequence type, strings were modelled as character sequence-objects and so they behave. In your opinion, if strings should not be sequences in Python, what should they be?
if strings are no longer sequences, how would Python do slicing on strings
It's quite possible for a type to support slicing but not indexing.
Yeah, of course... do you think that we need to change the current syntax and the underlying string model to accommodate slicing but not indexing? What are we going to gain and what are we going to lose?
I have fairly frequently written some kind of recursive descent into collections. Like many people, I've had to special case strings, which are pseudo-scalar, and which I don't want to descend into. But one thing I don't think I've ever tripped over is descending into single characters, but then wanting those not to be iterable. I understand how code could hit that, but for me the scalar level has always been a whole string, not its characters. Yes, of course I've looped over strings, but always with the knowledge there is no further descent.

Changing strings is just too huge a change in Python semantics. But we could create a new type ScalarString or AtomicString that was "just like string but not iterable." I'm not sure if it would subclass string, but it wouldn't directly change the string type either way. This new type does not need to live in the standard library, but it could. Conversion to and from AtomicString would be on the user, but it's easy enough to code.

On Sat, Feb 22, 2020, 7:28 PM Steve Jorgensen <stevej@stevej.name> wrote:
From reading many of the suggestions in this list (including some of my own, of course) there seem to be a lot of us in the Python coder community who find the current behavior of strings being iterable collections of single-character strings to be somewhat problematic. At the same time, trying to change that would obviously have vast consequences, so not something that can simply be changed, even with at a major version boundary.
I propose that we start by making a small, safe step in the right direction so we can maybe reach the bigger goal in a far future release (e.g. 5.x or 6.x).
The step I propose is to add a `chars` method that returns a sequence (read-only view) of single-character strings that behaves exactly the same as `str` currently does when treated as a collection. Encourage developers to use that for iterating over characters in a string instead of iterating over the string directly. In maybe Python 4 or 5, directly using an `str` instance as a collection could become a warning.
BTW, while adding `chars`, it might also be nice to have `ords` which would be a view of the string's character sequence as `ord` integer values.
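
A minimal sketch of the AtomicString idea from this message, assuming (the message leaves this open) that it is written as a `str` subclass. The error text and the helper `flatten` are hypothetical.

```python
class AtomicString(str):
    """Hypothetical str subclass that refuses to be iterated."""

    def __iter__(self):
        raise TypeError(
            "AtomicString is not iterable; convert with str(...) to loop over characters"
        )


def flatten(obj):
    """Recursive descent that treats anything non-iterable as a leaf."""
    try:
        children = iter(obj)
    except TypeError:
        yield obj
        return
    for child in children:
        yield from flatten(child)


data = [AtomicString("ab"), [AtomicString("cd"), 1]]
print(list(flatten(data)))
print(list(str(AtomicString("ab"))))  # explicit conversion back to a plain str
```

Note that `flatten` as written would recurse without end on a plain `str`, since each character is itself an iterable string; that is exactly the special case the atomic type removes. Because this sketch subclasses `str`, `isinstance` checks against `str` still succeed for it.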
I've also had to special case strings when dealing with iterables generically, and it's annoying, but it's not a big deal. The real problem is when you meant to pass an iterable of strings and you just passed a single string and it produces confusing behaviour - something more subtle than each character being laid out separately. And even this is not that hard for experienced devs like us to figure out, but it really bites beginners hard, and I think that's the argument worth focusing on.

A common example is when beginners write code like this:

    cursor.execute('INSERT INTO strings VALUES (?)', 'hello')

and get this confusing error:

    sqlite3.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 1, and there are 5 supplied.

Finding questions about these is pretty easy, below are some examples:

https://stackoverflow.com/questions/54856759/sqlite3-programmingerror-incorr...
https://stackoverflow.com/questions/16856647/sqlite3-programmingerror-incorr...
https://stackoverflow.com/questions/6066681/python-sql-select-statement-from...
https://stackoverflow.com/questions/33768447/incorrect-number-of-bindings-su...
https://stackoverflow.com/questions/35560106/incorrect-number-of-bindings-su...
https://stackoverflow.com/questions/58786727/incorrect-number-of-bindings-su...

So first of all, I think we should probably have a check in the sqlite3 library for passing a single string as a parameter. But in the general case, it would be great if strings weren't iterable and trying to iterate over them raised an exception with a helpful generic message, like:

"Strings are not iterable - you cannot loop over them or treat them as a collection. Perhaps you meant to use string.chars(), string.split(), or string.splitlines()?"
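
The sqlite3 case can be reproduced with a short, self-contained script. This is a sketch using an in-memory database; the table name `strings` follows the example in the message, and the column name `value` is an assumption made here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE strings (value TEXT)")

# Passing a bare string: each of its five characters is taken as one binding.
try:
    cursor.execute("INSERT INTO strings VALUES (?)", "hello")
except sqlite3.ProgrammingError as exc:
    error = str(exc)
else:
    error = None
print(error)

# Passing a one-element tuple: the whole string is the single binding.
cursor.execute("INSERT INTO strings VALUES (?)", ("hello",))
rows = cursor.execute("SELECT value FROM strings").fetchall()
print(rows)
conn.close()
```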
On Sun, Feb 23, 2020 at 11:25:12PM +0200, Alex Hall wrote:
"Strings are not iterable - you cannot loop over them or treat them as a collection.
Are you implying that we should deprecate the `in` operator for strings too?

Strings *are* collections:

    py> import collections.abc
    py> isinstance("spam", collections.abc.Collection)
    True

Strings aren't atomic values: they contain substrings. They can be sliced. We can ask whether one string contains another: strings are containers as well as collections.

Sometimes we treat strings as if they were pseudo-atomic. And sometimes we treat tuples as pseudo-atomic records too. Should tuples no longer be iterable? I don't think so.

-- Steven
Are you implying that we should deprecate the `in` operator for strings
No, we should definitely keep the `in` operator. We can revisit the best wording for the error/warning message later, my point is just that it should be more considerate to beginners than "TypeError: 'str' object is not iterable".
Strings *are* collections:
Technically, a [collection](https://docs.python.org/3/library/collections.abc.html#collections.abc.Colle...) is an iterable sized container, so if strings aren't iterable, they aren't collections.
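
To make the definition concrete, a small sketch of a type that is sized and supports `in` but is not iterable, and therefore does not count as a `Collection`. `SubstringContainer` is a hypothetical class used only for this check.

```python
from collections.abc import Collection, Container, Iterable, Sized


class SubstringContainer:
    """Hypothetical text type: has a length and supports `in`, but no iteration."""

    def __init__(self, text):
        self._text = text

    def __len__(self):
        return len(self._text)

    def __contains__(self, piece):
        return piece in self._text


text = SubstringContainer("spam")
print(isinstance(text, Sized), isinstance(text, Container))
print(isinstance(text, Iterable), isinstance(text, Collection))
```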
Sometimes we treat strings as if they were pseudo-atomic. And sometimes we treat tuples as pseudo-atomic records too. Should tuples no longer be iterable? I don't think so.
We treat strings as pseudo-atomic FAR more than we treat tuples as such. If tuples weren't iterable then tuple unpacking wouldn't work and all hell would break loose. I don't think this comparison works.
Steven D'Aprano wrote:
On Sun, Feb 23, 2020 at 11:25:12PM +0200, Alex Hall wrote:
"Strings are not iterable - you cannot loop over them or treat them as a collection.
Are you implying that we should deprecate the in operator for strings too?
I would not get rid of the `in` behavior, but the `in` behavior of a string is actually not like that of the `in` operator for a typical collection. Seen as simply a collection of single-character strings, "b" would be in "abcd", but "bc" would not. The `in` operator for strings is checking whether the left operand is a substring as opposed to an item. `(2, 3)` is not `in` `(1, 2, 3, 4)`.
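
The difference between substring membership and item membership can be checked directly:

```python
# For str, `in` tests for a substring.
print("b" in "abcd")
print("bc" in "abcd")

# For a collection of single characters, `in` tests for an item.
print("b" in list("abcd"))
print("bc" in list("abcd"))

# A tuple is likewise searched for an item, not for a sub-sequence.
print((2, 3) in (1, 2, 3, 4))
```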
Alex Hall wrote:
I've also had to special case strings when dealing with iterables generically, and it's annoying, but it's not a big deal. The real problem is when you meant to pass an iterable of strings and you just passed a single string and it produces confusing behaviour - something more subtle than each character being laid out separately. And even this is not that hard for experienced devs like us to figure out, but it really bites beginners hard, and I think that's the argument worth focusing on. A common example is when beginners write code like this: cursor.execute('INSERT INTO strings VALUES (?)', 'hello')
and get this confusing error: sqlite3.ProgrammingError: Incorrect number of bindings supplied. The current statement uses 1, and there are 5 supplied. Finding questions about these is pretty easy, below are some examples: https://stackoverflow.com/questions/54856759/sqlite3-programmingerror-incorr... https://stackoverflow.com/questions/16856647/sqlite3-programmingerror-incorr... https://stackoverflow.com/questions/6066681/python-sql-select-statement-from... https://stackoverflow.com/questions/33768447/incorrect-number-of-bindings-su... https://stackoverflow.com/questions/35560106/incorrect-number-of-bindings-su... https://stackoverflow.com/questions/58786727/incorrect-number-of-bindings-su... So first of all, I think we should probably have a check in the sqlite3 library for passing a single string as a parameter. But in the general case, it would be great if strings weren't iterable and trying to iterate over them raised an exception with a helpful generic message, like: "Strings are not iterable - you cannot loop over them or treat them as a collection. Perhaps you meant to use string.chars(), string.split(), or string.splitlines()?"
`(3)` and `(3,)` are not the same and I think there is nothing wrong with literal integers. A library implemented in a confusing way is not an example of anything wrong with Python strings. (I myself have made this stupid mistake many times and I cannot blame either Python or sqlite for my being careless.) In my humble opinion, your example does not prove that iterable strings are faulty. They can be tricky on some occasions, I admit it... but there are many tricks in all programming languages, especially for newbies (currently I am trying to learn Lisp... again).
I agree with the numerous posters who have brought up the backward-compatibility concern. This change *would* break lots of code. At the same time, this bites me consistently, so I'd like to do something soon... at least sooner than 6.0 ;). I believe that this is better solved by static analysis. I suggested some time ago on typing-sig that we explore adding a `Chr` type to typing, and type `str` as a `Sequence[Chr]` rather than a `Sequence[str]`. You can read the proposal here (it's not very complex at all, and should be backward-compatible for all but the hairiest cases, which just need either a cast or an annotation): https://mail.python.org/archives/list/typing-sig@python.org/thread/OLCQHSNCL... With it, we have a path forward where type-checkers like mypy assure us that we're really doing what we think we're doing with that string, and require explicit annotations or casts for the ambiguous cases. That discussion fizzled out, but I'm still very much interested in exploring the idea if it seems like a realistic alternative. I think it makes much more sense than changing the mostly-sensible, well-known, often-used runtime behavior of strings. Brandt
I would like to reiterate a point that I think is very important and many people seem to be brushing aside. We don't have to *break* existing code. We can get a lot of value, at least in terms of aiding debugging, just by adding a warning. That warning never has to evolve into an exception, certainly not anytime soon. The most damage it would do is some clutter in stderr, and that would only be for some time while libraries adapted. People add deprecation warnings all the time.

Consider again the example of taking the boolean of a numpy array or pandas Series. That certainly broke some existing code. And it broke consistency where bool() is usually determined by len(). But most importantly, it was a reversible change. Right now, the maintainers could look at the community reaction and decide to make bool(array) work again as expected, or maybe in a new way. Doing so wouldn't break any working code because no working code uses bool(array). But they have chosen not to, presumably because they believe the current behaviour is still for the best.

So here's what I take from all this:

1. The 'experiment' to force users to state their intentions explicitly to avoid subtle logical bugs is deemed a success.
2. If our 'experiment' failed and users were really offended by seeing warnings, we could undo it. We'd leave chars() behind as a noop. No code would be broken by the reversal. So the extent of the damage in the worst case scenario would be even more limited. You might complain that now there'd be two ways to iterate over characters, but similarly I always choose to add .keys() when I iterate over a dict even though it's redundant, because it makes the code clearer.
3. Regarding a point made by Chris: introducing the error in bool() is considered OK even though it's sometimes hard to see where bool() is being used, such as when a user writes `df[0 < df.val < 1]` which is the equivalent of `df[0 < df.val and df.val < 1]` when they want the behaviour of `df[(0 < df.val) & (df.val < 1)]`.

On Mon, Feb 24, 2020 at 10:12 PM Brandt Bucher <brandtbucher@gmail.com> wrote:
I agree with the numerous posters who have brought up the backward-compatibility concern. This change *would* break lots of code. At the same time, this bites me consistently, so I'd like to do something soon... at least sooner than 6.0 ;).
I believe that this is better solved by static analysis. I suggested some time ago on typing-sig that we explore adding a `Chr` type to typing, and type `str` as a `Sequence[Chr]` rather than a `Sequence[str]`. You can read the proposal here (it's not very complex at all, and should be backward-compatible for all but the hairiest cases, which just need either a cast or an annotation):
https://mail.python.org/archives/list/typing-sig@python.org/thread/OLCQHSNCL...
With it, we have a path forward where type-checkers like mypy assure us that we're really doing what we think we're doing with that string, and require explicit annotations or casts for the ambiguous cases. That discussion fizzled out, but I'm still very much interested in exploring the idea if it seems like a realistic alternative. I think it makes much more sense than changing the mostly-sensible, well-known, often-used runtime behavior of strings.
Brandt
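
The hidden bool() call in a chained comparison, mentioned earlier in this message, can be shown without numpy or pandas. The sketch below uses `Column`, a hypothetical stand-in class whose truth value is deliberately ambiguous; it is not the numpy or pandas implementation.

```python
class Column:
    """Hypothetical array-like whose comparisons are element-wise."""

    def __init__(self, values):
        self.values = list(values)

    def __lt__(self, other):
        return Column(v < other for v in self.values)

    def __gt__(self, other):
        return Column(v > other for v in self.values)

    def __and__(self, other):
        return Column(a and b for a, b in zip(self.values, other.values))

    def __bool__(self):
        raise ValueError("the truth value of a Column is ambiguous")


col = Column([0.5, 2.0])

# `0 < col < 1` means `(0 < col) and (col < 1)`, so bool() is called implicitly.
try:
    0 < col < 1
except ValueError as exc:
    print("chained comparison failed:", exc)

# Parenthesised comparisons joined with `&` never call bool().
mask = (0 < col) & (col < 1)
print(mask.values)
```

The parentheses matter: `&` binds more tightly than `<`, so leaving them out produces another chained comparison.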
As one of those who 1. think there IS a problem, and 2. think the problem is bigger than most of the people on this thread seem to think, I am nevertheless in agreement that tackling the problem by changing the language to outlaw direct str iteration would just be far, far too disruptive. I am much more receptive to the idea of adding a totally separate atomic string.
I agree with the numerous posters who have brought up the backward-compatibility concern. This change *would* break lots of code. At the same time, this bites me consistently, so I'd like to do something soon... at least sooner than 6.0 ;).
I believe that this is better solved by static analysis. I suggested some time ago on typing-sig that we explore adding a `Chr` type to typing, and type `str` as a `Sequence[Chr]` rather than a `Sequence[str]`. You can read the proposal here (it's not very complex at all, and should be backward-compatible for all but the hairiest cases, which just need either a cast or an annotation):
https://mail.python.org/archives/list/typing-sig@python.org/thread/OLCQHSNCL...
With it, we have a path forward where type-checkers like mypy assure us that we're really doing what we think we're doing with that string, and require explicit annotations or casts for the ambiguous cases. That discussion fizzled out, but I'm still very much interested in exploring the idea if it seems like a realistic alternative. I think it makes much more sense than changing the mostly-sensible, well-known, often-used runtime behavior of strings.
Brandt
Also, as a person who has really come to LOVE static typing in Python since it has saved me literally HOURS of debugging the (awful) code I tend to write (mostly in PyCharm actually; haven't made much use of mypy yet), tackling this issue via mypy and static analysis very much sounds, to me, like a great way to go. If there were a Chr type with which to statically type against in the manner of Sequence[Chr], and also perhaps an "AtomicString" static type that does nothing but disallow Sequence-type behavior (iteration and slicing) in static analysis (but could still be type cast to regular old str), I sure would use the heck out of it. The type/method warnings in PyCharm would solve ALL of the problems I have run into over the years: I would no longer need to write non-idiomatic, unpythonic guarding code trying to stop myself from sending strings into functions meant for non-str iterables of strings only.
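
For reference, the kind of runtime guard described here typically looks something like the sketch below. The function name and error message are hypothetical.

```python
def ensure_iterable_of_strings(names):
    """Reject a bare string where an iterable of strings is expected."""
    if isinstance(names, (str, bytes)):
        raise TypeError(
            "expected an iterable of strings, got a single string: %r" % (names,)
        )
    return list(names)


print(ensure_iterable_of_strings(["ann", "bob"]))

try:
    ensure_iterable_of_strings("ann")
except TypeError as exc:
    print(exc)
```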
participants (23)

- Alex Hall
- Andrew Barnert
- André Roberge
- Brandt Bucher
- Bruce Leban
- Caleb Donovick
- Chris Angelico
- Christopher Barker
- David Mertz
- Dominik Vilsmeier
- Eric V. Smith
- Ethan Furman
- Greg Ewing
- jdveiga@gmail.com
- Kyle Stanley
- M.-A. Lemburg
- Paul Moore
- Rhodri James
- Richard Damon
- Ricky Teachey
- Sebastian Berg
- Steve Jorgensen
- Steven D'Aprano