[Python-ideas] Re: Incremental step on road to improving situation around iterable strings

3 Mar 2020

On Mar 2, 2020, at 15:13, Steven D'Aprano <steve@pearwood.info> wrote:
...
On Sun, Feb 23, 2020 at 01:46:53PM -0500, Richard Damon wrote:
...
I would agree with this. In my mind, fundamentally a 'string' is a
sequence of characters, not strings,
If people are going to seriously propose this Character type, I think 
they need to be more concrete about the proposal and not just hand-wave 
it as "strings are sequences of characters".
I actually wrote a half-proposal on this several years back (plus a proof-of-concept implementation that adds the chr type, but doesn’t change str to use it or interact with it), but decided the backward compatibility problems were too big to go forward. I can dig it up if anyone’s interested, but I can summarize it and answer your questions from memory.

The type is called chr, it represents a single Unicode code point, and it can be constructed from a chr, a length-1 str, or an int. It has an __int__ method but not an __index__. The repr is chr("A"); the str is just A. It is not Iterable; this is the whole point, after all: recursing on iter(element) hits a base case.
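
For illustration, here is a minimal sketch of such a type, written from the description above rather than taken from the proof of concept. The class is named Chr only to avoid shadowing the built-in chr; the storage attribute, range check, and error messages are assumptions.

```python
class Chr:
    """Hypothetical single-code-point type (a sketch, not the original PoC)."""

    __slots__ = ("_cp",)

    def __init__(self, value):
        if isinstance(value, Chr):
            self._cp = value._cp
        elif isinstance(value, str):
            if len(value) != 1:
                raise ValueError("Chr() expects a length-1 str")
            self._cp = ord(value)
        elif isinstance(value, int):
            if not 0 <= value <= 0x10FFFF:
                raise ValueError("code point out of range")
            self._cp = value
        else:
            raise TypeError(f"cannot make a Chr from {type(value).__name__}")

    def __int__(self):
        # int(c) works, but there is deliberately no __index__,
        # so a Chr cannot be used as a subscript.
        return self._cp

    # Deliberately no __iter__, __len__, or __getitem__:
    # recursing on iter(element) stops here with a TypeError.

    def __repr__(self):
        return f'chr("{chr(self._cp)}")'

    def __str__(self):
        return chr(self._cp)


c = Chr("A")
print(repr(c), str(c), int(c))   # chr("A") A 65
```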

Note that a chr cannot represent an extended grapheme cluster; that would have to be represented as a str, or as some new type that’s also a Sequence[chr]. But since Python strings don’t act like sequences of EGCs today, that’s not a new problem. It could be used to hold code units (so bytes could also become a sequence of chr instead of int), but that seemed too confusing (if chr(196) could be either a UTF-8 lead byte or the single character 'Ä' depending on how you got it… that feels like reopening the door to the same problems Python 3 eliminated).

It has some of the same methods as str, like lower, but not those that are container-ish like find. (What about encode? translate? I can’t remember.)

Meanwhile, all the container-ish methods on str (including __contains__) can now take a chr, but can also still take a str (in which case they still do substring rather than containment tests). This is a little weird, but that weirdness is already in str today (x in y does not imply any(a==x for a in y) today; it would become true if x is chr but still not when x is str), and convenient enough that you wouldn’t want to get rid of it.
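
The existing weirdness is easy to see in current Python. This small example uses only built-in behavior; the variable names are just for illustration.

```python
haystack = "abc"

# For str, `in` is a substring test, so it can be True even though
# no single element of the string equals the needle.
substring_hit = "ab" in haystack
element_hit = any(a == "ab" for a in haystack)
print(substring_hit, element_hit)    # True False

# For other sequences, `in` is an element test, not a subsequence test.
sublist_hit = [1, 2] in [1, 2, 3]
item_hit = 2 in [1, 2, 3]
print(sublist_hit, item_hit)         # False True
```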

It would be really nice to be able to construct a str from any Iterable[chr], but of course that can’t work, because str(x) already returns the string form of any object, iterables included. So, how do you go back to a str once you’ve converted into a different Iterable of chr? You need a new alternate constructor classmethod like str.fromchars, I think.

IIRC, you can add chr to each other and to str, getting a str. You can also multiply them, getting a str (even chr("A")*1 is a str, not a chr). Again, it’s a little weird for non-sequence types to have sequence-like concat/repeat but I don’t think it looks confusing in actual examples, and again, it’s convenient enough that I think it’s worth it.
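
A rough sketch of those concat/repeat rules, assuming a stripped-down stand-in class (named Chr here, holding a length-1 str) that implements only the operators in question. Everything about the class is hypothetical.

```python
class Chr:
    """Hypothetical stand-in for the proposed type; only + and * are shown."""

    def __init__(self, s):
        if not (isinstance(s, str) and len(s) == 1):
            raise ValueError("expected a length-1 str")
        self._s = s

    def __str__(self):
        return self._s

    def __add__(self, other):
        # Chr + Chr and Chr + str both produce a plain str.
        if isinstance(other, (Chr, str)):
            return self._s + str(other)
        return NotImplemented

    def __radd__(self, other):
        # str + Chr ends up here and also produces a plain str.
        if isinstance(other, str):
            return other + self._s
        return NotImplemented

    def __mul__(self, n):
        # Repetition always produces a str, even for n == 1.
        if isinstance(n, int):
            return self._s * n
        return NotImplemented

    __rmul__ = __mul__


a, b = Chr("A"), Chr("B")
print(a + b, a + "x", "x" + a, a * 3)   # AB Ax xA AAA
print(type(a * 1).__name__)             # str
```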

I considered supporting chr+int and chr-int (which would do the same as chr(ord(self)+other), and would make it easier to translate a lot of C code), but without chr%int that feels incomplete, while with chr%int it feels confusing (it would be completely different from str%, and also, having % be arithmetic but * be repeat on the same type just seems very wrong).
...
Presumably you would want `mystring[0]` to return a char, not a str, but 
there are plenty of other unspecified details.
- Should `mystring[0:1]` return a char or a length 1 str?
A str. That’s how all sequences work—slicing a sequence (even a len=1 slice) never returns an element; it always returns a sequence of the same type (or, for some third party types, a view that duck types as the same type).
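
This is how the built-in sequences behave today, as a quick check shows. Note how bytes already distinguishes the element type (int) from the slice type (bytes), while str does not.

```python
items = [10, 20, 30]
pair = (10, 20)
data = b"abc"
text = "abc"

print(items[0], items[0:1])   # 10 [10]
print(pair[0], pair[0:1])     # 10 (10,)
print(data[0], data[0:1])     # 97 b'a'

# Today both of these are str; under the proposal only the slice would be.
print(type(text[0]).__name__, type(text[0:1]).__name__)   # str str
```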
...
- Presumably "Z" remains a length-1 str for backward compatibility,
so how do you create a char directly?
chr("Z") or, if you really want to, "Z"[0] would also work with two fewer keystrokes.
...
- Does `chr(n)` continue to return a str?
No, because it’s the constructor of the new type. This is a backward compatibility problem, of course. Which could be solved by naming the type something else, like char, and making chr still return a str, but IIRC, this backward compatibility problem is subsumed in the larger one below, so there’s no point fixing just this one. (Also, any new name you come up with probably already appears in lots of existing code, which would add confusion; reusing chr doesn’t have that problem.)
...
- Is the char type a subclass of str?
No. It doesn’t support much of the str API, including all of the ABCs, so that would badly break substitutability. Also, it would defeat the purpose of having a separate type.
...
- Do we support mixed concatenation between str and char?
Yes. See above.
...
- If so, does concatenating the empty string to a char give a char 
or a length-1 string?
str+chr, chr+str, and chr+chr all return str, always. Just like chr*int returns str even when the int is 1.
...
- Are chars indexable?
No.
...
- Do they support len()?
No.
...
If char is not a subclass of string, that's going to break code that 
expects `all(isinstance(c, str) for c in obj)` to be true when 
`obj` happens to be a string.
Yes. This is the biggest backward compatibility problem, and the one that made me abandon the proposal before sharing it. A str is an Iterable of str today, and there is code that expects this. Not that often directly, but indirectly it comes up all the time.

A lot of code just duck-types the elements in ways that would continue to work. But the big problem is the same one you face with any attempt to duck type str: zillions of C extension functions, both stdlib and third party, will only take a str, which you can call the PyUnicode API on. IIRC, Nick Coghlan tried creating an “encoded_str” type that is-a str and also is-a bytes (and knows its encoding), and wrote up all the problems he ran into, and most of them applied here as well. The most obvious one is str.join (and fixing that to take an Iterable[str|chr] would solve 90% of the problems with student exercises), but it’s just one of a huge number of similar functions (e.g., IIRC, _pyio.TextIOWrapper.write works, but the C version doesn’t—although I think there was a bug report for that one, so maybe it’s not true anymore), and there’s no way to fix them all.
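
The str.join case can be reproduced today with any duck type that is not a real str. The DuckStr class below is a hypothetical stand-in, not part of any proposal.

```python
class DuckStr:
    """Hypothetical str look-alike that is not a str subclass."""

    def __init__(self, s):
        self._s = s

    def __str__(self):
        return self._s

    def lower(self):
        return DuckStr(self._s.lower())


parts = [DuckStr("a"), DuckStr("b")]

# str.join insists on real str instances, however str-like the items are.
try:
    "".join(parts)
    join_accepted = True
except TypeError as exc:
    join_accepted = False
    print("join rejected it:", exc)

# The workaround is an explicit conversion at every such call site.
joined = "".join(str(p) for p in parts)
print(joined)   # ab
```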

There’s another issue with whether "Z"==chr("Z") or not. If not, a lot of student exercise code and quick-and-dirty scripting breaks: ''.join({'a': 'n', 'b': 'o', …}.get(c, c) for c in a) won’t work anymore (although similar code using str.translate, or dict(zip("abc…", "nop…")) does still work). But if so… I can’t remember the problem (something to do with a cache somewhere, probably?), but there was one.
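
A small sketch of the "not equal" case, assuming a hypothetical element type (named Chr here) that neither compares equal to nor hashes like the corresponding str. The two-entry table is only a shortened example.

```python
rot = {"a": "n", "b": "o"}

# Today: iterating a str yields str, so every dict lookup hits.
today = "".join(rot.get(c, c) for c in "ab")
print(today)   # no


class Chr:
    """Hypothetical element type that does NOT compare equal to str."""

    def __init__(self, s):
        self._s = s

    def __str__(self):
        return self._s


# If iteration yielded such objects, every lookup would miss and .get()
# would silently fall back to its default, leaving the text untranslated.
looked_up = [rot.get(c, c) for c in (Chr("a"), Chr("b"))]
untranslated = [str(x) for x in looked_up]
print(untranslated)   # ['a', 'b']
```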
...
If char is a subclass, that means we can no longer deny that strings are 
sequences of strings, since chars are strings. It also means that it 
will break code that expects strings to be iterable,
Right, but it’s already so wrong that you don’t even have to get this far to rule it out. It’s the non-subclass answer that’s attractive (but ultimately, I think, still doesn’t pan out).
...
I don't have a good intuition for how much code will break or simply 
stop working correctly if we changed string iteration to yield a new 
char type instead of length-1 strings.
A lot more than I expected. :)
...
Nor do I have a good intuition for whether this will *actually* help 
much code.
In cases where you really do want to just flatten/recurse/whatever infinitely, it solves the problem perfectly.

But the reality is that often you want to treat strings as atoms, not iterables of characters, so you’d still need the same switching code as today—and I’m not sure debugging problems related to chr not being usable as a str is any easier than debugging the RecursionError. For example, imagine code to dump all the “leaf values” in a JSON document. If you just blindly recurse into everything Iterable, you get a RecursionError and slap yourself and fix it. If you instead successfully write a file full of single characters, you won’t discover the problem any faster…
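
The current failure mode and the usual switching code look roughly like this. The function names are made up for the example, and the sample document is a plain nested list rather than real JSON.

```python
from collections.abc import Iterable


def flatten_naive(obj):
    """Recurse into anything iterable; a length-1 str never bottoms out."""
    if isinstance(obj, Iterable):
        for item in obj:
            yield from flatten_naive(item)
    else:
        yield obj


def flatten(obj):
    """Treat str and bytes as atoms: the usual switching code."""
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        for item in obj:
            yield from flatten(item)
    else:
        yield obj


doc = ["ab", ["cd", 1]]

try:
    list(flatten_naive(doc))
    naive_failed = False
except RecursionError:
    naive_failed = True
    print("naive version hit RecursionError on a str")

leaves = list(flatten(doc))
print(leaves)   # ['ab', 'cd', 1]
```

With a non-iterable chr type, the naive version would stop raising and would instead yield the individual characters, which is the silent failure described above.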

It still might be a worthwhile tradeoff in a new Python-like language; I’m really not sure. I think I’d actually want to have a chr type, but have str not be Iterable, and instead have a property that was (in fact, separate properties for iterating code points, EGCs, and UTF-8 bytes, all as different types).
...
...
so as you iterate over a string,
you shouldn't get another string, but a single character type (which
Python currently doesn't have). It would be totally shocking if someone
suggested that iterating a list or a tuple should return lists or tuples
of 1 element, so why do strings do this?
Would it be so shocking though?
If lists are *linked lists* of nodes, instead of arrays, then:
- iterating over linked lists of nodes gives you nodes;
- and a single node is still a list;
Well, usually you’d want the iterator to yield just the node.value, not the node itself, at least if you’re thinking Lisp/ML/Haskell style with cons lists. And if you’re not thinking cons lists, often the list is a handle object (e.g., if you want to mutably insert an element at the start, or if you want a doubly linked list, etc.), as in C++, not a node.

But there are cases where you’re dealing with complex objects that are internally linked up (e.g., I used to deal with a ton of C code that dealt with internally linked database/filesystem-style extents lists, and the handle is not the list object but an implicit thing somewhere else, and the list itself is just the node), and for cases like that, your point is valid.


Andrew Barnert