[Python-ideas] Re: Incremental step on road to improving situation around iterable strings

23 Feb 2020

      On Feb 23, 2020, at 12:52, Steve Jorgensen <stevej@stevej.name> wrote:
...
The only change I am proposing is that the iterability for characters in a string be moved from the string object itself to a view that is returned from a `chars()` method of the string. Eventually, direct iterability would be deprecated and then removed.
I do not want indexing behavior to be moved, removed, or altered, and I am not suggesting that it would/should be.
That would be very weird. Something that acts like a sequence in every way (indexing, slicing, Sequence methods like count, other methods that return indices, etc.) except that it isn’t Iterable doesn’t feel like Python. Python even lets you iterate over “old-style semi-sequences”: things that define __getitem__ to work with contiguous indices starting from 0 and raise IndexError once they run out.
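To illustrate the old-style protocol mentioned above, here is a minimal sketch. The `Squares` class is a made-up example, not anything from the thread; it defines only `__getitem__`, yet `iter()` still accepts it.

```python
class Squares:
    """Hypothetical old-style semi-sequence: only __getitem__, no __iter__ or __len__."""

    def __init__(self, n):
        self._n = n

    def __getitem__(self, index):
        # Valid indices are 0..n-1; IndexError tells the fallback iterator to stop.
        if not 0 <= index < self._n:
            raise IndexError(index)
        return index * index


squares = Squares(4)
has_dunder_iter = hasattr(squares, "__iter__")  # False: no explicit iteration support
collected = list(squares)  # iter() falls back to __getitem__(0), __getitem__(1), ...
print(has_dunder_iter, collected)
```

So a str that kept `__getitem__` but refused iteration would have to opt out of this fallback explicitly, which is part of why it would feel un-Pythonic.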

I think if you want to move iteration to chars, you’d want to move sequence behavior there too.

Also, I think you’d still want the chars view to iterate a new char type rather than strs or chars views; otherwise you still have the infinite regress problem. It would only show up when you decide to explicitly recurse into str (iterate anything that’s iterable, and iterate chars() on anything that’s a str), but when you do, it’s just as bad as the current state: there’s still no way to say “recursively iterate strings, but only down to characters, not infinitely”.
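For anyone who hasn't run into the regress, here is a rough sketch of how it shows up with today's str. The function names `flatten` and `flatten_guarded` are illustrative only. A chars view that yielded one-character strs would, I assume, hit the same wall as soon as the recursion called chars() on each of them.

```python
from collections.abc import Iterable


def flatten(obj):
    """Naively recurse into anything iterable, yielding the leaves."""
    if isinstance(obj, Iterable):
        for item in obj:
            yield from flatten(item)
    else:
        yield obj


def flatten_guarded(obj):
    """Same idea, with the usual special case that treats a whole str as a leaf."""
    if isinstance(obj, Iterable) and not isinstance(obj, str):
        for item in obj:
            yield from flatten_guarded(item)
    else:
        yield obj


nested = [1, ["ab", [2, "c"]]]

try:
    list(flatten(nested))
    hit_regress = False
except RecursionError:
    # "a" iterates to "a", which iterates to "a", ... with no base case.
    hit_regress = True

leaves = list(flatten_guarded(nested))
print(hit_regress, leaves)
```

The guarded version stops the regress, but only by never descending into strings at all.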

I’m not sure I like the idea in any variation for Python, but a few more points in favor of it:

The chars view would open the door for additional views on strings. See Swift, which makes you state explicitly whether you want to iterate UTF-8 code units (bytes), UTF-32 code points, or extended grapheme clusters, instead of just picking one so that’s what you get (and the other two require constructing some separate object that copies stuff from the string). After all, a string is an iterable of all of those things; the fact that it happens to be stored as an array of Latin-1 bytes, UCS-2 code units, or UTF-32 code points, with a cache of UTF-8 bytes, doesn’t force us to treat it as an iterable of UTF-32 code points; only legacy reasons do.
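As a small illustration of those different "iterables" living in one string, here is what you can get at with current Python. This only uses existing str methods; the views themselves remain hypothetical, and as far as I know the standard library at the time of this thread has nothing for grapheme clusters.

```python
text = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT; displays as one character

# What iterating a str yields today: one item per code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# UTF-8 code units require building a separate bytes object.
utf8_units = list(text.encode("utf-8"))

# UTF-16 code units, counted via another encoded copy (2 bytes per unit).
utf16_unit_count = len(text.encode("utf-16-le")) // 2

print(code_points, utf8_units, utf16_unit_count)
```

A user would call this string one character long, but none of the counts above is 1: there are 2 code points, 3 UTF-8 code units, and 2 UTF-16 code units.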

And having a strutf8view could mean that in many apps, all bytes objects are binary data rather than encoded text, which makes the bytes type more semantically meaningful in those apps.

It could also make bridging libraries to languages where strings aren’t iterable more reasonable. For example, IIRC, pyobjc NSString objects today have methods to iterate strs so they can ducktype as strings; if strings weren’t Iterable, they could be much closer to a trivial pure bridge to the ObjC type.

Finally, a bunch of unicodedata functions and so on that only make sense on single characters have to take str today and raise an exception (a TypeError in CPython) if passed multiple characters. (There are even some Unicode functions that only make sense on single EGCs, but I think Python doesn’t provide any of them.) Passing a char object, you’d know statically that it makes sense; passing a str object, you don’t.
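A quick sketch of that run-time check as it behaves in CPython today. The except clause catches both exception types on purpose, since which one is raised is an implementation detail I wouldn't want to rely on.

```python
import unicodedata

single_name = unicodedata.name("\u00e9")  # exactly one code point: accepted

try:
    unicodedata.name("ab")  # two code points: rejected, but only at run time
    error_type = None
except (TypeError, ValueError) as exc:
    error_type = type(exc).__name__

print(single_name, error_type)
```

Both arguments are plain str, so nothing short of running the call distinguishes the valid one from the invalid one.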

Andrew Barnert