Python 4000: Have stringlike objects provide sequence views rather than being sequences

There are many cases in which it is awkward that testing whether an object is a sequence returns `True` for instances of `str`, `bytes`, etc. This proposal is a serious breakage of backward compatibility, so would be something for Python 4.x, not 3.x. Instead of those objects _being_ sequences, have them provide views that are sequences using a method named something like `members` or `items`.
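For concreteness, a minimal sketch of what such a view might look like (the class name StrMembers and the members() spelling are made up for illustration, not part of the proposal text):

    from collections.abc import Sequence

    class StrMembers(Sequence):
        """A sequence view of a string's characters, so the string itself
        would no longer need to be a Sequence."""
        def __init__(self, s):
            self._s = s
        def __len__(self):
            return len(self._s)
        def __getitem__(self, index):
            return self._s[index]

    # Under the proposal, isinstance(some_str, Sequence) could be False,
    # while some_str.members() (here StrMembers(some_str)) is a real
    # Sequence you can index and iterate.
    view = StrMembers("spam")
    assert list(view) == ['s', 'p', 'a', 'm']
    assert view[1] == 'p'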

On Oct 13, 2019, at 12:02, Steve Jorgensen <stevej@stevej.name> wrote:
There are many cases in which it is awkward that testing whether an object is a sequence returns `True` for instances of `str`, `bytes`, etc.
This proposal is a serious breakage of backward compatibility, so would be something for Python 4.x, not 3.x.
I’m pretty sure almost nobody wants a 3.0-like break again, so this will probably never happen.
Instead of those objects _being_ sequences, have them provide views that are sequences using a method named something like `members` or `items`.
Nothing else in Python works like this. Dicts do have an `items` method, but that provides an iterable (but not indexable) view of key-value pairs, while the dict itself is an iterable of its keys. So I think this would be pretty confusing.
Also, would you want them to not be iterable either? If so, that would break even more code; if not, I don’t think it would actually solve that much in the first place. The main problem is that a str is a sequence of single-character str, each of which is a one-element sequence of itself, etc. forever.
If you wanted to change this, I think it would make more sense to go the opposite way: leave str a sequence, but make it a sequence of char objects. (And likewise, bytes and bytearray could be sequences of byte objects—or just go all the way to making them sequences of ints.) And then maybe add a c prefix for defining char constants, and you’ve solved all the problems without having to add new confusing methods or properties.
Meanwhile, the most common places you run into this problem are in functions that take a single str argument or a single iterable-of-str argument. Most such cases have already been solved by taking a str or tuple-of-str, which is clunky, even if it’s worked since Python 0.9. But a better solution for almost all such cases is to just change the function to take a *args parameter for 0 or more string arguments.
While we’re at it, if you really wanted to make a radical breaking change to Python involving view objects, I’d prefer one that expanded on dict views, to make all kinds of lazy view objects that are sequences or sets (e.g., calling map on a sequence gives you a sequence that’s computed on the fly; filtering a set gives you a set; reversing a sequence gives you a sequence; etc.), rather than making something else that’s kind of similar but doesn’t work the same way.
And finally, if you want to break strings, it’s probably worth at least considering making UTF-8 strings first-class objects. They can’t be randomly accessed, but with an iterable-plus API like files, with seek/tell, or a new more powerful iterable API like Swift or C++, a lot of languages have found that to be a useful trade off anyway.
But again, I doubt any of this is likely to happen, as nobody wants to go through another decade-long painful transition unless the benefits are a whole lot bigger than fixing a couple of minor things people have already learned how to deal with.
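As an aside, the *args pattern mentioned above looks roughly like this in practice (a sketch; ends_with_any is a made-up name standing in for an endswith-style API):

    def ends_with_any(text, *suffixes):
        # Zero or more suffix strings instead of "a str or a tuple of str",
        # so a lone string can never be mistaken for a sequence of strings.
        return any(text.endswith(s) for s in suffixes)

    assert ends_with_any("photo.jpeg", ".jpg", ".jpeg")
    assert not ends_with_any("notes.txt", ".jpg", ".jpeg")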

On Mon, Oct 14, 2019 at 6:49 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
And finally, if you want to break strings, it’s probably worth at least considering making UTF-8 strings first-class objects. They can’t be randomly accessed, but with an iterable-plus API like files, with seek/tell, or a new more powerful iterable API like Swift or C++, a lot of languages have found that to be a useful trade off anyway.
Breaking the str type to do this seems like a really REALLY bad idea, but if you want a first-class UTF8String, you can certainly have it. Build it on top of some sort of byte buffer (maybe bytearray rather than bytes) with a whole lot of handy methods, and there you are. ChrisA

Yup. I think you're absolutely right. After I posted this, I had a better idea: https://mail.python.org/archives/list/python-ideas@python.org/thread/OVP6SIO...

On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
I've thought for a long time that this would be a "good thing". The "string or sequence of strings" issue is pretty much the only hidden-bug-triggering type error I've gotten since "true division". The only way we really live with it fairly easily is that strings are pretty much never duck typed -- so I can check if I got a string, and then I know I didn't get a sequence of strings.
But I've always wondered how disruptive it would be to add a char type -- it doesn't seem like it would be very disruptive, but I have not thought it through at all. And I'm not sure how much string functionality a char should have -- probably next to none, as the point is that it would be easy to distinguish from a string that happened to have one character.
By the way, the bytes and bytearray types already do this -- index into or loop through a bytes object, you get an int.
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
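The "strings are never duck typed" check mentioned above usually looks something like this (a sketch; the function name is made up):

    def normalize_filenames(filenames):
        # A lone str would otherwise be silently iterated character by
        # character, which is exactly the hidden bug described above.
        if isinstance(filenames, str):
            filenames = [filenames]
        return [name.strip() for name in filenames]

    assert normalize_filenames("spam.txt ") == ["spam.txt"]
    assert normalize_filenames([" a.txt", "b.txt "]) == ["a.txt", "b.txt"]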

On Oct 23, 2019, at 16:00, Christopher Barker <pythonchb@gmail.com> wrote:
Well, just adding a char type (and presumably a way of defining char literals) wouldn’t be too disruptive. But changing str to iterate chars instead of strs, that probably would be. Also, you’d have to go through a lot of functions and decide what types they should take. For example, does str.join still accept a string instead of an iterable of strings? Does it accept other iterables of char too? (I have used ' '.join on a string in real life production code, even if I did feel guilty about it…) Can you pass a char to str.__contains__ or str.endswith? What about a tuple of chars? Or should we take the backward-compat breaking opportunity to eliminate the “str or tuple of str” thing and instead use *args, or at least change it to “str or iterable of str (which no longer includes str itself)”?
And I'm not sure how much string functionality a char should have -- probably next to none, as the point is that it would be easy to distinguish from a string that happened to have one character.
Surely you’d want to be able to do things like isdigit or swapcase. Even C has functions to do most of that kind of stuff on chars. But I think that, other than join and maybe encode and translate, there’s an obvious right answer for every str method and operator, so this isn’t too much of a problem. Speaking of operators, should char+int and char-int and char-char be legal? (What about char%int? A thousand students doing the rot13 assignment would rejoice, but allowing % without * and // is kind of weird, and allowing * and // even weirder—as well as potentially confusing with str*int being legal but meaning something very different.)
By the way, the bytes and bytearray types already do this -- index into or loop through a bytes object, you get an int.
Sure, but b'abc'.find(66) is -1, and b'abc'.replace(66, 70) is a TypeError, and so on. Fixing those inconsistencies is what I meant by “go all the way to making them sequences of ints”. But it might be friendlier to undo the changes and instead add a byte type like the char type for bytes to be a sequence of. I’m not sure which is better. But anyway, I think all of these questions are questions for a new language. If making str not iterate str was too big a change even for 3.0, how could it be reasonable for any future version?
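The inconsistency being described can be checked directly in current CPython:

    # Indexing bytes gives ints, and find() accepts an int, but replace()
    # insists on a bytes-like argument.
    assert b'abc'[0] == 97
    assert b'abc'.find(66) == -1          # 66 is 'B'; accepted, but not found
    try:
        b'abc'.replace(66, 70)
    except TypeError as exc:
        print(exc)                        # a bytes-like object is required, not 'int'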

There's a reason I've never actually proposed adding a char ....
On Wed, Oct 23, 2019 at 5:34 PM Andrew Barnert <abarnert@yahoo.com> wrote:
Well, just adding a char type (and presumably a way of defining char literals) wouldn’t be too disruptive.
Sure.
But changing str to iterate chars instead of strs, that probably would be.
And that would be the whole point -- a char type by itself isn't very useful. In some sense, the only difference between a char and a str would be that a char isn't iterable -- but the benefit would be that a string is an iterable (and sequence) of chars, rather than an (infinitely recursable) iterable of strings.
Also, you’d have to go through a lot of functions and decide what types they should take.
Sure would -- a lot of thought to see how disruptive it would be ...
For example, does str.join still accept a string instead of an iterable of strings? Does it accept other iterables of char too?
if it accepted an iterable of either char or str, then I *think* there would be little disruption.
Can you pass a char to str.__contains__
yes, that's a no brainer, the whole point is that a string would be a sequence of chars.
or str.endswith?
I would think so -- a char would behave like a length-one string as much as possible.
What about a tuple of chars?
That's an odd one -- but I'm not sure I see the point: if you have a tuple of chars, you could "".join() them if you want a string, in any context.
Or should we take the backward-compat breaking opportunity to eliminate the “str or tuple of str” thing and instead use *args, or at least change it to “str or iterable of str (which no longer includes str itself)”?
Is this for .endswith() and friends? If so, there was discussion a while back about that -- but probably not the time to introduce even more backward incompatible changes.
And I'm not sure how much string functionality a char should have -- probably next to none, as the point is that it would be easy to distinguish from a string that happened to have one character.
Surely you’d want to be able to do things like isdigit or swapcase. Even C has functions to do most of that kind of stuff on chars.
Probably -- it would be least disruptive for a char to act as much as possible the same as a length-one string -- so maybe iterability and indexability would be the only differences.
But I think that, other than join and maybe encode and translate, there’s an obvious right answer for every str method and operator, so this isn’t too much of a problem.
Not sure why encode or translate should be an issue off the top of my head -- it would surely be a unicode char :-)
Well, we'd have to go through all of them, and do a lot of thinking... I think the greater confusion is where you can use a char instead of a string in other places. Using it as a filename, for instance, would make it pointless for at least the cases I commonly deal with (a list of filenames). I can only imagine how many "things" take a string where a char would make sense, but then it gets harder to distinguish them all.
Speaking of operators, should char+int and char-int and char-char be legal?
I would say no -- in C a char IS an unsigned 8-bit int, but that's C -- in Python a char and a number are very different things. ord() and chr() would work, of course.
By the way, the bytes and bytearray types already do this -- index into or loop through a bytes object, you get an int.
Sure, but b'abc'.find(66) is -1, and b'abc'.replace(66, 70) is a TypeError, and so on.
I wonder if they need to be -- would we need a "byte" type, or would it be OK to accept an int in all those sorts of places?
Fixing those inconsistencies is what I meant by “go all the way to making them sequences of ints”. But it might be friendlier to undo the changes and instead add a byte type like the char type for bytes to be a sequence of. I’m not sure which is better.
Me neither.
If making str not iterate str was too big a change even for 3.0, how could it be reasonable for any future version?
Well, I don't know that it was seriously considered -- with the Unicode changes, that WOULD have been the time to do it! Again though, it seems like it would be pretty disruptive, so a non-starter, but maybe not?
-CHB
-- Christopher Barker, PhD
Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Christopher Barker wrote:
I've always wondered how disruptive it would be to add a char type
I'm not sure if it would help much. Usually the problem with strings being sequences of strings lies in the fact that they're sequences at all. Code that operates generically on nested sequences often has to special-case strings so that it can treat them as atomic values. Having them be sequences of something else wouldn't change that.
If I were to advocate changing anything in that area, it would be to make strings not be sequences. They would support slicing, but not indexing single characters, and would not be directly iterable. If you really wanted to iterate over characters, there could be a method such as s.chars() giving a sequence view.
But that would be a disruptive enough change for so little benefit that I don't expect it to ever happen.
-- Greg

On Thu, Oct 24, 2019 at 1:13 AM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Having them be sequences of something else wouldn't change that.
Wouldn't it? Once you got to an object that couldn't be iterated, you'd know you had an atomic value. And this is why I was thinking that if chars had less functionality, it would work. This is really common code for me that I need to type check:

    for filename in sequence_of_filenames:
        open(filename)

If a char could not be used as a filename, then I'd get a similar error if a single string was passed in as if a list of numbers was passed in, say. That is, a string is a sequence of chars, not a sequence of strings, and a char cannot be used as a string in many contexts.
If I were to advocate changing anything in that area, it would be to make strings not be sequences.
You are right -- that would be a great solution to the above problem. And I can't think of many real uses for iterating strings where you don't know that you want the chars, so a .chars() iterator, or maybe str.iter_chars(), would be fine. Something tells me that I've had uses for char in other contexts, but I can't think of them now, so maybe not :-) But again -- too disruptive, we've lived with this for a LONG time. -CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Christopher Barker wrote:
wouldn't it? once you got to an object that couldn't be iterated, you'd know you had an atomic value.
I'm thinking of things like a function to recursively flatten a nested list. You probably want it to stop when it gets to a string, and not flatten the string into a list of characters. -- Greg
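The special-casing Greg describes typically looks like this (a sketch; without the isinstance test, the recursion never bottoms out, because each one-character string is itself an iterable of strings):

    from collections.abc import Iterable

    def flatten(items):
        for item in items:
            if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
                yield from flatten(item)      # recurse into real containers
            else:
                yield item                    # strings (and scalars) are atomic

    assert list(flatten([1, [2, "ab", ["cd", 3]]])) == [1, 2, "ab", "cd", 3]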

On Oct 24, 2019, at 14:13, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
A function to recursively flatten a nested list should only work on lists; it should stop on a string, but it should also stop on a namedtuple or a 2x2 ndarray or a dict. A function to recursively flatten arbitrary iterables, on the other hand…
And I don’t think there’s any conceptual problem with strings being iterable. A C++ string is a sequence of chars. A Haskell string is a plain old (lazy linked) list of chars. And similarly in lots of other languages. And it’s rarely a problem.
There are other differences that might be relevant here; I don’t think they are, but to be fair: C++ and Haskell implementations are expected to optimize everything well enough that you can just use any arbitrary sequence of chars as a string with reasonable efficiency, so strings being a thin convenience wrapper above that makes intuitive sense. In Python, that isn’t true; a function that loops character by character would often be too slow to use. C++ and Haskell type systems make it a little easier to say “Here’s a function that works on generic iterables of T, but when T is char here’s a more specific function”.
But again, I don’t think either of these is the reason Python strings being iterable is a problem; I think it really is primarily about them being iterables of strings.

On Thu, 24 Oct 2019 at 23:47, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
But again, I don’t think either of these is the reason Python strings being iterable is a problem; I think it really is primarily about them being iterables of strings.
The *real* problem is that there's a whole load of functions that would need rewriting to accept "character or string" arguments - or a whole load of debating over whether they should only use one or another. Examples:

    "a" in char_string, vs "word" in sentence    (both useful and used in real code)
    ",".join(list_of_stuff) vs ", ".join(list_of_stuff)
    "".join(list_of_characters)

And return values: string.partition(sep) - if sep is a character, should the middle return value be a character too? Remember that the typing module needs to be able to express the type signatures of all these functions, in a way that usefully allows checking usage. Plus many, many more. And not just in the stdlib, but in millions of lines of 3rd party code.
And no cheating by saying these are cases where you should use 1-character strings. The fact that people don't typically distinguish between characters and 1-character strings in real life is precisely why it's useful that Python currently doesn't make a distinction either.
In case it's not glaringly obvious, I'm -1 on this, even as an exercise in speculation :-)
Paul

On Oct 25, 2019, at 01:34, Paul Moore <p.f.moore@gmail.com> wrote:
That’s not the problem that this thread is trying to solve, it’s the problem with the solution to that problem. :) I’ve already said that I don’t think this is feasible, because it would be too much work and too disruptive to compatibility. If you were designing a new Python-like language today, or if you had a time machine back to the 90s, it would be a different story. But for Python 4.0—even if we wanted a 3.0-like break, which I don’t think anyone does—we can’t break all of the millions of lines of code dealing with strings in a way that forces people to not just fix that code but rethink the design.
Many of your examples are not cases where people should use 1-character strings; they’re cases where we need polymorphic APIs. But that’s not a problem. Countless other languages already do exactly that, and people use them every day. (The languages that are too weak to do that kind of polymorphism, like C and PHP, instead have to offer separate functions like strstr vs. strchr, which is manifestly less friendly. Fortunately, the fact that Python’s core API was originally loosely based on C’s string.h wouldn’t in any way force the same problem on Python or a Python-like language.)
And, while there are plenty of functions that would need to treat char and 1-char str the same, there are also many—not just iter—that should only work for one or the other, such as ord, or re.search. And there are also functions that should work for both, but not do exactly the same thing—char replacement is the same thing as translate; substring replacement is not. In fact, notice what happens if you call ord on a 2-char string today: TypeError: ord() expected a character, but string of length 2 found. Python is _pretending_ that it has a character type even though it doesn’t.
And you find the same thing in third-party code: some of the functions people have written would need to handle str and char polymorphically, some would need to handle both but with different logic, and many would need to handle just one or the other. Which is exactly why it couldn’t be fixed with a 3to4 script, and people would instead need to rethink the design of a bunch of their existing functions before they could upgrade.
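The ord() behaviour mentioned above, for reference:

    print(ord('a'))      # 97
    try:
        ord('ab')
    except TypeError as exc:
        print(exc)       # ord() expected a character, but string of length 2 found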
In case it's not glaringly obvious, I'm -1 on this, even as an exercise in speculation :-)
I’m -1 on this, but I think speculating on how it could be solved is the best way to show that it can’t and shouldn’t be solved.

On 25.10.19 15:53, Andrew Barnert via Python-ideas wrote:
If you were designing a new Python-like language today, or if you had a time machine back to the 90s, it would be a different story.
Interesting: how far into the past would you need to travel? Initially, builtin types did not have methods or properties, and the iterable protocol did not exist. Adding all of this would have required too much work, and I am not sure Guido would have liked how much complexity it added to his simple language.

On Oct 25, 2019, at 06:26, Serhiy Storchaka <storchaka@gmail.com> wrote:
Well, the str methods are largely carried over from the functions in the string module, which was there before 1.0. And I think the ord builtin goes back pretty far as well. So ideally, back to the start. On the other hand, it’s not like there was a huge ecosystem of third-party modules using Python 0.9 that Guido couldn’t afford to break, so if your time machine couldn’t go back quite that far, it might be ok to do it as late as, say, 2.2.
Adding this will require too much work, and I am not sure Guido would like how much complexity it adds to his simple language.
Adding it today would certainly require too much work—not so much for Python itself as for thousands of third-party libraries and even more applications, but that’s even worse. Even if it weren’t just to fix a small wart that people have been dealing with for decades, it would be too much. That’s why I’m -1 on it.
But adding it in early Python would have been very little work. Just add one more builtin type, and change a dozen or so builtin and string module functions, and you’re done. And it would be very little complexity, too. Yes, it would mean one more builtin type, but nothing else about it is complex. The Smalltalk guys who were advertising that their whole language fits on an index card could handle chars. And it’s not like it would be an unprecedented innovation—most languages that existed at the time, from Lisp to Perl, had strings as sequences of either chars or integers (or, as with C, of chars but char is just a type of integer). If anything, it was ABC that was innovative for making strings out of strings (although Tcl and BASIC also do); it just turned out to collide with other innovations that Python got over the next couple of years, in ways nobody could have imagined.
And the hard bit would be describing what the Python 2.x language and ecosystem would look like to Guido so you could explain why it would matter. The idea that his language would have an “iterable protocol” that could be implemented not just by lists and strings but also by user-defined types using special methods, and automatically by functions using a special coroutine semantics, and also by some syntax borrowed from Haskell, and that was checkable via an abstract base class using implicit structural testing…

On Wed, Oct 23, 2019, at 19:00, Christopher Barker wrote:
There's lots of functionality that's on str that, if I were designing the language, I'd put on character. Character-type functions are definitely in - and, frankly, str.isnumeric is an attractive nuisance; it may well make sense to remove it from str and require explicit use of all(). upper/lower is tricky - cases like ß can change the length of a string... maybe put it on char but return a string? No reason not to allow + or *: concatenating chars to each other or to strings, or multiplying a char out into a string.

Since this is Python 4000, where everything's made up and the points don't matter... I think there shouldn't be a char type, and also strings shouldn't be iterable, or indexable by integers, or anything else that makes them appear to be tuples of code points.
Nothing good can come of decomposing strings into Unicode code points. The code point abstraction is practically as low level as the internal byte encoding of the strings. Only lexing libraries should look at strings at that level, and you should use a well written and tested lexing library, not a hacky hand-coded lexer.
Someone in this thread mentioned that they'd used ' '.join on a string in production code. Was the intent really to put a space between every pair of code points of an arbitrary string? Or did they know that only certain code points would appear in that string? A convenient way of splitting strings into more meaningful character units would make the actual intent clear in the code, and it would allow for runtime testing of the programmer's assumptions.
Explicit access to code points should be ugly – s.__codepoints__, maybe. And that should be a sequence of integers, not strings like "́".
it’s probably worth at least considering making UTF-8 strings first-class objects. They can’t be randomly accessed,
They can be randomly accessed by abstract indices: objects that look similar to ints from C code, but that have no extractable integer value in Python code, so that they're independent of the underlying string representation. They can't be randomly accessed by code point index, but there's no reason you should ever want to randomly access a string by a code point index. It's a completely meaningless operation.

On Oct 25, 2019, at 20:44, Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
Nothing good can come of decomposing strings into Unicode code points.
Which is why Unicode defines extended grapheme clusters, which are the intended approximation of human-language characters, and which can be made up of multiple code points. You don’t have to worry about combining marks and code units and normalization, and so on. And it is definitely possible to design a language that deals with clusters efficiently. Swift does it, for example. And they treat code units as no more fundamental than the underlying code points. (And, as you’d expect, code points and UTF-8/16/32 code units are appropriately-sized integers, not chars.)
In fact, Go treats the code units as _less_ fundamental: all strings are stored as UTF-8, so you can access the code units by just casting to a byte[], but if you want to iterate the code points (which are integers), you have to import functions for that, or encode to a UTF-32 not-a-string byte[], or use some special-case magic sugar. And Rust is essentially the same (but with more low-level stuff to write your own magic sugar, instead of hardcoded magic).
Only lexing libraries should look at strings at that level, and you should use a well written and tested lexing library, not a hacky hand-coded lexer.
OK, but how do you write that lexer? Most people should just get it off PyPI, but someone has to write it and put it on PyPI, and it has to have access to either grapheme clusters, or code units, or code points, or there’s no way it can lex anything. Unless you’re suggesting that the lexing library needs to be written in C (and then you’ll need different ones for Jython, etc.)?
Explicit access to code points should be ugly – s.__codepoints__, maybe.
Sure, but as you argued, code points are almost never what you want. And clusters don’t have a fixed size, or integer indices; how would you represent them except as a char type?
You certainly can design a more complicated iteration/indexing protocol that handles this—C++, Swift, and Rust have exactly that, and even Python text files with seek and tell are roughly equivalent—but it’s explicitly not random access. You can only jump to a position you’ve already iterated to. For example, in Swift, you can’t write `s[..<20]`, but if you write `s.firstIndex(of: ",")`, what you get back isn’t a number, it’s a String.Index, and you can use that in `s[..<commaIndex]`. And a String.Index is not a random-access index, it’s only a bidirectional index—you can get the next or previous value, but you can’t get the Nth value (except in linear time, by calling next N times). Of course usually you don’t want to search for commas, you want to parse CSV or JSON or something so you don’t even care about this. But when you’re writing that CSV or JSON or whatever module, you do.
They can't be randomly accessed by code point index, but there's no reason you should ever want to randomly access a string by a code point index. It's a completely meaningless operation.
Yes, that’s my argument for why it’s likely acceptable that they can’t be randomly accessed, which I believe was the rest of the sentence that you cut off in the middle. However, I think that goes a _tiny_ bit too far. You _usually_ don’t want to randomly access a string, but not _never_.
Let’s say you’re at the REPL, and you’re doing some exploratory hacking on a big hunk of text. Being able to use the result of that str.find from a few lines earlier in a slice is often handy. So it has to be something you can read and type easily. And it’s hard to get easier to type than an int. And notice that this is exactly how seek and tell work on text files. I don’t think the benefit of being able to avoid copy-pasting some ugly thing or repeating the find outweighs the benefit of not having to think in code units—but it is still a nonzero benefit. And the fact that Swift can’t do this in its quasi-REPL is sometimes a pain.
Also, Rust lets you randomly access strings by UTF-8 byte position and then ask for the next character boundary from there, which is handy for hand-optimizing searches without having to cast things back and forth to [u8]. But again, I don’t think that benefit outweighs the benefit of not having to think in either code units or code points.
Anyway, once you get rid of the ability to randomly access strings by code point, this means you don’t need to store strings as UTF-32 (or as Python’s clever UTF-8-or-16-or-32). When you read a UTF-8 text file (which is most text files you read nowadays), its buffer can already be the internal storage for a string. In fact, you can even mmap a UTF-8 text file and treat it as a Unicode string. (See RipGrep, which uses Rust’s ability to do this to make regex searching large files both faster and simpler at the same time, if it’s not obvious why this is nice.)
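The seek/tell comparison can be seen in today's Python: tell() on a text file returns an opaque integer cookie whose only supported use is to pass it back to seek() (a small sketch; the temporary file is just for illustration):

    import os, tempfile

    with tempfile.NamedTemporaryFile('w', encoding='utf-8',
                                     suffix='.txt', delete=False) as f:
        f.write('abÇÐεф\n')
        path = f.name

    with open(path, encoding='utf-8') as f:
        f.read(3)               # read three characters: 'abÇ'
        cookie = f.tell()       # an int, but not a character index in general
        f.seek(cookie)          # the only thing the cookie is good for
        print(f.read())         # 'Ðεф\n'

    os.remove(path)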

On Fri, Oct 25, 2019 at 08:44:17PM -0700, Ben Rudiak-Gould wrote:
Nothing good can come of decomposing strings into Unicode code points.
Sure there is. In Python, it's the fastest way to calculate the digit sum of an integer. It's also useful for implementing classical encryption algorithms, like Playfair.
Introspection, e.g. if I want to know if a string contains any surrogates, I can do this:

    any('\uD800' <= c <= '\uDFFF' for c in s)

Or perhaps I want to know if the string contains any "astral characters", in which case they aren't safe to pass to a Javascript or Tcl script which doesn't handle them correctly:

    any(c > '\uFFFF' for c in s)

How about education? One of the things I can do with strings is:

    for c in string:
        print(unicodedata.name(c))

or possibly even just

    # what is that weird symbol in position five?
    print(unicodedata.name(string[5]))

to find out what that weird character is called, so I can look it up and find out what it means. Knowing stuff is good, right? Or do you think the world would be better off if it was really hard and "ugly" (your word) for people like me to find out what code points are called and what their meaning is?
Rather than just telling us that we shouldn't be allowed to access code points in strings, would you please be explicit about *why* this access is a bad thing? And if code points are "bad", then what should we be allowed to do with strings? If code points are too low level, then what is an appropriate level? I guess you're probably going to mention grapheme clusters. (If you aren't, then I have no idea what your objection is based on.)
Grapheme clusters are a hard problem to solve, since they are dependent on the language and the locale. There's a Unicode algorithm for splitting on graphemes, but it ignores the locale differences. Processing on graphemes is more expensive than on code points. There is, as far as I can tell, no O(1) access to graphemes in a string without pre-processing them and keeping a list of their indices.
For many people, and for many purposes, paying that extra cost in either time or memory is just a total waste, since they're hardly ever going to come across a grapheme cluster. Few people have to process completely arbitrary strings: their data tends to come from a particular subset of natural language strings, and for some such languages, you might go a whole lifetime without coming across a grapheme cluster of more than one code point. (This may be slowly changing, even for American English, driven in part by the use of emoji and variation selectors.)
If Python came with a grapheme processing API, I would probably use it. But in the meantime, the code point API is "good enough" for most things I do with strings. And for the rest, graphemes are too low-level: I need things like sentences, clauses, words, word stems, prefixes and suffixes, syllables etc. But even if Python had an excellent, fast grapheme API, I would still want a nice, clean, fast interface that operates on code points.
-- Steven
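For the record, the digit-sum trick referred to above is just a one-liner like this (nothing official about it):

    def digit_sum(n):
        # Iterating the decimal string beats a Python-level divmod loop.
        return sum(map(int, str(abs(n))))

    assert digit_sum(98765) == 35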

On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:
I’m pretty sure almost nobody wants a 3.0-like break again, so this will probably never happen.
Indeed, and Guido did rule some time ago that 4.0 would be an ordinary transition, like 3.7 to 3.8, not a big backwards breaking version change. I've taken up referring to some hypothetical future 3.0-like version as Python 5000 (not 4000) in analogy to Python 3000, but to emphasise just how far away it will be.
It’s probably worth at least considering making UTF-8 strings first-class objects. They can’t be randomly accessed, but with an iterable-plus API like files, with seek/tell, or a new more powerful iterable API like Swift or C++, a lot of languages have found that to be a useful trade off anyway.
I don't see why you can't make arrays of UTF-8 indexable and provide random access to any code point. I understand that ``str`` in Micropython is implemented that way. The obvious implementation means that you lose O(1) indexing (to reach the N-th code point, you have to count from the beginning each time) but save memory over other encodings. (At worst, a code point in UTF-8 takes three bytes, compared to four in UTF-16 or UTF-32.) There are ways to get back O(1) indexing, but they cost more memory.
But why would you want an explicit UTF-8 string object? What benefit do you get from exposing the fact that the implementation happens to be UTF-8 rather than something else? (Not rhetorical questions.)
If the UTF-8 object operates on the basis of Unicode code points, then it's just a str, and the implementation is just an implementation detail.
If the UTF-8 object operates on the basis of raw bytes, with no protection against malformed UTF-8 (e.g. allowing you to insert stray 0x80-0xFF bytes where they don't form valid UTF-8, or by splitting apart a two- or three-byte UTF-8 sequence), then it's just a bytes object (or bytearray) initialised with a UTF-8 sequence. That is, as I understand it, what languages like Go do. To paraphrase, they offer data types they *call* UTF-8 strings, except that they can contain arbitrary bytes and be invalid UTF-8. We can already do this, today, without the deeply misleading name: string.encode('utf-8') and then work with the bytes. I think this is even quite efficient in CPython's "Flexible string representation". For ASCII-only strings, the UTF-8 encoding uses the same storage as the original ASCII bytes. For others, the UTF-8 representation is cached for later use.
So I don't see any advantage to this UTF-8 object. If the API works on code points, then it's just an implementation detail of str; if the API works on code units, that's just a fancy name for bytes. We already have both str and bytes so what is the purpose of this utf8 object?
-- Steven

On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano <steve@pearwood.info> wrote:
(At worst, a code-point in UTF-8 takes three bytes, compared to four in UTF-16 or UTF-32.)
http://www.fileformat.info/info/unicode/char/10000/index.htm

On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
Oops, you're right, UTF-8 can use four code units (four bytes) too, I forgot about that. Thanks for the correction. So in the worst case, if your string consists of all (let's say) Linear-B syllables, UTF-8 will use four bytes per character, the same as UTF-32. But for strings consisting of a mix of (say) ASCII, Latin-1, etc with only a few Linear-B syllables, UTF-8 will use a lot less memory. -- Steven

On Sat, Oct 26, 2019, 7:58 PM Steven D'Aprano <steve@pearwood.info> wrote:
Absolutely, utf-8 is a wonderful encoding. And indeed, the worst case is the same storage requirement as utf-16 or utf-32. For O(1) random access into all strings, we have to eat 32 bits per character, one way or the other, but of course there are space/speed trade-offs one could make for intermediate behavior.

On Sat, Oct 26, 2019, at 20:26, David Mertz wrote:
A string representation consisting of (say) a UTF-8 string, plus an auxiliary list of byte indices of, say, 256-codepoint-long chunks [along with perhaps a flag to say whether the chunk is all-ASCII or not] would provide O(1) random access, though, of course, despite both being O(1), "single index access" vs "single index access then either another index access or up to 256 iterate-forward operations" aren't *really* the same speed.
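A rough sketch of that scheme (all names made up; no claim that any implementation actually works this way): UTF-8 bytes plus the byte offset of every 256th code point, so indexing is one table lookup plus at most 255 forward steps.

    class ChunkedUTF8:
        CHUNK = 256

        def __init__(self, data: bytes):
            self._data = data
            self._offsets = [0]        # byte offsets of code points 0, 256, 512, ...
            count = 0
            for i, b in enumerate(data):
                if b & 0xC0 != 0x80:   # not a continuation byte: a code point starts here
                    if count and count % self.CHUNK == 0:
                        self._offsets.append(i)
                    count += 1
            self._len = count

        def __len__(self):
            return self._len

        def __getitem__(self, index):
            if not 0 <= index < self._len:
                raise IndexError(index)
            data = self._data
            pos = self._offsets[index // self.CHUNK]
            for _ in range(index % self.CHUNK):      # at most CHUNK - 1 forward steps
                pos += 1
                while data[pos] & 0xC0 == 0x80:
                    pos += 1
            end = pos + 1
            while end < len(data) and data[end] & 0xC0 == 0x80:
                end += 1
            return data[pos:end].decode('utf-8')

    s = ChunkedUTF8('abÇÐεф'.encode('utf-8'))
    assert len(s) == 6 and s[5] == 'ф'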

On Sat, Oct 26, 2019, 11:02 PM Random832 <random832@fastmail.com> wrote:
Ok, true enough that dereferencing and limited linear search is still O(1). I could have phrased that slightly more precisely. But the trade-off part is true. Indexing into character 1 million of a utf-32 string is just one memory offset calculation, then following the reference. Indexing into the utf-8-with-offset-list is a couple of dereferences, and on average 128 sequential scans. So it's not worse big-O, but it's around 100x slower... Still a lot faster than sequential scan of all 1 million though.
What does actual CPython do currently to find that s[1_000_000], assuming utf-8 internal representation?

On Sun, Oct 27, 2019, 12:19 AM Chris Angelico <rosuav@gmail.com> wrote:
PEP 393: "The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes)" ...
Ah, OK. I get it. The one-byte representation is only ASCII, which happens to match utf-8 (well, modulo the latin-1 oddness). But the internal representation is utf-16 or utf-32 if the string contains code points requiring a multi-byte representation.

On Sat, Oct 26, 2019 at 11:34:34PM -0400, David Mertz wrote:
What does actual CPython do currently to find that s[1_000_000], assuming utf-8 internal representation?
CPython doesn't use a UTF-8 internal representation. MicroPython *may*, but I don't know if they do anything fancy to avoid O(N) indexing. IronPython and Jython use whatever .Net and Java use.
CPython uses a custom implementation, the Flexible String Representation, which picks the smallest code unit size required to store all the characters in the string:

    # Pseudo-code
    c = max(string)  # highest code point
    if c <= '\xFF':
        # effectively ASCII or Latin-1
        use one byte per code point
    elif c <= '\uFFFF':
        # effectively UCS-2, or UTF-16 without the surrogate pairs
        use two bytes per code point
    else:
        assert c <= '\U0010FFFF'
        # effectively UCS-4, or UTF-32
        use four bytes per code point

-- Steven
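A quick way to watch the Flexible String Representation choose its code unit size (CPython-specific; the byte counts in the comments are from one 64-bit build and will vary):

    import sys

    ascii_s  = 'a' * 1000                # 1 byte per code point
    bmp_s    = '\u0394' * 1000           # GREEK CAPITAL LETTER DELTA: 2 bytes each
    astral_s = '\U00010000' * 1000       # LINEAR B SYLLABLE B008 A: 4 bytes each

    print(sys.getsizeof(ascii_s))        # roughly 1000 * 1 byte + a fixed header
    print(sys.getsizeof(bmp_s))          # roughly 1000 * 2 bytes + a fixed header
    print(sys.getsizeof(astral_s))       # roughly 1000 * 4 bytes + a fixed header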

On Oct 26, 2019, at 21:33, Steven D'Aprano <steve@pearwood.info> wrote:
IronPython and Jython use whatever .Net and Java use.
Which makes them sequences of UTF-16 code units, not code points. Which is allowed for the Python 2.x unicode type, but would violate the rules for 3.x str, but neither one has a 3.x. If you want to deal with code points, you have to handle surrogates manually. (Actually, IIRC, one of the two has a str type that, despite being 2.x, is unicode rather than bytes, but with some extra undocumented functionality to smuggle bytes around in a str and have it sometimes work.)

On Sun, Oct 27, 2019, at 03:39, Andrew Barnert via Python-ideas wrote:
I do like the way GNU Emacs represents strings - abstractly, a string can contain any character, or any byte > 127 distinct from a character. Concretely, IIRC they are represented either as pure byte strings or as UTF-8 with "bytes > 127" represented as the extended UTF-8 representations of code points 0x3FFF80 through 0x3FFFFF [values between 0x110000 and 0x3FFF7F are used for other purposes].

On Oct 26, 2019, at 19:59, Random832 <random832@fastmail.com> wrote:
A string representation consisting of (say) a UTF-8 string, plus an auxiliary list of byte indices of, say, 256-codepoint-long chunks [along with perhaps a flag to say whether the chunk is all-ASCII or not] would provide O(1) random access, though, of course, despite both being O(1), "single index access" vs "single index access then either another index access or up to 256 iterate-forward operations" aren't *really* the same speed.
Yes, but that means constructing a string takes linear time, because you have to construct that index. You can’t just take a read/recv/mmap/result of a C library/whatever and use it as a string without doing linear work on it first. And you have to do that on _every_ string, even though you only need the index on a small percentage of them. (Unless you can statically look ahead at the code and prove that a string will never be indexed—which a Haskell compiler can do, but I don’t think it’s remotely feasible for a language like Python.) If you redesign your find, re.search, etc. APIs to not return character indexes, then I think you can get away with not having character-indexable strings. On the rare occasions where you need it, construct a tuple of chars. If that isn’t good enough, you can easily write a custom object that wraps a string and an index list together that acts like a string and a sequence of chars at the same time. There’s no need for the string type itself to do that.

On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas wrote:
If string.index(c) doesn't return the index of c in string, then what does it return? I think you are conflating the public API based on characters (to be precise: code points) with some underlying implementation based on bytes.
Given zero-based indexing, and the string "abÇÐεф", the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10 (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door with a pitchfork and a flaming torch *wink* And returning <AbstractIndex object at 0xb7ce1bf0> is even worse.
Strings might not be implemented as an array of characters. They could be a rope, a linked list, a piece table, a gap buffer, or something else. The public API which operates on code points should not depend on the implementation. Regardless of how your string is implemented, it is conceptually a sequential array of N code points indexed from 0 to N-1.
-- Steven
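For the record, the gap between the character index and the various byte offsets is easy to check in today's Python (assuming the string above is in its precomposed, NFC form):

    s = "abÇÐεф"
    target = "ф"
    assert s.index(target) == 5                                            # code point index
    assert s.encode('utf-8').index(target.encode('utf-8')) == 8            # UTF-8 byte offset
    assert s.encode('utf-16-le').index(target.encode('utf-16-le')) == 10   # UTF-16 byte offset
    assert s.encode('utf-32-le').index(target.encode('utf-32-le')) == 20   # UTF-32 byte offset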

On Sun, Oct 27, 2019 at 11:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
And in response to the notion that you don't actually need the index, just a position marker... consider this:

      File "/home/rosuav/tmp/demo.py", line 1
        print("Hello, world!')
                             ^
    SyntaxError: EOL while scanning string literal

Well, either that, or we need to make it so that " "*<AbstractIndex object at 0xb7ce1bf0> results in the correct number of spaces to indent it to that position. That ought to bring in plenty of pitchforks...
ChrisA

On Oct 27, 2019, at 05:49, Chris Angelico <rosuav@gmail.com> wrote:
So if those 12 glyphs take 14 code units because you’re using Stephen’s string and it’s in NFKD, getting 14 and then indenting two spaces too many (as Python does today) is not just a good-enough best effort, but something we actually want to ensure at all costs by making sure you always deal in code unit indexes?
Would you still bring pitchforks for " " * StrIndex(chars=12, points=14, bytes=22)? If so, then you require code to spell it as " " * index.chars instead of " " * index. It’s not like the namedtuple/structseq/dataclass/etc. repr is some innovative new idea nobody’s ever thought of to get a useful display, or like people can’t figure out how to get the index out of a regex match object. This is all simple stuff; I don’t get the incredulity that it could possibly be done. (Especially given that there are other languages that do exactly the same thing, like Swift, which ought to be proof that it’s not impossible.) (Could it be done without breaking a whole ton of existing code? I strongly doubt it. But my whole argument for why we shouldn’t be trying to “fix” strings in “Python 4000” in the first place is that the right fix probably cannot be done in a way that’s remotely feasible for backward compatibility. So I hope you wouldn’t expect that something additional that I suggested could be considered only if that unfeasible fix were implemented would itself be feasible.)

On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas wrote:
So if those 12 glyphs take 14 code units
*scratches head* I'm not really sure how glyphs (the graphical representation of a character) come into this discussion, but for what it's worth, I count 22, not 12 (excluding the leading spaces). I'm not really sure how you get "14 code units" either, since whatever internal representation you use (ASCII, Latin-1, UTF-8, UTF-16, UTF-32) the string will be one code unit per entity, whether we are counting code points, characters or glyphs, since it is all ASCII. I don't know of any encoding where ASCII characters require more than one code unit.
because you’re using Stephen’s string and it’s in NFKD, getting 14 and then indenting two spaces too many (as Python does today)
You mean something like this?

    py> value = äë +* 42
      File "<stdin>", line 1
        value = äë +* 42
                       ^
    SyntaxError: invalid syntax

(the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')
Yes, that looks like a bug to me, but a super low priority one to fix. (This is assuming that the Python interpreter promises to line the caret up with the offending symbol "always", rather than just making a best effort to do so.)
And probably tough to fix too: I think you need to count in grapheme clusters, not code points, but even that won't fix the issue since it leaves you open to the *opposite* problem of undercounting if the terminal or IDE fails to display combining characters properly:

    value = a¨e¨ +* 42
               ^
    SyntaxError: invalid syntax

I had to fake the above, because I couldn't find a terminal on my system which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.
It's not just a problem with combining characters. If I recall correctly the Unicode standard doesn't require variant selectors to be displayed as a single glyph. So you might not know how wide a grapheme cluster is unless you know the capabilities of the application displaying it. Handling text in its full generality, including combining characters, emojis, flags, East Asian wide characters, etc, is really tough to do right. For the Python interpreter, it would require a huge amount of extra work for barely any payoff since 99.9% of Python syntax errors are not going to include any of the funny cases.
As I think I said earlier, if Python had an API that understood grapheme clusters, I would probably use it in preference to the code point API for most text handling code. But let's not make the perfect the enemy of the good: if you have a line of source code which contains flags, Asian wide characters, combining accents, emoji selectors etc and the caret doesn't quite line up in the right place, oh well, que sera sera.
[...]
Would you still bring pitchforks for " " * StrIndex(chars=12, points=14, bytes=22)?
Hell yes. If I need 12 spaces, why should I be forced to work out how many bytes the interpreter uses for that? Why should I care? I want 12 spaces, I don't care if that's 12 bytes or 24 or 48 or 7. I might not even know what Python's internal encoding is. Many people don't. To say nothing of the obnoxiousness of forcing me to write 39 characters "StrIndex(...)" when two would do. What if I get it wrong, and think that 12 characters is 6 points and 42 bytes when it's actually 8 points and 46 bytes?
Working in code points is not perfect, but "code point == character" is still an acceptable approximation for most uses of Python. And as Unicode continues to gather more momentum, eventually we'll need more powerful, more abstract but also more complicated APIs. But forcing the coder to go from things they work with ("I want 12 smiley faces") to trying to work with the internal implementation is a recipe for chaos: "Each smiley is two code points, a base and a variant selector, but they're astral characters so I have to double it, so that's 48 code points, and each code point is four bytes, so that's 188 bytes, wait, do I include the BOM or not?"
Can you link to an explanation of what Swift *actually* does, in detail?
(Could it be done without breaking a whole ton of existing code? I strongly doubt it.
Of course it can be: we leave the code-point string API alone, as it is now, and build a second API based on grapheme clusters, emoji variants etc to run in parallel. This allows people to transition from one to the other if and when they need to, rather than forcing them to pay the cost of working in graphemes whether they need it or not. -- Steven

On Oct 27, 2019, at 18:00, Steven D'Aprano <steve@pearwood.info> wrote:
I'm not really sure how glyphs (the graphical representation of a character) come into this discussion
Because, assuming you're using a monospace font, the number of glyphs up to the error is how many spaces you need to indent. This example happens to be pure ASCII, so the count of glyphs, extended grapheme clusters, code units, and code points happens to be the same. But just change that e to an è made of two code units (a base plus a combining mark)—like the Ç in your previous example might have been—and now there are still the same number of glyphs and clusters, but one more code point and one more code unit. Extended grapheme clusters are intended to be the best approximation of “characters” in the Unicode standard. Code units are not.
but for what it's worth, I count 22, not 12 (excluding the leading spaces).
Sorry; that was a typo. Plus, I miscounted on top of the typo; I meant to count the spaces.
Yes, that looks like a bug to me, but a super low priority one to fix.
Yes. This is a general bug: everywhere that you count code units intending to use that as a count of glyphs or characters, both in Python itself and in third-party libraries and in applications. This is one of the most trivial examples, and you obviously wouldn’t break backward compatibility with everything solely to fix this example.
And I don’t know why I have to keep repeating this, but one more time: I’m not proposing to change Python, I’m arguing to _not_ change Python, because it’s already good enough, and the suggested improvement wouldn’t make it right despite breaking lots of code, and making it right is a big thing that would break even more code. If I were designing a new language, I would do it right from the start, and it would not have this bug, or any of the other manifestations of the same issue, but Python 4000 (or even 5000) is not an opportunity to design a new language.
(And to be clear: Python’s design made perfect sense when it was chosen; Unicode has just gotten more complicated since then. In fact, most other languages that adopted Unicode as early as Python got permanently stuck with the UCS-2 assumption, forcing all user code to deal with UTF-16 code units forever.)
(This is assuming that the Python interpreter promises to line the caret up with the offending symbol "always", rather than just making a best effort to do so.)
Well, the reason I called it a good-enough best effort is because I assume that it’s only meant to be a best effort, and I think it’s good enough for that. I’m not the one who said people would be up in arms if that were broken, I’m the one arguing that people are fine with it being broken as long as it’s usually good enough.
And probably tough to fix too: I think you need to count in grapheme clusters, not code points,
Yes, that’s the whole point of the message you were responding to: extended grapheme clusters are the Unicode approximation of characters; code units are not. And a randomly-accessible sequence of grapheme clusters is impossible to do efficiently, but a sized iterable container, or a sequence-like thing that’s indexable by special indexes but not by integers, is. So tying the string type even more closely to code units would not fix it; changing the way it works as a Sequence would not fix it.
I had to fake the above, because I couldn't find a terminal on my system which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.
That’s a matter of working around broken editors, terminals, and IDEs—which do exist, but are uncommon, and getting less common. Not having a workaround for a broken editor that most people don’t use is not a bug in the same way as being broken in a properly-working environment is. (Not having a workaround for something broken that half the users in the world have to deal with, like Windows cmd.exe, would be a different story, of course. You can claim that it’s Windows’ bug, not yours, but that won’t make users happy. But I’m pretty sure that’s not an issue here.)
Obviously you wouldn’t redesign the whole text API just to make syntax error carets line up. You would do that to make thousands of different things easier to write correctly, and lining up those carets is just one of those things, and nowhere near the most important one.
If you know you need 12 spaces, you just multiply by 12; why do you think you need to work anything out? Adding str * StrIndex doesn’t require taking away str * int. Your example implied that you would be working out that count in some way—say, by calling str.find—and that you and many others would be horrified if that return value were not an integer, but you could multiply it by a string anyway. I don’t know why you see anything wrong with that, but I guessed that maybe it was because you couldn’t see, at the REPL, how many spaces you were multiplying. Having the thing that’s returned by str.find have the repr CharIndex(chars=12, points=14, bytes=22) instead of the generic repr would solve that. If that isn’t your problem with being able to multiply a str by a StrIndex, then I have no other guesses for what you think people would be raising pitchforks over.
The reference documentation for String starts at https://developer.apple.com/documentation/swift/string. (It should be the first thing that comes up in any search engine for Swift string.) You can follow the links from there to String.Index and String.Iterator, and from either of those to BidirectionalCollection, and from there to Collection, which explains how indexing works in general. There’s probably an easier-to-understand description at https://docs.swift.org/swift-book but it may not explain *exactly* what it does, because it’s meant as a user guide.
Two things that may be confusing: Swift uses the exact same words as Python for its iteration/etc. protocols but all with different meanings (e.g., a Swift Sequence is a Python Iterable; a Python Sequence is a Swift IndexableCollection; etc.), and Swift makes heavy use of static typing (e.g., just as there are no separate display literals for Array, Set, etc., there are no separate display literals for Character and String; the literal "x" is a Character if you store it in a Character lvalue, a len-1 String if you store it in a String, and a TypeError if you store it in a Double).

I think that we're more or less in broad agreement, but I wanted to comment on this: On Sun, Oct 27, 2019 at 09:41:00PM -0700, Andrew Barnert wrote:
Extended grapheme clusters are intended to be the best approximation of “characters” in the Unicode standard. Code units are not.
I don't think that's quite correct. See:
http://www.unicode.org/glossary/#abstract_character
http://www.unicode.org/glossary/#character
http://www.unicode.org/glossary/#extended_grapheme_cluster
http://www.unicode.org/glossary/#code_point
From the glossary definition of code point: "A value, or position, for a character, in any coded character set." In other words, the code point is a numeric code such as U+0041 that represents a character such as "A". (Except when it is a numeric code that represents a non-character.)
And from definitions D60 and D61 here: http://www.unicode.org/versions/Unicode12.1.0/ch03.pdf
"Grapheme clusters and extended grapheme clusters may not have any particular linguistic significance"
"The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean SYLLABLE) together with any number of nonspacing marks applied to it." [Emphasis added.]
"A grapheme cluster is similar, but not identical to a combining character sequence."
So it is much more complicated than just "code point != character, extended grapheme cluster = character". Lots of code points are characters; lots of graphemes aren't characters but syllables or some other linguistic entity, or no linguistic entity at all; and lots of things that are characters aren't graphemes, such as combining character sequences. And none of this mentions what to do with variation selectors, flags etc.
The whole thing is very complicated and I don't pretend to understand all the details. (Until now, I thought that combining character sequences were grapheme clusters. Apparently they aren't.)
-- Steven

I think you are conflating the public API based on characters (to be precise: code points) with some underlying implementation based on bytes.
No, what I’m doing is avoiding conflating the public API based on characters with the underlying representation based on code points, treating them as no more fundamental than the code units. You can still iterate the code points if you want to, because that’s occasionally useful. And you can also iterate the UTF-8 code units, because that’s also occasionally useful.
The index of "ф" better damn well be 5 rather than 8 (UTF-8), 10 (UTF-16) or 20 (UTF-32)
Really? Even if the string is in NFKD, as it would be if this were, say, the name of a file on a standard Mac file system? Then that Ç character is stored as the code unit U+0043 followed by the code unit U+0327, rather than the single unit U+00C7. So had it still better be 5, not 6? If so, Python 3 is broken, and always has been; where’s your pitchfork? And what were you going to do with that 5 anyway that it has to be an integer? Without a use case, you’re just demanding infinite flexibility regardless of what the cost might be.
You _could_ make this work by building a grapheme cluster index at construction time for every string, or by storing strings as an array of grapheme clusters that are themselves arrays of code points rather than as a flat array, or by normalizing every string at construction time. But do you actually want to do any of those things; or is guaranteeing 5 rather than 6 there not worth the cost?
Also, have you ever used seek and tell on a text file? What do you think tell gives you? According to the language spec, it could be anything and you have to treat it as an abstract index; I think in current CPython it’s a byte index. Where’s your pitchfork there?
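The normalization point is easy to demonstrate in today's Python (using NFD here; which decomposed form any particular filesystem hands you is beside the point):

    import unicodedata

    nfc = unicodedata.normalize('NFC', "abÇÐεф")
    nfd = unicodedata.normalize('NFD', "abÇÐεф")
    print(nfc.index("ф"), len(nfc))   # 5 6
    print(nfd.index("ф"), len(nfd))   # 6 7 -- Ç is now 'C' + COMBINING CEDILLA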
And returning <AbstractIndex object at 0xb7ce1bf0> is even worse.
Why? That object can be used to index/slice/start finding at/etc. I suggested earlier that it could also have attributes that give you the integer character, code unit (byte), and, if you really want it, code point index. If you have a use for one of those, you use the one you have a use for. If not, why do you need it to be equal to any of those three integers, much less the least useful of them? If you’re just concerned about the REPL, then it can be <CharIndex(5) at 0xb7ce1bf0>, or even something eval-able like CharIndex(chars=5, units=6, bytes=10). Which isn’t as nice as a number I can just spot a few lines back and retype (as I mentioned before, this is occasionally an annoyance when dealing with Swift), but that’s a tradeoff that allows you to see the number 5 that you’re insisting you’d better be able to get even though you can’t actually use the number 5.
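As a rough sketch (the class name and attribute names here are invented, not a proposed API), such an index object could be as small as:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CharIndex:
        # One logical position, exposed in three different units.
        chars: int   # extended-grapheme-cluster offset
        units: int   # code-unit/code-point offset
        bytes: int   # UTF-8 byte offset

    idx = CharIndex(chars=5, units=6, bytes=10)
    print(idx)         # CharIndex(chars=5, units=6, bytes=10) -- eval-able repr
    print(idx.chars)   # 5, for the rare code that really does want an integer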
If you want a public API that’s independent of implementation, where a string could be a linked list, then you want a public API that doesn’t include indexing. If your language comes with fundamental builtin types where the [] operator takes linear time, then your language doesn’t have a [] operator like Python’s, or C++’s or most other languages with the same syntax; it has something that looks misleadingly like [] in other languages but has to be used differently.

On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote:
constructing a string already takes linear time because you have to copy it into memory managed by the python garbage collector. And you can track whether you'll need the index in one pass while copying, rather than, as currently, having to do one pass to pick a representation and another to actually perform the copying and conversion, so my suggestion may have a cache locality advantage over the other way.

On Nov 2, 2019, at 20:33, Random832 <random832@fastmail.com> wrote:
Not necessarily. There are certainly _some_ strings that come into the interpreter (or extension module) as externally-allocated things that have to be copied. But not all, or even most, strings. For things like reading a file or recving from a socket, you allocate a buffer which is managed by your GC, and the string gets placed there, so there’s no need to copy it. When you mmap a file, you know the lifetime of the string is the lifetime of the mmap, so you don’t track it separately, much less copy it. And so on. Also, even when you do need to copy, a memcpy is a whole lot faster than a loop, even though they are both linear. Especially when that loop has additional operations (maybe even including a conditional that branches 80/20 or worse). But even without that, copying byte by byte, rather than by whatever chunks the CPU likes, can already be 16x as slow. Go often ends up copying strings unnecessarily, but the memcpy is still so much faster than the decode that Java/C#/Python/Ruby does that Go fanatics like to brag about their fast text handling (until you show them some Rust or Swift code that’s even faster as well as more readable…).
And you can track whether you'll need the index in one pass while copying, rather than, as currently, having to do one pass to pick a representation and another to actually perform the copying and conversion, so my suggestion may have a cache locality advantage over the other way.
Sure, the existing implementation of building strings is slow, and that’s what keeping strings in UTF-8 is intended to solve, and if your suggestion makes it take 1/4th as long (which seems possible, but obviously it’s just a number I pulled out of thin air), that’s nice—but nowhere near as nice as completely eliminating that cost. And most strings, you never need to randomly access (or only need to randomly access because other parts of the API, like str.find and re.search, make you), so why should you pay any cost, even if it’s only 1/4th the cost you pay in Python 3.8? (Also, for some random-access uses, it really is going to be faster to just decode to UTF-32 and subscript that; why build an index plus decoding when you can just decode?) If you’re already making a radical breaking change, why not get the radical benefits? Also, consider this: if str is unindexed and non-subscriptable, it’s trivial to build a class IndexedStr(str) whose __new__ builds the index (or copies it if passed an IndexedStr) and that adds __getitem__, while still acting as a str even at the C API level. Whenever you need random access, you construct an IndexedStr, the rest of the time you don’t bother. And you can even create special-purpose variants for special strings (I know this is always ASCII, or I know it’s always under 16 chars…) or specific use cases (I know I’m going to iterate backward, so repeatedly counting forward from index[idx%16] would be hugely wasteful). But if str builds an index, there’s no way to write a class FastStr(str) that skips that, or any of the variants that does it differently.
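A rough sketch of that subclassing pattern, under today's Python rather than the hypothetical unindexed str (names invented; the cluster detection is a crude base-plus-combining-marks approximation, not real UAX #29 segmentation):

    import unicodedata

    class IndexedStr(str):
        # Pay the indexing cost once, in __new__, so later lookups are O(1).
        def __new__(cls, value=""):
            self = super().__new__(cls, value)
            self._cluster_starts = [
                i for i, ch in enumerate(self)
                if i == 0 or not unicodedata.combining(ch)
            ]
            return self

        def cluster(self, n):
            # n-th (approximate) grapheme cluster, via the precomputed offsets.
            starts = self._cluster_starts
            end = starts[n + 1] if n + 1 < len(starts) else len(self)
            return self[starts[n]:end]

    s = IndexedStr("abC\u0327d")        # "abÇd" with a decomposed Ç
    print(s.cluster(2), len(s))         # Ç 5 -- four clusters, five code points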

On Oct 26, 2019, at 16:28, Steven D'Aprano <steve@pearwood.info> wrote:
That _could_ change, especially if 3.9 is followed by 3.10 (or has that already been rejected?). But I think almost everyone agrees with Guido, and that’ll probably be true until the memory of 2.7 fades (a few years after Apple stops shipping it and the last Linux distros go out of LTS). I guess your 5000 implies about 16 years off, so… ok. But at that point, it makes as much sense to talk about a hypothetical new Python-like language.
Most of the time, you really don’t need random access to strings—except in the case where you got that integer index back from the find method or a regex match object or something, in which case using Swift-style non-integer indexes, or Rust-style (and Python file object seek/tell) byte offsets, solves the problem just as well. But when you do want it, it’s very likely you don’t want it to take linear time. Providing indexing, but having it be unacceptably slow for anything but small strings, isn’t providing a useful feature, it’s providing a cruel tease. Logarithmic time is probably acceptable, but building that index takes linear time, so now constructing strings becomes slow, which is even worse (especially since it affects even strings you were never going to randomly access).
For novices who only deal with UTF-8, it might mean never having to call encode or decode again. But the real benefit is to enable low-level code (that in turn makes high-level code easier to write). Have you ever written code that mmaps a text file and processes it as text? You either have to treat it as bytes and not do proper Unicode (which works for some string operations—until the first time you get some data where it doesn’t), or implement all the Unicode algorithms yourself (especially fun if what you’re trying to do is, say, a regex search), or put a buffer in front of it and decode on the fly, defeating the whole point of mmap. Have you ever read an HTTP header as bytes to verify that it’s UTF-8 and then tried to switch to using the same socket connection as a text file object rather than binary? It’s doable, but it’s a pain. And the reason all of this is a pain is that when Python (and Java and Ruby and so on) added Unicode support, the idea of assuming most files and protocols and streams are UTF-8 was ridiculous. Making UTF-8 a little easier to deal with by making everything else either slower or harder to deal with was a terrible trade off then. But in 2019—much less in Python 5000-land—that’s no longer true.
If the UTF-8 object operates on the basis of Unicode code points, then it’s just a str, and the implementation is just an implementation detail.
Ideally, it can iterate any of code units (bytes), code points, or grapheme clusters, not just one. Because they’re all useful at different times. But most of the string methods would be in terms of grapheme clusters.
What’s this about inserting bytes? I’m not suggesting making strings mutable; that’s insane even for 5.0. :) Anyway, it’s just a bytes object with all of the string methods, and that duck types as a string for all third-party string functions and so on, which is a lot different than “just a bytes object”. But a much better way to see it is that it’s a str object that also offers direct access to its UTF-8 bytes. Which you don’t usually need, but it is sometimes useful. And it would be more useful if things like sockets and pipes and so on had UTF-8 modes where they could just send UTF-8 strings, without you having to manually wrap them in a TextIOWrapper with non-default args first. This would require lots of changes to the stdlib and to tons of existing third-party code, to the extent that I’m not sure even “Python 5000” makes it ok, but for a new Python-inspired language, that’s a different story…
We had to decode it from UTF-8 and encode it back. Sure, it gets cached so we don’t have to keep doing that over and over. But leaving it as UTF-8 in the first place means we don’t have to do it at all. Of course this is only true if the source literal or text file or API or network protocol or whatever was encoded in UTF-8. But most of them are. (For the rest, yes, we still have to decode from UTF-16-LE or Shift-JIS or cp1252 or whatever and re-encode as UTF-8—albeit with a minor shortcut for the first example. But that’s no worse than today, and it’s getting less common all the time anyway.)
Since we’re now talking 5000 rather than 4000, this could replace str rather than be in addition to it. And it would also replace many uses of bytes. People would still need bytes when they want a raw buffer of something that isn’t text, and when they want a buffer of something that’s not known to be UTF-8 (like the HTTP example–you start with bytes, then switch to utf8 once you know the encoding is utf8 or stick a stream decoder in front of it if it turns out not to be), but when you want a buffer of encoded text, the string is the buffer.


On Mon, Oct 14, 2019 at 6:49 AM Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
And finally, if you want to break strings, it’s probably worth at least considering making UTF-8 strings first-class objects. They can’t be randomly accessed, but with an iterable-plus API like files, with seek/tell, or a new more powerful iterable API like Swift or C++, a lot of languages have found that to be a useful trade off anyway.
Breaking the str type to do this seems like a really REALLY bad idea, but if you want a first-class UTF8String, you can certainly have it. Build it on top of some sort of byte buffer (maybe bytearray rather than bytes) with a whole lot of handy methods, and there you are. ChrisA
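A toy sketch of that direction (the class name and methods are invented here, and it is nowhere near a complete API): keep the storage as validated UTF-8 bytes and build string-like behavior on top of it.

    class UTF8String:
        __slots__ = ("_buf",)

        def __init__(self, data):
            if isinstance(data, str):
                data = data.encode("utf-8")
            bytes(data).decode("utf-8")   # validate: raises on malformed UTF-8
            self._buf = bytes(data)

        @property
        def utf8(self):
            return self._buf              # the encoded form *is* the storage

        def __iter__(self):
            # Iterate code points; decoding the whole buffer keeps the sketch short.
            return iter(self._buf.decode("utf-8"))

        def __contains__(self, sub):
            needle = sub.encode("utf-8") if isinstance(sub, str) else bytes(sub)
            # Plain byte search is safe here: a match of a whole encoded
            # sequence inside valid UTF-8 is always character-aligned.
            return needle in self._buf

    s = UTF8String("naïve")
    print("ï" in s, len(s.utf8))          # True 6 -- five characters, six bytes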

Yup. I think you're absolutely right. After I posted this, I had a better idea: https://mail.python.org/archives/list/python-ideas@python.org/thread/OVP6SIO...

On Sun, Oct 13, 2019 at 12:52 PM Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
I've thought for a long time that this would be a "good thing". The "string or sequence of strings" issue is pretty much the only hidden-bug-triggering type error I've gotten since "true division". The only way we really live with it fairly easily is that strings are pretty much never duck typed -- so I can check if I got a string, and then I know I didn't get a sequence of strings. But I've always wondered how disruptive it would be to add a char type -- it doesn't seem like it would be very disruptive, but I have not thought it through at all. And I'm not sure how much string functionality a char should have -- probably next to none, as the point is that it would be easy to distinguish from a string that happened to have one character. By the way, the bytes and bytearray types already do this -- index into or loop through a bytes object, you get an int. -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython
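The defensive check being described usually looks something like this (the helper name is invented for illustration):

    def open_all(filenames):
        # Guard against the classic bug: a lone str silently "works" as an
        # iterable of 1-character strings instead of a sequence of names.
        if isinstance(filenames, str):
            raise TypeError("expected an iterable of filenames, not a single str")
        return [open(name) for name in filenames]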

On Oct 23, 2019, at 16:00, Christopher Barker <pythonchb@gmail.com> wrote:
Well, just adding a char type (and presumably a way of defining char literals) wouldn’t be too disruptive. But changing str to iterate chars instead of strs, that probably would be. Also, you’d have to go through a lot of functions and decide what types they should take. For example, does str.join still accept a string instead of an iterable of strings? Does it accept other iterables of char too? (I have used ' '.join on a string in real life production code, even if I did feel guilty about it…) Can you pass a char to str.__contains__ or str.endswith? What about a tuple of chars? Or should we take the backward-compat breaking opportunity to eliminate the “str or tuple of str” thing and instead use *args, or at least change it to “str or iterable of str (which no longer includes str itself)”?
And I'm not sure how much string functionality a char should have -- probably next to none, as the point is that it would be easy to distinguish from a string that happened to have one character.
Surely you’d want to be able to do things like isdigit or swapcase. Even C has functions to do most of that kind of stuff on chars. But I think that, other than join and maybe encode and translate, there’s an obvious right answer for every str method and operator, so this isn’t too much of a problem. Speaking of operators, should char+int and char-int and char-char be legal? (What about char%int? A thousand students doing the rot13 assignment would rejoice, but allowing % without * and // is kind of weird, and allowing * and // even weirder—as well as potentially confusing with str*int being legal but meaning something very different.)
By the way, the bytes and bytearray types already do this -- index into or loop through a bytes object, you get an int.
Sure, but b'abc'.find(66) is -1, and b'abc'.replace(66, 70) is a TypeError, and so on. Fixing those inconsistencies is what I meant by “go all the way to making them sequences of ints”. But it might be friendlier to undo the changes and instead add a byte type like the char type for bytes to be a sequence of. I’m not sure which is better. But anyway, I think all of these questions are questions for a new language. If making str not iterate str was too big a change even for 3.0, how could it be reasonable for any future version?

There's a reason I've never actually proposed adding a char .... On Wed, Oct 23, 2019 at 5:34 PM Andrew Barnert <abarnert@yahoo.com> wrote:
Well, just adding a char type (and presumably a way of defining char literals) wouldn’t be too disruptive.
sure.
But changing str to iterate chars instead of strs, that probably would be.
And that would be the whole point -- a char type by itself isn't very useful. In some sense, the only difference between a char and a str would be that a char isn't iterable -- but the benefit would be that a string is an iterable (and sequence) of chars, rather than an (infinitely recursable) iterable of strings.
Also, you’d have to go through a lot of functions and decide what types they should take.
sure would -- a lot of thought to see how disruptive it would be ...
For example, does str.join still accept a string instead of an iterable of strings? Does it accept other iterables of char too?
if it accepted an iterable of either char or str, then I *think* there would be little disruption.
Can you pass a char to str.__contains__
yes, that's a no brainer, the whole point is that a string would be a sequence of chars.
or str.endswith?
I would think so -- a char would behave like a length-one string as much as possible.
What about a tuple of chars?
that's an odd one -- but I'm not sure I see the point; if you have a tuple of chars, you could "".join() them if you want a string, in any context.
Or should we take the backward-compat breaking opportunity to eliminate the “str or tuple of str” thing and instead use *args, or at least change it to “str or iterable of str (which no longer includes str itself)”?
Is this for .endswith() and friends? if so, there was discussion a while back about that -- but probably not the time to introduce even more backward incompatible changes.
And I'm not sure how much string functionality a char should have -- probably next to none, as the point is that it would be easy to distinguish from a string that happened to have one character.
Surely you’d want to be able to do things like isdigit or swapcase. Even C has functions to do most of that kind of stuff on chars.
probably -- it would be least disruptive for a char to act as much as possible the same as a length-one string -- so maybe iterability and indexability would be it.
But I think that, other than join and maybe encode and translate, there’s an obvious right answer for every str method and operator, so this isn’t too much of a problem.
not sure why encode or translate should be an issue off the top of my head -- it would surely be a unicode char :-) well, we'd have to go through all of them, and do a lot of thinking... I think the greater confusion is where can you use a char instead of a string in other places? using it as a filename, for instance, would make it pointless for at least the cases I commonly deal with (lists of filenames). I can only imagine how many "things" take a string where a char would make sense, but then it gets harder to distinguish them all.
Fixing those inconsistencies is what I meant by “go all the way to making them sequences of ints”.
I would say no -- in C a char IS an unsigned 8-bit int, but that's C -- in Python a char and a number are very different things. ord() and chr() would work, of course.
By the way, the bytes and bytearray types already do this -- index into or loop through a bytes object, you get an int.
Sure, but b'abc'.find(66) is -1, and b'abc'.replace(66, 70) is a TypeError, and so on.
I wonder if they need to be -- would we need a "byte" type, or would it be OK to accept an int in all those sorts of places?
But it might be friendlier to undo the changes and instead add a byte type like the char type for bytes to be a sequence of. I’m not sure which is better.
me neither.
Well, I don't know that it was seriously considered -- with the Unicode changes, that WOULD have been the time to do it! Again, though, it seems like it would be pretty disruptive, so a non-starter, but maybe not? -CHB -- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Christopher Barker wrote:
I've always wondered how disruptive it would be to add a char type
I'm not sure if it would help much. Usually the problem with strings being sequences of strings lies in the fact that they're sequences at all. Code that operates generically on nested sequences often has to special-case strings so that it can treat them as atomic values. Having them be sequences of something else wouldn't change that. If I were to advocate changing anything in that area, it would be to make strings not be sequences. They would support slicing, but not indexing single characters, and would not be directly iterable. If you really wanted to iterate over characters, there could be a method such as s.chars() giving a sequence view. But that would be a disruptive enough change for so little benefit that I don't expect it to ever happen. -- Greg
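A minimal sketch of that kind of view, using today's object model (the names are hypothetical):

    from collections.abc import Sequence

    class CharsView(Sequence):
        # A sequence view over a string's characters; with this, the string
        # itself would no longer need to be iterable or integer-indexable.
        __slots__ = ("_s",)

        def __init__(self, s):
            self._s = s

        def __len__(self):
            return len(self._s)

        def __getitem__(self, i):
            return self._s[i]

    print(list(CharsView("spam")))   # ['s', 'p', 'a', 'm']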

On Thu, Oct 24, 2019 at 1:13 AM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
wouldn't it? once you got to an object that couldn't be iterated, you'd know you had an atomic value. And this is why I was thinking that if chars had less functionality, it would work. This is really common code for me that I need to type check:

    for filename in sequence_of_filenames:
        open(filename)

if a char could not be used as a filename, then I'd get a similar error if a single string was passed in as if a list of numbers was passed in, say. That is, a string is a sequence of chars, not a sequence of strings, and a char can not be used as a string in many contexts.

If I were to advocate changing anything in that area, it would be to make strings not be sequences.

you are right -- that would be a great solution to the above problem. And I can't think of many real uses for iterating strings where you don't know that you want the chars, so a .chars() iterator, or maybe str.iter_chars(), would be fine. Something tells me that I've had uses for char in other contexts, but I can't think of them now, so maybe not :-) But again -- too disruptive, we've lived with this for a LONG time. -CHB
-- Christopher Barker, PhD Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython

Christopher Barker wrote:
wouldn't it? once you got to an object that couldn't be iterated, you'd know you had an atomic value.
I'm thinking of things like a function to recursively flatten a nested list. You probably want it to stop when it gets to a string, and not flatten the string into a list of characters. -- Greg
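For example, the special-casing such a flatten ends up needing; a sketch that treats str (and bytes) as atomic:

    from collections.abc import Iterable

    def flatten(obj):
        # Stop at strings: without the explicit check, "ab" would be broken
        # into "a" and "b", each of which is iterable again, recursing until
        # the recursion limit.
        if isinstance(obj, (str, bytes)) or not isinstance(obj, Iterable):
            yield obj
            return
        for item in obj:
            yield from flatten(item)

    print(list(flatten([["ab", ("cd",)], "ef", 42])))   # ['ab', 'cd', 'ef', 42]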

On Oct 24, 2019, at 14:13, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
A function to recursively flatten a nested list should only work on lists; it should stop on a string, but it should also stop on a namedtuple or a 2x2 ndarray or a dict. A function to recursively flatten arbitrary iterables, on the other hand… And I don’t think there’s any conceptual problem with strings being iterable. A C++ string is a sequence of chars. A Haskell string is a plain old (lazy linked) list of chars. And similarly in lots of other languages. And it’s rarely a problem. There are other differences that might be relevant here; I don’t think they are, but to be fair: C++ and Haskell implementations are expected to optimize everything well enough that you can treat just about any arbitrary sequence of chars as a string with reasonable efficiency, so strings being a thin convenience wrapper above that makes intuitive sense. In Python, that isn’t true; a function that loops character by character would often be too slow to use. C++ and Haskell type systems make it a little easier to say “Here’s a function that works on generic iterables of T, but when T is char here’s a more specific function”. But again, I don’t think either of these is the reason Python strings being iterable is a problem; I think it really is primarily about them being iterables of strings.

On Thu, 24 Oct 2019 at 23:47, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
But again, I don’t think either of these is the reason Python strings being iterable is a problem; I think it really is primarily about them being iterables of strings.
The *real* problem is that there's a whole load of functions that would need rewriting to accept "character or string" arguments - or a whole load of debating over whether they should only use one or another. Examples:

"a" in char_string, vs "word" in sentence. Both useful and used in real code.

",".join(list_of_stuff) vs ", ".join(list_of_stuff), and "".join(list_of_characters).

And return values: string.partition(sep) - if sep is a character, should the middle return value be a character too? Remember that the typing module needs to be able to express the type signatures of all these functions, in a way that usefully allows checking usage. Plus many, many more. And not just in the stdlib, but in millions of lines of 3rd party code. And no cheating by saying these are cases where you should use 1-character strings. The fact that people don't typically distinguish between characters and 1-character strings in real life is precisely why it's useful that Python currently doesn't make a distinction either. In case it's not glaringly obvious, I'm -1 on this, even as an exercise in speculation :-) Paul

On Oct 25, 2019, at 01:34, Paul Moore <p.f.moore@gmail.com> wrote:
That’s not the problem that this thread is trying to solve, it’s the problem with the solution to that problem. :) I’ve already said that I don’t think this is feasible, because it would be too much work and too disruptive to compatibility. If you were designing a new Python-like language today, or if you had a time machine back to the 90s, it would be a different story. But for Python 4.0—even if we wanted a 3.0-like break, which I don’t think anyone does—we can’t break all of the millions of lines of code dealing with strings in a way that forces people to not just fix that code but rethink the design.
Many of your examples are not cases where people should use 1-character strings; they’re cases where we need polymorphic APIs. But that’s not a problem. Countless other languages already do exactly that, and people use them every day. (The languages that are too weak to do that kind of polymorphism, like C and PHP, instead have to offer separate functions like strstr vs. strchr, which is manifestly less friendly. Fortunately, the fact that Python’s core API was originally loosely based on C’s string.h wouldn’t in any way force the same problem on Python or a Python-like language.) And, while there are plenty of functions that would need to treat char and 1-char str the same, there are also many—not just iter—that should only work for one or the other, such as ord, or re.search. And there are also functions that should work for both, but not do exactly the same thing—char replacement is the same thing as translate; substring replacement is not. In fact, notice what happens if you call ord on a 2-char string today: TypeError: ord() expected a character, but string of length 2 found. Python is _pretending_ that it has a character type even though it doesn’t. And you find the same thing in third-party code: some of the functions people have written would need to handle str and char polymorphically, some would need to handle both but with different logic, and many would need to handle just one or the other. Which is exactly why it couldn’t be fixed with a 3to4 script, and people would instead need to rethink the design of a bunch of their existing functions before they could upgrade.
In case it's not glaringly obvious, I'm -1 on this, even as an exercise in speculation :-)
I’m -1 on this, but I think speculating on how it could be solved is the best way to show that it can’t and shouldn’t be solved.

25.10.19 15:53, Andrew Barnert via Python-ideas пише:
If you were designing a new Python-like language today, or if you had a time machine back to the 90s, it would be a different story.
Interesting: how far into the past would you need to travel? Initially, builtin types did not have methods or properties, and the iterable protocol did not exist. Adding this would require too much work, and I am not sure Guido would like how much complexity it adds to his simple language.

On Oct 25, 2019, at 06:26, Serhiy Storchaka <storchaka@gmail.com> wrote:
Well, the str methods are largely carried over from the functions in the string module, which was there before 1.0. And I think the ord builtin goes back pretty far as well. So ideally, back to the start. On the other hand, it’s not like there was a huge ecosystem of third-party modules using Python 0.9 that Guido couldn’t afford to break, so if your time machine couldn’t go back quite that far, it might be ok to do it as late as, say, 2.2.
Adding this will require too much work, and I am not sure Guido would like how much complexity it adds to his simple language.
Adding it today would certainly require too much work—not so much for Python itself as for thousands of third-party libraries and even more applications, but that’s even worse. Even if it weren’t just to fix a small wart that people have been dealing with for decades, it would be too much. That’s why I’m -1 on it. But adding it in early Python would have been very little work. Just add one more builtin type, and change a dozen or so builtin and string module functions, and you’re done. And it would be very little complexity, too. Yes, it would mean one more built in type, but nothing else is complex about it. The Smalltalk guys who were advertising that their whole language fits on an index card could handle chars. And it’s not like it would be an unprecedented innovation—most languages that existed at the time, from Lisp to Perl, had strings as sequences of either chars or integers (or, as with C, of chars but char is just a type of integer). If anything, it was ABC that was innovative for making strings out of strings (although Tcl and BASIC also do); it just turned out to collide with other innovations that Python got over the next couple of years, in ways nobody could have imagined. And the hard bit would be describing what the Python 2.x language and ecosystem would look like to Guido so you could explain why it would matter. The idea that his language would have an “iterable protocol” that could be implemented by not just lists and strings but also user-defined types using special methods, and automatically by functions using a special coroutine semantics, and also by some syntax borrowed from Haskell, and that was checkable via an abstract base class using implicit structural testing…

On Wed, Oct 23, 2019, at 19:00, Christopher Barker wrote:
There's lots of functionality that's on str that, if I were designing the language, I'd put on character. Character-type functions are definitely in - and, frankly, str.isnumeric is an attractive nuisance; it may well make sense to remove it from str and require explicit use of all(). upper/lower is tricky - cases like ß can change the length of a string... maybe put it on char but return a string? No reason not to allow + or *: concatenating chars to each other or to strings, or multiplying a char into a string.

Since this is Python 4000, where everything's made up and the points don't matter... I think there shouldn't be a char type, and also strings shouldn't be iterable, or indexable by integers, or anything else that makes them appear to be tuples of code points. Nothing good can come of decomposing strings into Unicode code points. The code point abstraction is practically as low level as the internal byte encoding of the strings. Only lexing libraries should look at strings at that level, and you should use a well written and tested lexing library, not a hacky hand-coded lexer. Someone in this thread mentioned that they'd used ' '.join on a string in production code. Was the intent really to put a space between every pair of code points of an arbitrary string? Or did they know that only certain code points would appear in that string? A convenient way of splitting strings into more meaningful character units would make the actual intent clear in the code, and it would allow for runtime testing of the programmer's assumptions. Explicit access to code points should be ugly – s.__codepoints__, maybe. And that should be a sequence of integers, not strings like "́".
it’s probably worth at least considering making UTF-8 strings first-class objects. They can’t be randomly accessed,
They can be randomly accessed by abstract indices: objects that look similar to ints from C code, but that have no extractable integer value in Python code, so that they're independent of the underlying string representation. They can't be randomly accessed by code point index, but there's no reason you should ever want to randomly access a string by a code point index. It's a completely meaningless operation.

On Oct 25, 2019, at 20:44, Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
Nothing good can come of decomposing strings into Unicode code points.
Which is why Unicode defines extended grapheme clusters, which are the intended approximation of human-language characters, and which can be made up of multiple code points. You don’t have to worry about combining marks and code units and normalization, and so on. And it is definitely possible to design a language that deals with clusters efficiently. Swift does it, for example. And they treat code units as no more fundamental than the underlying code points. (And, as you’d expect, code points and UTF-8/16/32 code units are appropriately-sized integers, not chars.) In fact, Go treats the code units as _less_ fundamental: all strings are stored as UTF-8, so you can access the code units by just casting to a byte[], but if you want to iterate the code points (which are integers), you have to import functions for that, or encode to a UTF-32 not-a-string byte[], or use some special-case magic sugar. And Rust is essentially the same (but with more low-level stuff to write your own magic sugar, instead of hardcoded magic).
OK, but how do you write that lexer? Most people should just get it off PyPI, but someone has to write it and put it on PyPI, and it has to have access to either grapheme clusters, or code units, or code points, or there’s no way it can lex anything. Unless you’re suggesting that the lexing library needs to be written in C (and then you’ll need different ones for Jython, etc.)?
Sure, but as you argued, code points are almost never what you want. And clusters don’t have a fixed size, or integer indices; how would you represent them except as a char type?
You certainly can design a more complicated iteration/indexing protocol that handles this—C++, Swift, and Rust have exactly that, and even Python text files with seek and tell are roughly equivalent—but it’s explicitly not random access. You can only jump to a position you’ve already iterated to. For example, in Swift, you can’t write `s[..<20]`, but if you write `s.firstIndex(of: ",")`, what you get back isn’t a number, it’s a String.Index, and you can use that in `s[..<commaIndex]`. And a String.Index is not a random-access index, it’s only a bidirectional index—you can get the next or previous value, but you can’t get the Nth value (except in linear time, by calling next N times). Of course usually you don’t want to search for commas, you want to parse CSV or JSON or something so you don’t even care about this. But when you’re writing that CSV or JSON or whatever module, you do.
Yes, that’s my argument for why it’s likely acceptable that they can’t be randomly accessed, which I believe was the rest of the sentence that you cut off in the middle. However, I think that goes a _tiny_ bit too far. You _usually_ don’t want to randomly access a string, but not _never_. Let’s say you’re at the REPL, and you’re doing some exploratory hacking on a big hunk of text. Being able to use the result of that str.find from a few lines earlier in a slice is often handy. So it has to be something you can read and type easily. And it’s hard to get easier to type than an int. And notice that this is exactly how seek and tell work on text files. I don’t think the benefit of being able to avoid copy-pasting some ugly thing or repeating the find outweighs the benefit of not having to think in code units—but it is still a nonzero benefit. And the fact that Swift can’t do this in its quasi-REPL is sometimes a pain. Also, Rust lets you randomly access strings by UTF-8 byte position and then ask for the next character boundary from there, which is handy for hand-optimizing searches without having to cast things back and forth to [u8]. But again, I don’t think that benefit outweighs the benefit of not having to think in either code units or code points. Anyway, once you get rid of the ability to randomly access strings by code point, this means you don’t need to store strings as UTF-32 (or as Python’s clever UTF-8-or-16-or-32). When you read a UTF-8 text file (which is most text files you read nowadays), its buffer can already be the internal storage for a string. In fact, you can even mmap a UTF-8 text file and treat it as a Unicode string. (See RipGrep, which uses Rust’s ability to do this to make regex searching large files both faster and simpler at the same time, if it’s not obvious why this is nice.)

On Fri, Oct 25, 2019 at 08:44:17PM -0700, Ben Rudiak-Gould wrote:
Nothing good can come of decomposing strings into Unicode code points.
Sure there is. In Python, it's the fastest way to calculate the digit sum of an integer. It's also useful for implementing classical encryption algorithms, like Playfair.

Introspection, e.g. if I want to know if a string contains any surrogates, I can do this:

    any('\uD800' <= c <= '\uDFFF' for c in s)

Or perhaps I want to know if the string contains any "astral characters", in which case they aren't safe to pass to a Javascript or Tcl script which doesn't handle them correctly:

    any(c > '\uFFFF' for c in s)

How about education? One of the things I can do with strings is:

    for c in string:
        print(unicodedata.name(c))

or possibly even just

    # what is that weird symbol in position five?
    print(unicodedata.name(string[5]))

to find out what that weird character is called, so I can look it up and find out what it means. Knowing stuff is good, right? Or do you think the world would be better off if it was really hard and "ugly" (your word) for people like me to find out what code points are called and what their meaning is?

Rather than just telling us that we shouldn't be allowed to access code points in strings, would you please be explicit about *why* this access is a bad thing? And if code points are "bad", then what should we be allowed to do with strings? If code points are too low level, then what is an appropriate level? I guess you're probably going to mention grapheme clusters. (If you aren't, then I have no idea what your objection is based on.)

Grapheme clusters are a hard problem to solve, since they are dependent on the language and the locale. There's a Unicode algorithm for splitting on graphemes, but it ignores the locale differences. Processing on graphemes is more expensive than on code points. There is, as far as I can tell, no O(1) access to graphemes in a string without pre-processing them and keeping a list of their indices. For many people, and for many purposes, paying that extra cost in either time or memory is just a total waste, since they're hardly ever going to come across a grapheme cluster. Few people have to process completely arbitrary strings: their data tends to come from a particular subset of natural language strings, and for some such languages, you might go a whole lifetime without coming across a grapheme cluster of more than one code point. (This may be slowly changing, even for American English, driven in part by the use of emoji and variation selectors.)

If Python came with a grapheme processing API, I would probably use it. But in the meantime, the code point API is "good enough" for most things I do with strings. And for the rest, graphemes are too low-level: I need things like sentences, clauses, words, word stems, prefixes and suffixes, syllables etc. But even if Python had an excellent, fast grapheme API, I would still want a nice, clean, fast interface that operates on code points. -- Steven

On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:
Indeed, and Guido did rule some time ago that 4.0 would be an ordinary transition, like 3.7 to 3.8, not a big backwards-breaking version change. I've taken up referring to some hypothetical future 3.0-like version as Python 5000 (not 4000) in analogy to Python 3000, but to emphasise just how far away it will be.
I don't see why you can't make arrays of UTF-8 indexable and provide random access to any code point. I understand that `str` in MicroPython is implemented that way. The obvious implementation means that you lose O(1) indexing (to reach the N-th code point, you have to count from the beginning each time) but save memory over other encodings. (At worst, a code-point in UTF-8 takes three bytes, compared to four in UTF-16 or UTF-32.) There are ways to get back O(1) indexing, but they cost more memory.

But why would you want an explicit UTF-8 string object? What benefit do you get from exposing the fact that the implementation happens to be UTF-8 rather than something else? (Not rhetorical questions.)

If the UTF-8 object operates on the basis of Unicode code points, then it's just a str, and the implementation is just an implementation detail.

If the UTF-8 object operates on the basis of raw bytes, with no protection against malformed UTF-8 (e.g. allowing you to insert stray bytes in the range 0x80-0xFF where they aren't valid UTF-8, or by splitting apart a two- or three-byte UTF-8 sequence) then it's just a bytes object (or bytearray) initialised with a UTF-8 sequence. That is, as I understand it, what languages like Go do. To paraphrase, they offer data types they *call* UTF-8 strings, except that they can contain arbitrary bytes and be invalid UTF-8. We can already do this, today, without the deeply misleading name: string.encode('utf-8') and then work with the bytes.

I think this is even quite efficient in CPython's "Flexible string representation". For ASCII-only strings, the UTF-8 encoding uses the same storage as the original ASCII bytes. For others, the UTF-8 representation is cached for later use. So I don't see any advantage to this UTF-8 object. If the API works on code points, then it's just an implementation detail of str; if the API works on code units, that's just a fancy name for bytes. We already have both str and bytes so what is the purpose of this utf8 object? -- Steven
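To make the trade-off concrete, here is a minimal sketch (helper names invented) of that O(N) indexing: finding the n-th code point in raw UTF-8 by counting lead bytes from the start.

    def utf8_index(buf, n):
        # Byte offset of the n-th code point in valid UTF-8 data.
        # O(N): scan from the start, counting lead bytes
        # (anything that is not a 0b10xxxxxx continuation byte).
        count = 0
        for offset, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:
                if count == n:
                    return offset
                count += 1
        raise IndexError(n)

    def utf8_codepoint(buf, n):
        # Decode just the n-th code point.
        start = utf8_index(buf, n)
        end = start + 1
        while end < len(buf) and buf[end] & 0xC0 == 0x80:
            end += 1
        return buf[start:end].decode("utf-8")

    data = "abÇÐεф".encode("utf-8")
    print(utf8_codepoint(data, 5))   # 'ф', found only by scanning from the start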

On Sat, Oct 26, 2019, 7:29 PM Steven D'Aprano
(At worst, a code-point in UTF-8 takes three bytes, compared to four in UTF-16 or UTF-32.)
http://www.fileformat.info/info/unicode/char/10000/index.htm

On Sat, Oct 26, 2019 at 07:38:19PM -0400, David Mertz wrote:
Oops, you're right, UTF-8 can use four code units (four bytes) too, I forgot about that. Thanks for the correction. So in the worst case, if your string consists of all (let's say) Linear-B syllables, UTF-8 will use four bytes per character, the same as UTF-32. But for strings consisting of a mix of (say) ASCII, Latin-1, etc with only a few Linear-B syllables, UTF-8 will use a lot less memory. -- Steven

Absolutely, utf-8 is a wonderful encoding. And indeed, worst case is the same storage requirement as utf-16 or utf-32. For O(1) random access into all strings, we have to eat 32-bits per character, one way or the other, but of course there are space/speed trade-offs one could make for intermediate behavior. On Sat, Oct 26, 2019, 7:58 PM Steven D'Aprano <steve@pearwood.info> wrote:

On Sat, Oct 26, 2019, at 20:26, David Mertz wrote:
A string representation consisting of (say) a UTF-8 string, plus an auxiliary list of byte indices of, say, 256-codepoint-long chunks [along with perhaps a flag to say that the chunk is all-ASCII or not] would provide O(1) random access, though, of course, despite both being O(1), "single index access" vs "single index access then either another index access or up to 256 iterate-forward operations" aren't *really* the same speed.
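A crude sketch of that representation (class name and chunk size chosen arbitrarily for illustration):

    CHUNK = 256

    class ChunkedUTF8:
        def __init__(self, data):
            # data: valid UTF-8 bytes. Record the byte offset of every
            # CHUNK-th code point so lookups only scan within one chunk.
            self.data = bytes(data)
            self.offsets = [0]
            count = 0
            for off, b in enumerate(self.data):
                if b & 0xC0 != 0x80:              # lead byte: a new code point
                    if count and count % CHUNK == 0:
                        self.offsets.append(off)
                    count += 1
            self.length = count

        def __len__(self):
            return self.length

        def __getitem__(self, i):
            if not 0 <= i < self.length:
                raise IndexError(i)
            off = self.offsets[i // CHUNK]        # O(1) jump to the chunk...
            remaining = i % CHUNK
            while remaining:                      # ...then scan at most CHUNK-1 code points
                off += 1
                if self.data[off] & 0xC0 != 0x80:
                    remaining -= 1
            end = off + 1
            while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
                end += 1
            return self.data[off:end].decode("utf-8")

    s = ChunkedUTF8("abÇÐεф".encode("utf-8"))
    print(len(s), s[5])                           # 6 ф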

Ok, true enough that dereferencing and limited linear search is still O(1). I could have phrased that slightly more precisely. But the trade-off part is true. Indexing into character 1 million of a utf-32 string is just one memory offset calculation, then following the reference. Indexing into the utf-8-with-offset-list is a couple dereferences, and on average 128 sequential scans. So it's not worse big-O, but it's around 100x slower... Still a lot faster than sequential scan of all 1 million though. What does actual CPython do currently to find that s[1_000_000], assuming utf-8 internal representation? On Sat, Oct 26, 2019, 11:02 PM Random832 <random832@fastmail.com> wrote:

PEP 393: "The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes)" ... Ah, OK. I get it. The one-byte representation is only ASCII, which happens to match utf-8. Well, the latin-1 oddness. But the internal representation is utf-16 or utf-32 if the string contains code points requiring multi-byte representation. On Sun, Oct 27, 2019, 12:19 AM Chris Angelico <rosuav@gmail.com> wrote:

On Sat, Oct 26, 2019 at 11:34:34PM -0400, David Mertz wrote:
What does actual CPython do currently to find that s[1_000_000], assuming utf-8 internal representation?
CPython doesn't use a UTF-8 internal representation. MicroPython *may*, but I don't know if they do anything fancy to avoid O(N) indexing. IronPython and Jython use whatever .Net and Java use. CPython uses a custom implementation, the Flexible String Representation, which picks the smallest code unit size required to store all the characters in the string.

    # Pseudo-code
    c = max(string)  # highest code point
    if c <= '\xFF':
        # effectively ASCII or Latin-1
        use one byte per code point
    elif c <= '\uFFFF':
        # effectively UCS-2, or UTF-16 without the surrogate pairs
        use two bytes per code point
    else:
        assert c <= '\U0010FFFF'
        # effectively UCS-4, or UTF-32
        use four bytes per code point

-- Steven
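As a rough illustration (not normative; exact object overheads vary by CPython version, so only the per-character slope matters), the width choice is visible from Python:

    import sys
    for ch in ["a", "\xe9", "\u0394", "\U0001F600"]:
        small, big = sys.getsizeof(ch * 10), sys.getsizeof(ch * 1000)
        print(hex(ord(ch)), (big - small) // 990, "byte(s) per extra character")
    # On CPython 3.x this prints 1, 1, 2, and 4 bytes per character respectively.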

On Oct 26, 2019, at 21:33, Steven D'Aprano <steve@pearwood.info> wrote:
IronPython and Jython use whatever .Net and Java use.
Which makes them sequences of UTF-16 code units, not code points. Which is allowed for the Python 2.x unicode type, but would violate the rules for 3.x str, but neither one has a 3.x. If you want to deal with code points, you have to handle surrogates manually. (Actually, IIRC, one of the two has a str type that, despite being 2.x, is unicode rather than bytes, but with some extra undocumented functionality to smuggle bytes around in a str and have it sometimes work.)

On Sun, Oct 27, 2019, at 03:39, Andrew Barnert via Python-ideas wrote:
I do like the way GNU Emacs represents strings - abstractly, a string can contain any character, or any byte > 127 distinct from a character. Concretely, IIRC they are represented either as pure byte strings or as UTF-8 with "bytes > 127" represented as the extended UTF-8 representations of code points 0x3FFF80 through 0x3FFFFF [values between 0x110000 and 0x3FFF7F are used for other purposes].

On Oct 26, 2019, at 19:59, Random832 <random832@fastmail.com> wrote:
A string representation considering of (say) a UTF-8 string, plus an auxiliary list of byte indices of, say, 256-codepoint-long chunks [along with perhaps a flag to say that the chunk is all-ASCII or not] would provide O(1) random access, though, of course, despite both being O(1), "single index access" vs "single index access then either another index access or up to 256 iterate-forward operations" aren't *really* the same speed.
Yes, but that means constructing a string takes linear time, because you have to construct that index. You can’t just take a read/recv/mmap/result of a C library/whatever and use it as a string without doing linear work on it first. And you have to do that on _every_ string, even though you only need the index on a small percentage of them. (Unless you can statically look ahead at the code and prove that a string will never be indexed—which a Haskell compiler can do, but I don’t think it’s remotely feasible for a language like Python.) If you redesign your find, re.search, etc. APIs to not return character indexes, then I think you can get away with not having character-indexable strings. On the rare occasions where you need it, construct a tuple of chars. If that isn’t good enough, you can easily write a custom object that wraps a string and an index list together that acts like a string and a sequence of chars at the same time. There’s no need for the string type itself to do that.

On Sun, Oct 27, 2019 at 12:10:22AM -0700, Andrew Barnert via Python-ideas wrote:
If string.index(c) doesn't return the index of c in string, then what does it return? I think you are conflating the public API based on characters (to be precise: code points) with some underlying implementation based on bytes. Given zero-based indexing, and the string: "abÇÐεф" the index of "ф" better damn well be 5 rather than 8 (UTF-8), 10 (UTF-16) or 20 (UTF-32) or I'll be knocking on the API designer's door with a pitchfork and a flaming torch *wink* And returning <AbstractIndex object at 0xb7ce1bf0> is even worse. Strings might not be implemented as an array of characters. They could be a rope, a linked list, a piece table, a gap buffer, or something else. The public API which operates on code points should not depend on the implementation. Regardless of how your string is implemented, it is conceptually a sequential array of N code points indexed from 0 to N-1. -- Steven
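Those numbers are easy to verify in today's Python (using the -le encodings so a BOM doesn't shift the offsets):

    s = "abÇÐεф"
    print(s.index("ф"))                                            # 5
    print(s.encode("utf-8").index("ф".encode("utf-8")))            # 8
    print(s.encode("utf-16-le").index("ф".encode("utf-16-le")))    # 10
    print(s.encode("utf-32-le").index("ф".encode("utf-32-le")))    # 20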

On Sun, Oct 27, 2019 at 11:43 PM Steven D'Aprano <steve@pearwood.info> wrote:
And in response to the notion that you don't actually need the index, just a position marker... consider this:

  File "/home/rosuav/tmp/demo.py", line 1
    print("Hello, world!')
                         ^
SyntaxError: EOL while scanning string literal

Well, either that, or we need to make it so that " "*<AbstractIndex object at 0xb7ce1bf0> results in the correct number of spaces to indent it to that position. That ought to bring in plenty of pitchforks... ChrisA

On Oct 27, 2019, at 05:49, Chris Angelico <rosuav@gmail.com> wrote:
So if those 12 glyphs take 14 code units because you’re using Stephen’s string and it’s in NFKD, getting 14 and then indenting two spaces too many (as Python does today) is not just a good-enough best effort, but something we actually want to ensure at all costs by making sure you always deal in code unit indexes?
Would you still bring pitchforks for " " * StrIndex(chars=12, points=14, bytes=22)? If so, then you require code to spell it as " " * index.chars instead of " " * index. It’s not like the namedtuple/structseq/dataclass/etc. repr is some innovative new idea nobody’s ever thought of to get a useful display, or like people can’t figure out how to get the index out of a regex match object. This is all simple stuff; I don’t get the incredulity that it could possibly be done. (Especially given that there are other languages that do exactly the same thing, like Swift, which ought to be proof that it’s not impossible.) (Could it be done without breaking a whole ton of existing code? I strongly doubt it. But my whole argument for why we shouldn’t be trying to “fix” strings in “Python 4000” in the first place is that the right fix probably cannot be done in a way that’s remotely feasible for backward compatibility. So I hope you wouldn’t expect that something additional that I suggested could be considered only if that unfeasible fix were implemented would itself be feasible.)

On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas wrote:
*scratches head* I'm not really sure how glyphs (the graphical representation of a character) come into this discussion, but for what it's worth, I count 22, not 12 (excluding the leading spaces). I'm not really sure how you get "14 code units" either, since whatever internal representation you use (ASCII, Latin-1, UTF-8, UTF-16, UTF-32) the string will be one code unit per entity, whether we are counting code points, characters or glyphs, since it is all ASCII. I don't know of any encoding where ASCII characters require more than one code unit.
because you’re using Stephen’s string and it’s in NFKD, getting 14 and then indenting two spaces too many (as Python does today)
You mean something like this?

py> value = äë +* 42
  File "<stdin>", line 1
    value = äë +* 42
                  ^
SyntaxError: invalid syntax

(the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')

Yes, that looks like a bug to me, but a super low priority one to fix. (This is assuming that the Python interpreter promises to line the caret up with the offending symbol "always", rather than just making a best effort to do so.) And probably tough to fix too: I think you need to count in grapheme clusters, not code points, but even that won't fix the issue since it leaves you open to the *opposite* problem of undercounting if the terminal or IDE fails to display combining characters properly:

value = a¨e¨ +* 42
            ^
SyntaxError: invalid syntax

I had to fake the above, because I couldn't find a terminal on my system which would misdisplay COMBINING DIAERESIS, but I've seen editors do it. It's not just a problem with combining characters. If I recall correctly the Unicode standard doesn't require variant selectors to be displayed as a single glyph. So you might not know how wide a grapheme cluster is unless you know the capabilities of the application displaying it. Handling text in its full generality, including combining characters, emojis, flags, East Asian wide characters, etc, is really tough to do right. For the Python interpreter, it would require a huge amount of extra work for barely any payoff since 99.9% of Python syntax errors are not going to include any of the funny cases. As I think I said earlier, if Python had an API that understood grapheme clusters, I would probably use it in preference to the code point API for most text handling code. But let's not make the perfect the enemy of the good: if you have a line of source code which contains flags, Asian wide characters, combining accents, emoji selectors etc and the caret doesn't quite line up in the right place, oh well, que sera sera. [...]
Hell yes. If I need 12 spaces, why should I be forced to work out how many bytes the interpreter uses for that? Why should I care? I want 12 spaces, I don't care if that's 12 bytes or 24 or 48 or 7. I might not even know what Python's internal encoding is. Many people don't. To say nothing of the obnoxiousness of forcing me to write 39 characters "StrIndex(...)" when two would do. What if I get it wrong, and think that 12 characters is 6 points and 42 bytes when it's actually 8 points and 46 bytes? Working in code points is not perfect, but "code point == character" is still an acceptable approximation for most uses of Python. And as Unicode continues to gather more momentum, eventually we'll need more powerful, more abstract but also more complicated APIs. But forcing the coder to go from things they work with ("I want 12 smiley faces") to trying to work with the internal implementation is a recipe for chaos: "Each smiley is two code points, a base and a variant selector, but they're astral characters so I have to double it, so that's 48 code points, and each code point is four bytes, so that's 188 bytes, wait, do I include the BOM or not?"
Can you link to an explanation of what Swift *actually* does, in detail?
(Could it be done without breaking a whole ton of existing code? I strongly doubt it.)
Of course it can be: we leave the code-point string API alone, as it is now, and build a second API based on grapheme clusters, emoji variants, etc. to run in parallel. This allows people to transition from one to the other if and when they need to, rather than forcing them to pay the cost of working in graphemes whether they need it or not.
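Something like this rough sketch, say, leaning on the third-party `regex` module (its `\X` pattern matches an extended grapheme cluster); the helper names here are invented purely for illustration:

    import regex  # third-party module; its \X pattern matches an extended grapheme cluster

    # A "parallel API": plain str keeps its familiar code-point behaviour,
    # while helpers like these offer the grapheme-cluster view alongside it.

    def graphemes(s):
        """Iterate over the extended grapheme clusters of s."""
        return iter(regex.findall(r"\X", s))

    def grapheme_len(s):
        """Number of extended grapheme clusters in s."""
        return len(regex.findall(r"\X", s))

    s = "cafe\N{COMBINING ACUTE ACCENT}"
    print(len(s), grapheme_len(s))   # 5 code points, 4 grapheme clusters

An eager scan like this pays the linear cost every time you ask, which is exactly the trade-off under discussion, but it shows the shape of the API.

-- Steven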

On Oct 27, 2019, at 18:00, Steven D'Aprano <steve@pearwood.info> wrote:
Because, assuming you’re using a monospace font, the number of glyphs to the error is how many spaces you need to indent. This example happens to be pure ASCII, so the count of glyphs, extended grapheme clusters, code units, and code points happens to be the same. But just change that e to an è made of two code units (a base letter plus a combining mark)—like the ç in your previous example might have been—and now there are still the same number of glyphs and clusters, but that count is one fewer than the number of code points and code units. Extended grapheme clusters are intended to be the best approximation of “characters” in the Unicode standard. Code units are not.
but for what it's worth, I count 22, not 12 (excluding the leading spaces).
Sorry; that was a typo. Plus, I miscounted on top of the typo; I meant to count the spaces.
Yes. This is a general bug: it shows up everywhere that you count code units intending to use that as a count of glyphs or characters, both in Python itself and in third-party libraries and applications. This is one of the most trivial examples, and you obviously wouldn’t break backward compatibility with everything solely to fix this example.

And I don’t know why I have to keep repeating this, but one more time: I’m not proposing to change Python; I’m arguing to _not_ change Python, because it’s already good enough, and the suggested improvement wouldn’t make it right despite breaking lots of code, and making it right is a big thing that would break even more code. If I were designing a new language, I would do it right from the start, and it would not have this bug, or any of the other manifestations of the same issue, but Python 4000 (or even 5000) is not an opportunity to design a new language.

(And to be clear: Python’s design made perfect sense when it was chosen; Unicode has just gotten more complicated since then. In fact, most other languages that adopted Unicode as early as Python got permanently stuck with the UCS-2 assumption, forcing all user code to deal with UTF-16 code units forever.)
Well, the reason I called it a good-enough best effort is because I assume that it’s only meant to be a best effort, and I think it’s good enough for that. I’m not the one who said people would be up in arms if that were broken, I’m the one arguing that people are fine with it being broken as long as it’s usually good enough.
And probably tough to fix too: I think you need to count in grapheme clusters, not code points,
Yes, that’s the whole point of the message you were responding to: extended grapheme clusters are the Unicode approximation of characters; code units are not. And a randomly-accessible sequence of grapheme clusters is impossible to do efficiently, but a sized iterable container, or a sequence-like thing that’s indexable by special indexes but not by integers, is. So tying the string type even more closely to code units would not fix it; changing the way it works as a Sequence would not fix it.
That’s a matter of working around broken editors, terminals, and IDEs—which do exist, but are uncommon, and getting less common. Not having a workaround for a broken editor that most people don’t use is not a bug in the same way as being broken in a properly-working environment is. (Not having a workaround for something broken that half the users in the world have to deal with, like Windows cmd.exe, would be a different story, of course. You can claim that it’s Windows’ bug, not yours, but that won’t make users happy. But I’m pretty sure that’s not an issue here.)
Obviously you wouldn’t redesign the whole text API just to make syntax error carets line up. You would do that to make thousands of different things easier to write correctly, and lining up those carets is just one of those things, and nowhere near the most important one.
If you know you need 12 spaces, you just multiply by 12; why do you think you need to work anything out? Adding str * StrIndex doesn’t require taking away str * int. Your example implied that you would be working out that count in some way—say, by calling str.find—and that you and many others would be horrified if that return value were not an integer, but you could multiply it by a string anyway. I don’t know why you see anything wrong with that, but I guessed that maybe it was because you couldn’t see, at the REPL, how many spaces you were multiplying. Having the thing that’s returned by str.find have the repr CharIndex(chars=12, points=14, bytes=22) instead of the generic repr would solve that. If that isn’t your problem with being able to multiply a str by a StrIndex, then I have no other guesses for what you think people would be raising pitchforks over.
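For concreteness, the kind of object I have in mind is nothing fancier than this; every name is invented, and a real version would be produced by the string methods themselves rather than built by hand:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CharIndex:
        """Hypothetical result of str.find and friends."""
        chars: int   # extended grapheme clusters
        points: int  # code points
        bytes: int   # UTF-8 code units

        def __mul__(self, s):
            # " " * CharIndex(...) repeats by the *character* count,
            # which is what "indent out to the caret" actually wants.
            return s * self.chars

        __rmul__ = __mul__

    idx = CharIndex(chars=12, points=14, bytes=22)
    print(idx)               # CharIndex(chars=12, points=14, bytes=22)
    print(len(" " * idx))    # 12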
The reference documentation for String starts at https://developer.apple.com/documentation/swift/string. (It should be the first thing that comes up in any search engine for Swift string.) You can follow the links from there to String.Index and String.Iterator, and from either of those to BidirectionalCollection, and from there to Collection, which explains how indexing works in general. There’s probably an easier-to-understand description at https://docs.swift.org/swift-book but it may not explain *exactly* what it does, because it’s meant as a user guide.

Two things that may be confusing: Swift uses the exact same words as Python for its iteration/etc. protocols but all with different meanings (e.g., a Swift Sequence is a Python Iterable; a Python Sequence is a Swift IndexableCollection; etc.), and Swift makes heavy use of static typing (e.g., just as there are no separate display literals for Array, Set, etc., there are no separate display literals for Character and String; the literal "x" is a Character if you store it in a Character lvalue, a len-1 String if you store it in a String, and a compile-time type error if you store it in a Double).

I think that we're more or less in broad agreement, but I wanted to comment on this:

On Sun, Oct 27, 2019 at 09:41:00PM -0700, Andrew Barnert wrote:
I don't think that's quite correct. See:

http://www.unicode.org/glossary/#abstract_character
http://www.unicode.org/glossary/#character
http://www.unicode.org/glossary/#extended_grapheme_cluster
http://www.unicode.org/glossary/#code_point

From the glossary definition of code point: "A value, or position, for a character, in any coded character set." In other words, the code point is a numeric code such as U+0041 that represents a character such as "A". (Except when it is a numeric code that represents a non-character.)

And from definitions D60 and D61 here: http://www.unicode.org/versions/Unicode12.1.0/ch03.pdf

"Grapheme clusters and extended grapheme clusters may not have any particular linguistic significance"

"The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean SYLLABLE) together with any number of nonspacing marks applied to it." [Emphasis added.]

"A grapheme cluster is similar, but not identical to a combining character sequence."

So it is much more complicated than just "code point != character, extended grapheme cluster = character". Lots of code points are characters; lots of graphemes aren't characters but syllables or some other linguistic entity, or no linguistic entity at all; and lots of things that are characters aren't graphemes, such as combining character sequences. And none of this mentions what to do with variation selectors, flags etc. The whole thing is very complicated and I don't pretend to understand all the details. (Until now, I thought that combining character sequences were grapheme clusters. Apparently they aren't.)

-- Steven

No, what I’m doing is avoiding conflating the public API (based on characters) with the underlying representation (based on code points), treating code points as no more fundamental than the code units. You can still iterate the code points if you want to, because that’s occasionally useful. And you can also iterate the UTF-8 code units, because that’s also occasionally useful.
Really? Even if the string is in NFKD, as it would be if this were, say, the name of a file on a standard Mac file system? Then that Ç character is stored as the code unit U+0043 followed by the code unit U+0327, rather than the single unit U+00C7. So had it still better be 5, not 6? If so, Python 3 is broken, and always has been; where’s your pitchfork? And what were you going to do with that 5 anyway that it has to be an integer? Without a use case, you’re just demanding infinite flexibility regardless of what the cost might be.

You _could_ make this work by building a grapheme cluster index at construction time for every string, or by storing strings as an array of grapheme clusters that are themselves arrays of code points rather than as a flat array, or by normalizing every string at construction time. But do you actually want to do any of those things, or is guaranteeing 5 rather than 6 there not worth the cost?

Also, have you ever used seek and tell on a text file? What do you think tell gives you? According to the language spec, it could be anything and you have to treat it as an abstract index; I think in current CPython it’s a byte index. Where’s your pitchfork there?
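The effect is easy to demonstrate with nothing but the stdlib; this is a toy string rather than the earlier example, but it is exactly the same issue:

    import unicodedata

    nfc = unicodedata.normalize("NFC", "Çat")   # ['Ç', 'a', 't']
    nfd = unicodedata.normalize("NFD", "Çat")   # ['C', COMBINING CEDILLA, 'a', 't']

    print(nfc == nfd)      # False: equality compares code points, not characters
    print(nfc.find("a"))   # 1
    print(nfd.find("a"))   # 2 -- same text, different "index of the same character"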
And returning <AbstractIndex object at 0xb7ce1bf0> is even worse.
Why? That object can be used to index/slice/start finding at/etc. I suggested earlier that it could also have attributes that give you the integer character, code unit (byte), and, if you really want it, code point index. If you have a use for one of those, you use the one you have a use for. If not, why do you need it to be equal to any of those three integers, much less the least useful of them? If you’re just concerned about the REPL, then it can be <CharIndex(5) at 0xb7ce1bf0>, or even something eval-able like CharIndex(chars=5, units=6, bytes=10). Which isn’t as nice as a number I can just spot a few lines back and retype (as I mentioned before, this is occasionally an annoyance when dealing with Swift), but that’s a tradeoff that allows you to see the number 5 that you’re insisting you’d better be able to get even though you can’t actually use the number 5.
If you want a public API that’s independent of implementation, where a string could be a linked list, then you want a public API that doesn’t include indexing. If your language comes with fundamental builtin types where the [] operator takes linear time, then your language doesn’t have a [] operator like Python’s or C++’s or that of most other languages with the same syntax; it has something that looks misleadingly like [] in other languages but has to be used differently.

On Sun, Oct 27, 2019, at 03:10, Andrew Barnert wrote:
Constructing a string already takes linear time because you have to copy it into memory managed by the Python garbage collector. And you can track whether you'll need the index in one pass while copying, rather than, as currently, having to do one pass to pick a representation and another to actually perform the copying and conversion, so my suggestion may have a cache locality advantage over the other way.

On Nov 2, 2019, at 20:33, Random832 <random832@fastmail.com> wrote:
Not necessarily. There are certainly _some_ strings that come into the interpreter (or an extension module) as externally-allocated things that have to be copied. But not all, or even most, strings. For things like reading a file or recving from a socket, you allocate a buffer which is managed by your GC, and the string gets placed there, so there’s no need to copy it. When you mmap a file, you know the lifetime of the string is the lifetime of the mmap, so you don’t track it separately, much less copy it. And so on.

Also, even when you do need to copy, a memcpy is a whole lot faster than a loop, even though they are both linear. Especially when that loop has additional operations (maybe even including a conditional that branches 80/20 or worse). But even without that, copying byte by byte, rather than by whatever chunks the CPU likes, can already be 16x as slow. Go often ends up copying strings unnecessarily, but the memcpy is still so much faster than the decode that Java/C#/Python/Ruby does that Go fanatics like to brag about their fast text handling (until you show them some Rust or Swift code that’s even faster as well as more readable…).
And you can track whether you'll need the index in one pass while copying, rather than, as currently, having to do one pass to pick a representation and another to actually perform the copying and conversion, so my suggestion may have a cache locality advantage over the other way.
Sure, the existing implementation of building strings is slow, and that’s what keeping strings in UTF-8 is intended to solve; if your suggestion makes it take 1/4th as long (which seems possible, but obviously it’s just a number I pulled out of thin air), that’s nice—but nowhere near as nice as completely eliminating that cost. And most strings you never need to randomly access (or only need to randomly access because other parts of the API, like str.find and re.search, make you), so why should you pay any cost, even if it’s only 1/4th the cost you pay in Python 3.8? (Also, for some random-access uses, it really is going to be faster to just decode to UTF-32 and subscript that; why build an index plus decoding when you can just decode?) If you’re already making a radical breaking change, why not get the radical benefits?

Also, consider this: if str is unindexed and non-subscriptable, it’s trivial to build a class IndexedStr(str) whose __new__ builds the index (or copies it if passed an IndexedStr) and that adds __getitem__, while still acting as a str even at the C API level. Whenever you need random access, you construct an IndexedStr; the rest of the time you don’t bother. And you can even create special-purpose variants for special strings (I know this is always ASCII, or I know it’s always under 16 chars…) or specific use cases (I know I’m going to iterate backward, so repeatedly counting forward from index[idx%16] would be hugely wasteful). But if str builds an index, there’s no way to write a class FastStr(str) that skips that, or any of the variants that do it differently.
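In today’s Python, where plain str is of course still subscriptable by code point, the closest I can get to showing the shape of that is a wrapper that pre-computes grapheme-cluster boundaries up front, using the third-party `regex` module; I’ve used cluster()/cluster_count() rather than __getitem__/__len__ only because str already claims those for code points. Treat it purely as a sketch:

    import regex  # third-party module; \X matches an extended grapheme cluster

    class IndexedStr(str):
        """str subclass that pays the indexing cost up front, and only if you ask for it."""

        def __new__(cls, value=""):
            self = super().__new__(cls, value)
            # Linear-time pass, paid once, only by strings that opt in.
            self._clusters = regex.findall(r"\X", self)
            return self

        def cluster(self, i):
            """The i-th extended grapheme cluster, in O(1)."""
            return self._clusters[i]

        def cluster_count(self):
            return len(self._clusters)

    s = IndexedStr("man\N{COMBINING TILDE}ana")
    print(len(s), s.cluster_count())   # 7 code points, 6 clusters
    print(s.cluster(2))                # 'ñ' (as 'n' plus a combining tilde)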

On Oct 26, 2019, at 16:28, Steven D'Aprano <steve@pearwood.info> wrote:
That _could_ change, especially if 3.9 is followed by 3.10 (or has that already been rejected?). But I think almost everyone agrees with Guido, and that’ll probably be true until the memory of 2.7 fades (a few years after Apple stops shipping it and the last Linux distros go out of LTS). I guess your 5000 implies about 16 years off, so… ok. But at that point, it makes as much sense to talk about a hypothetical new Python-like language.
Most of the time, you really don’t need random access to strings—except in the case where you got that integer index back from the find method or a regex match object or something, in which case using Swift-style non-integer indexes, or Rust-style (and Python file object seek/tell) byte offsets, solves the problem just as well. But when you do want it, it’s very likely you don’t want it to take linear time. Providing indexing, but having it be unacceptably slow for anything but small strings, isn’t providing a useful feature, it’s providing a cruel tease. Logarithmic time is probably acceptable, but building that index takes linear time, so now constructing strings becomes slow, which is even worse (especially since it affects even strings you were never going to randomly access).
For novices who only deal with UTF-8, it might mean never having to call encode or decode again. But the real benefit is to enable low-level code (that in turn makes high-level code easier to write). Have you ever written code that mmaps a text file and processes it as text? You either have to treat it as bytes and not do proper Unicode (which works for some string operations—until the first time you get some data where it doesn’t), or implement all the Unicode algorithms yourself (especially fun if what you’re trying to do is, say, a regex search), or put a buffer in front of it and decode on the fly, defeating the whole point of mmap. Have you ever read an HTTP header as bytes to verify that it’s UTF-8 and then tried to switch to using the same socket connection as a text file object rather than binary? It’s doable, but it’s a pain. And the reason all of this is a pain is that when Python (and Java and Ruby and so on) added Unicode support, the idea of assuming most files and protocols and streams are UTF-8 was ridiculous. Making UTF-8 a little easier to deal with by making everything else either slower or harder to deal with was a terrible trade off then. But in 2019—much less in Python 5000-land—that’s no longer true.
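The socket case, for what it’s worth, currently looks something like this; these are real stdlib calls, but it’s only a sketch, with the actual header parsing and error handling waved away:

    import io
    import socket

    def text_body_after_headers(sock: socket.socket):
        # Read the headers as bytes from a buffered binary file over the socket...
        buf = sock.makefile("rb")
        headers = []
        while True:
            line = buf.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            headers.append(line)
        # ...then, having (we assume) confirmed charset=utf-8 in those headers,
        # hand the *same* buffered reader to a text wrapper so the body reads as str.
        return headers, io.TextIOWrapper(buf, encoding="utf-8", newline="")

Doable, as I said, but it’s exactly the kind of plumbing that would disappear if the string type and the socket spoke UTF-8 natively.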
If the UTF-8 object operates on the basis of Unicode code points, then it's just a str, and the implementation is just an implementation detail.
Ideally, it can iterate any of code units (bytes), code points, or grapheme clusters, not just one. Because they’re all useful at different times. But most of the string methods would be in terms of grapheme clusters.
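In other words, something whose surface looks vaguely like this; every name here is invented, the grapheme splitting leans on the third-party `regex` module, and a real implementation obviously would not round-trip through today’s str:

    import regex  # third-party module; \X matches an extended grapheme cluster

    class utf8str:
        """Hypothetical string type stored as UTF-8, exposing three views."""

        def __init__(self, text):
            self._buf = text.encode("utf-8") if isinstance(text, str) else bytes(text)

        def code_units(self):
            # The raw UTF-8 bytes: occasionally useful for low-level work.
            return iter(self._buf)

        def code_points(self):
            # Also occasionally useful.
            return iter(self._buf.decode("utf-8"))

        def graphemes(self):
            # What most of the string methods would be defined in terms of.
            return iter(regex.findall(r"\X", self._buf.decode("utf-8")))

    s = utf8str("e\N{COMBINING ACUTE ACCENT}!")
    print(sum(1 for _ in s.code_units()))   # 4 code units (bytes)
    print(sum(1 for _ in s.code_points()))  # 3 code points
    print(sum(1 for _ in s.graphemes()))    # 2 grapheme clusters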
What’s this about inserting bytes? I’m not suggesting making strings mutable; that’s insane even for 5.0. :) Anyway, it’s just a bytes object with all of the string methods, and that duck types as a string for all third-party string functions and so on, which is a lot different than “just a bytes object”. But a much better way to see it is that it’s a str object that also offers direct access to its UTF-8 bytes. Which you don’t usually need, but it is sometimes useful. And it would be more useful if things like sockets and pipes and so on had UTF-8 modes where they could just send UTF-8 strings, without you having to manually wrap them in a TextIOWrapper with non-default args first. This would require lots of changes to the stdlib and to tons of existing third-party code, to the extent that I’m not sure even “Python 5000” makes it ok, but for a new Python-inspired language, that’s a different story…
We had to decode it from UTF-8 and encode it back. Sure, it gets cached so we don’t have to keep doing that over and over. But leaving it as UTF-8 in the first place means we don’t have to do it at all. Of course this is only true if the source literal or text file or API or network protocol or whatever was encoded in UTF-8. But most of them are. (For the rest, yes, we still have to decode from UTF-16-LE or Shift-JIS or cp1252 or whatever and re-encode as UTF-8—albeit with a minor shortcut for the first example. But that’s no worse than today, and it’s getting less common all the time anyway.)
Since we’re now talking 5000 rather than 4000, this could replace str rather than be in addition to it. And it would also replace many uses of bytes. People would still need bytes when they want a raw buffer of something that isn’t text, and when they want a buffer of something that’s not known to be UTF-8 (like the HTTP example–you start with bytes, then switch to utf8 once you know the encoding is utf8 or stick a stream decoder in front of it if it turns out not to be), but when you want a buffer of encoded text, the string is the buffer.
participants (12):
- Anders Hovmöller
- Andrew Barnert
- Ben Rudiak-Gould
- Chris Angelico
- Christopher Barker
- David Mertz
- Greg Ewing
- Paul Moore
- Random832
- Serhiy Storchaka
- Steve Jorgensen
- Steven D'Aprano