On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas wrote:
File "/home/rosuav/tmp/demo.py", line 1 print("Hello, world!') ^ SyntaxError: EOL while scanning string literal
So if those 12 glyphs take 14 code units
*scratches head* I'm not really sure how glyphs (the graphical representation of a character) come into this discussion, but for what it's worth, I count 22, not 12 (excluding the leading spaces). I'm not really sure how you get "14 code units" either: whatever internal representation you use (ASCII, Latin-1, UTF-8, UTF-16, UTF-32), the string will be one code unit per character, whether we are counting code points, characters or glyphs, since it is all ASCII. I don't know of any encoding where ASCII characters require more than one code unit.
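Spelling that out at the interactive prompt (a quick sanity check, nothing more):

    s = 'print("Hello, world!\')'                # the offending 22 characters
    assert len(s) == 22                          # 22 code points
    assert len(s.encode('ascii')) == 22          # 22 one-byte code units
    assert len(s.encode('utf-8')) == 22          # UTF-8: still one unit each
    assert len(s.encode('utf-16-le')) == 22 * 2  # 22 two-byte code units
    assert len(s.encode('utf-32-le')) == 22 * 4  # 22 four-byte code units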
because you’re using Stephen’s string and it’s in NFKD, getting 14 and then indenting two spaces too many (as Python does today)
You mean something like this?

py> value = äë +* 42
  File "<stdin>", line 1
    value = äë +* 42
                  ^
SyntaxError: invalid syntax

(the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')

Yes, that looks like a bug to me, but a super low priority one to fix. (This is assuming that the Python interpreter promises to line the caret up with the offending symbol "always", rather than just making a best effort to do so.)

And probably tough to fix too: I think you need to count in grapheme clusters, not code points, but even that won't fix the issue, since it leaves you open to the *opposite* problem of undercounting if the terminal or IDE fails to display combining characters properly:

value = a¨e¨ +* 42
            ^
SyntaxError: invalid syntax

I had to fake the above, because I couldn't find a terminal on my system which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.

It's not just a problem with combining characters. If I recall correctly, the Unicode standard doesn't require variant selectors to be displayed as a single glyph, so you might not know how wide a grapheme cluster is unless you know the capabilities of the application displaying it.

Handling text in its full generality (combining characters, emoji, flags, East Asian wide characters, and so on) is really tough to do right. For the Python interpreter, it would require a huge amount of extra work for barely any payoff, since 99.9% of Python syntax errors are not going to include any of the funny cases.

As I think I said earlier, if Python had an API that understood grapheme clusters, I would probably use it in preference to the code point API for most text handling code. But let's not make the perfect the enemy of the good: if you have a line of source code which contains flags, East Asian wide characters, combining accents, emoji selectors and so on, and the caret doesn't quite line up in the right place, oh well, que sera sera.

[...]
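For what it's worth, the overshoot is easy to reproduce and to measure with nothing but the stdlib; here's a minimal sketch. The combining() test is just a crude stand-in for real grapheme segmentation (UAX #29), not a proposed fix, and it still assumes the terminal renders each remaining code point one cell wide:

    import unicodedata

    line = 'value = a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS} +* 42'
    col = line.index('*')        # 14 code points precede the '*'

    # Naive caret placement counts code points, so on a terminal that
    # stacks the combining marks the caret lands two cells too far right:
    print(line)
    print(' ' * col + '^')

    # Skipping zero-width combining marks lines the caret up again:
    cells = sum(0 if unicodedata.combining(c) else 1 for c in line[:col])
    print(line)
    print(' ' * cells + '^')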
Well, either that, or we need to make it so that " "*
results in the correct number of spaces to indent it to that position. That ought to bring in plenty of pitchforks...

Would you still bring pitchforks for " " * StrIndex(chars=12, points=14, bytes=22)?
Hell yes.

If I need 12 spaces, why should I be forced to work out how many bytes the interpreter uses for that? Why should I care? I want 12 spaces; I don't care if that's 12 bytes or 24 or 48 or 7. I might not even know what Python's internal encoding is. Many people don't.

To say nothing of the obnoxiousness of forcing me to write 39 characters "StrIndex(...)" when two would do. What if I get it wrong, and think that 12 characters is 6 points and 42 bytes when it's actually 8 points and 46 bytes?

Working in code points is not perfect, but "code point == character" is still an acceptable approximation for most uses of Python. And as Unicode continues to gather more momentum, eventually we'll need more powerful, more abstract, but also more complicated APIs. But forcing the coder to go from the things they work with ("I want 12 smiley faces") to the internal implementation is a recipe for chaos: "Each smiley is two code points, a base and a variant selector, but they're astral characters so I have to double it, so that's 48 code points, and each code point is four bytes, so that's 188 bytes... wait, do I include the BOM or not?"
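For what it's worth, none of the numbers in that monologue survive contact with the interpreter, which is exactly the point. Here is what the arithmetic actually comes to (the variation selector is a BMP character, so UTF-16 does not double it):

    smiley = '\U0001F600\uFE0F'        # base emoji + variation selector
    text = smiley * 12                 # "12 smiley faces"

    print(len(text))                           # 24 code points
    print(len(text.encode('utf-16-le')) // 2)  # 36 UTF-16 code units
    print(len(text.encode('utf-8')))           # 84 bytes as UTF-8
    print(len(text.encode('utf-32-le')))       # 96 bytes as UTF-32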
This is all simple stuff; I don’t get the incredulity that it could possibly be done. (Especially given that there are other languages that do exactly the same thing, like Swift, which ought to be proof that it’s not impossible.)
Can you link to an explanation of what Swift *actually* does, in detail?
(Could it be done without breaking a whole ton of existing code? I strongly doubt it.)
Of course it can be: we leave the code-point string API alone, as it is now, and build a second API based on grapheme clusters, emoji variants etc. to run in parallel. This allows people to transition from one to the other if and when they need to, rather than forcing them to pay the cost of working in graphemes whether they need it or not.
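For concreteness, such a parallel API could start as small as the sketch below, built here on the third-party regex module (whose \X pattern matches extended grapheme clusters); the graphemes() name is mine, not a proposal for the stdlib:

    import regex  # third-party: pip install regex

    def graphemes(s):
        """Return the extended grapheme clusters of s as a list."""
        return regex.findall(r'\X', s)

    ident = 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}'
    print(len(ident))              # 4 code points
    print(len(graphemes(ident)))   # 2 grapheme clusters

--
Steven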