On Sun, Oct 27, 2019 at 10:07:41AM -0700, Andrew Barnert via Python-ideas wrote:
File "/home/rosuav/tmp/demo.py", line 1 print("Hello, world!') ^ SyntaxError: EOL while scanning string literal
So if those 12 glyphs take 14 code units
*scratches head* I'm not really sure how glyphs (the graphical representation of a character) come into this discussion, but for what it's worth, I count 22, not 12 (excluding the leading spaces). I'm not really sure how you get "14 code units" either: whatever internal representation you use (ASCII, Latin-1, UTF-8, UTF-16, UTF-32), the string will be one code unit per character, whether we are counting code points, characters or glyphs, since it is all ASCII. I don't know of any encoding where ASCII characters require more than one code unit.
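Spelling that out at the interactive prompt (a quick sanity check, nothing more):

    s = 'print("Hello, world!\')'                # the offending 22 characters
    assert len(s) == 22                          # 22 code points
    assert len(s.encode('ascii')) == 22          # 22 one-byte code units
    assert len(s.encode('utf-8')) == 22          # UTF-8: still one unit each
    assert len(s.encode('utf-16-le')) == 22 * 2  # 22 two-byte code units
    assert len(s.encode('utf-32-le')) == 22 * 4  # 22 four-byte code units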
because you’re using Stephen’s string and it’s in NFKD, getting 14 and then indenting two spaces too many (as Python does today)
You mean something like this?

py> value = äë +* 42
  File "<stdin>", line 1
    value = äë +* 42
                  ^
SyntaxError: invalid syntax

(the identifier is 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}')

Yes, that looks like a bug to me, but a super low priority one to fix. (This is assuming that the Python interpreter promises to line the caret up with the offending symbol "always", rather than just making a best effort to do so.)

And probably tough to fix too: I think you need to count in grapheme clusters, not code points, but even that won't fix the issue, since it leaves you open to the *opposite* problem of undercounting if the terminal or IDE fails to display combining characters properly:

value = a¨e¨ +* 42
            ^
SyntaxError: invalid syntax

I had to fake the above, because I couldn't find a terminal on my system which would misdisplay COMBINING DIAERESIS, but I've seen editors do it.

It's not just a problem with combining characters. If I recall correctly, the Unicode standard doesn't require variant selectors to be displayed as a single glyph, so you might not know how wide a grapheme cluster is unless you know the capabilities of the application displaying it.

Handling text in its full generality (combining characters, emoji, flags, East Asian wide characters, and so on) is really tough to do right. For the Python interpreter, it would require a huge amount of extra work for barely any payoff, since 99.9% of Python syntax errors are not going to include any of the funny cases.

As I think I said earlier, if Python had an API that understood grapheme clusters, I would probably use it in preference to the code point API for most text handling code. But let's not make the perfect the enemy of the good: if you have a line of source code which contains flags, East Asian wide characters, combining accents, emoji selectors and so on, and the caret doesn't quite line up in the right place, oh well, que sera sera.

[...]
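For what it's worth, the overshoot is easy to reproduce and to measure with nothing but the stdlib; here's a minimal sketch. The combining() test is just a crude stand-in for real grapheme segmentation (UAX #29), not a proposed fix, and it still assumes the terminal renders each remaining code point one cell wide:

    import unicodedata

    line = 'value = a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS} +* 42'
    col = line.index('*')        # 14 code points precede the '*'

    # Naive caret placement counts code points, so on a terminal that
    # stacks the combining marks the caret lands two cells too far right:
    print(line)
    print(' ' * col + '^')

    # Skipping zero-width combining marks lines the caret up again:
    cells = sum(0 if unicodedata.combining(c) else 1 for c in line[:col])
    print(line)
    print(' ' * cells + '^')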
Well, either that, or we need to make it so that " "*
results in the correct number of spaces to indent it to that position. That ought to bring in plenty of pitchforks...

Would you still bring pitchforks for " " * StrIndex(chars=12, points=14, bytes=22)?
Hell yes.

If I need 12 spaces, why should I be forced to work out how many bytes the interpreter uses for that? Why should I care? I want 12 spaces; I don't care if that's 12 bytes or 24 or 48 or 7. I might not even know what Python's internal encoding is. Many people don't.

To say nothing of the obnoxiousness of forcing me to write 39 characters "StrIndex(...)" when two would do. What if I get it wrong, and think that 12 characters is 6 points and 42 bytes when it's actually 8 points and 46 bytes?

Working in code points is not perfect, but "code point == character" is still an acceptable approximation for most uses of Python. And as Unicode continues to gather more momentum, eventually we'll need more powerful, more abstract, but also more complicated APIs. But forcing the coder to go from the things they work with ("I want 12 smiley faces") to the internal implementation is a recipe for chaos: "Each smiley is two code points, a base and a variant selector, but they're astral characters so I have to double it, so that's 48 code points, and each code point is four bytes, so that's 188 bytes... wait, do I include the BOM or not?"
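For what it's worth, none of the numbers in that monologue survive contact with the interpreter, which is exactly the point. Here is what the arithmetic actually comes to (the variation selector is a BMP character, so UTF-16 does not double it):

    smiley = '\U0001F600\uFE0F'        # base emoji + variation selector
    text = smiley * 12                 # "12 smiley faces"

    print(len(text))                           # 24 code points
    print(len(text.encode('utf-16-le')) // 2)  # 36 UTF-16 code units
    print(len(text.encode('utf-8')))           # 84 bytes as UTF-8
    print(len(text.encode('utf-32-le')))       # 96 bytes as UTF-32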
This is all simple stuff; I don’t get the incredulity that it could possibly be done. (Especially given that there are other languages that do exactly the same thing, like Swift, which ought to be proof that it’s not impossible.)
Can you link to an explanation of what Swift *actually* does, in detail?
(Could it be done without breaking a whole ton of existing code? I strongly doubt it.)
Of course it can be: we leave the code-point string API alone, as it is now, and build a second API based on grapheme clusters, emoji variants etc. to run in parallel. This allows people to transition from one to the other if and when they need to, rather than forcing them to pay the cost of working in graphemes whether they need it or not.
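For concreteness, such a parallel API could start as small as the sketch below, built here on the third-party regex module (whose \X pattern matches extended grapheme clusters); the graphemes() name is mine, not a proposal for the stdlib:

    import regex  # third-party: pip install regex

    def graphemes(s):
        """Return the extended grapheme clusters of s as a list."""
        return regex.findall(r'\X', s)

    ident = 'a\N{COMBINING DIAERESIS}e\N{COMBINING DIAERESIS}'
    print(len(ident))              # 4 code points
    print(len(graphemes(ident)))   # 2 grapheme clusters

--
Steven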