Re: [Python-ideas] Support Unicode code point notation

28 Jul 2013


Nick Coghlan writes:

...
> It doesn't bother me that much personally, especially if it was a
> general comma delimited capability that also worked with other code
> point names,

I think it should bother you, though.  It's not a problem for Python
core developers, it's true.  Similarly, ISO 2022 was a great idea in
theory, and works fine for communication of text over streams.  The
problem is when you want to embed that stream in some higher-level
protocol.  So, for example, the original space-separated syntax breaks
one-argument split-string, while your comma-separated version breaks
CSV.  You could fix both of those by using no separator and simply
finishing the current code point on encountering "U+" or "}", but
I doubt anybody would find that variant appealing.
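
A minimal sketch of the two breakages, in Python. The exact spelling of the
multi-code-point escape (`\N{U+0065 U+0301}` and `\N{U+0065,U+0301}`) is an
assumption made for illustration, and `str.split()` with no separator stands
in for one-argument split-string.

```python
import csv
import io

# Hypothetical escaped text; the spelling of the proposed syntax is assumed.
space_form = r"caf\N{U+0065 U+0301} au lait"
comma_form = r"caf\N{U+0065,U+0301},au lait"

# Whitespace splitting cuts the space-separated escape in two.
space_tokens = space_form.split()
print(space_tokens)

# A CSV reader treats the comma inside the escape as a field separator,
# so a two-field record comes back as three fields.
comma_fields = next(csv.reader(io.StringIO(comma_form)))
print(comma_fields)
```

In both cases the higher-level tokenizer has no way to know that the
separator sits inside an escape unless it learns the escape syntax itself.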

Now, for program literals this isn't going to matter because a string
will be converted to internal representation by the compiler, and the
program never sees that syntax.  But what about applications like web
frameworks which often eval client-supplied strings?  I hope we are
not going to recommend they eval them before validating them!<wink/>
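
A rough illustration of why the validator has to understand the escape
syntax, using the existing single-code-point `\N{name}` form. The filter
(`looks_safe`) and the input string are hypothetical, and `ast.literal_eval`
stands in for whatever evaluation step an application might use.

```python
import ast

# Hypothetical client-supplied text: a Python string literal in source form.
client_text = r'"\N{LESS-THAN SIGN}script>"'

def looks_safe(text):
    # Naive validator (an assumption for this sketch): reject a raw "<".
    return "<" not in text

# The validator runs on the unevaluated text, so it sees the escape syntax.
passed_validation = looks_safe(client_text)

# Evaluation turns the escape into the character the validator was hunting.
evaluated = ast.literal_eval(client_text)
print(passed_validation, evaluated)
```

Any richer escape syntax has to be parsed correctly by every such validator,
not only by the compiler.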
...
> but my inclination is to call YAGNI on the additional complexity.

"Using 'complexity' to refer to this syntax isn't really valid though
- what it is, is 'complicated'."<wink/>
...
> Using "modal encoding" to refer to that change isn't really valid
> though

No, it's quite correct, at least in ISO-land.  There, a modal encoding
is one which must maintain state across *code points*.  The single-
code-point "\N" syntax needs to maintain state across *code units*,
but when it's done with a code *point*, it's done - there's no state
to worry about before starting to parse the next one.  By your
definition, UTF-8 is modal, but that doesn't seem a very useful
categorization to me.
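
The UTF-8 case can be sketched with Python's incremental decoder, feeding it
one code unit (byte) at a time. Treating the decoder's pending-byte buffer
as "the state" is an interpretation made for this sketch, not part of the
ISO definition.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
trace = []

# "é!" encodes to three code units: 0xC3 0xA9 0x21.
for unit in "é!".encode("utf-8"):
    text = decoder.decode(bytes([unit]))
    pending = decoder.getstate()[0]  # bytes buffered mid-code-point
    trace.append((text, pending))

for text, pending in trace:
    print(repr(text), pending)
```

The buffer is non-empty only in the middle of a code point. Once a code
point has been emitted nothing is carried forward, which is the sense in
which the state spans code units but not code points.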

Stephen J. Turnbull