[Python-ideas] Support Unicode code point notation
Stephen J. Turnbull
stephen at xemacs.org
Sun Jul 28 14:00:39 CEST 2013
Nick Coghlan writes:
> It doesn't bother me that much personally, especially if it was a
> general comma delimited capability that also worked with other code
> point names,
I think it should bother you, though. True, it's not a problem for
Python core developers. But ISO 2022 was likewise a great idea in
theory, and it works fine for communication of text over streams. The
problem is when you want to embed that stream in some higher-level
protocol. So, for example, the original space-separated syntax breaks
one-argument split-string, while your comma-separated version breaks
CSV. You could fix both of those by using no separator and simply
finishing the current code point on encountering "U+" or "}", but
I doubt anybody would find that variant appealing.
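
To make the breakage concrete, here is a rough sketch. The escape spellings
`\N{U+0041 U+0301}` and `\N{U+0041,U+0301}` are only my assumption about what
the two proposals would look like, and Python's `str.split()` stands in for
one-argument split-string. Raw strings are used so the escapes stay as plain
text, the way a higher-level protocol would see them.

```python
import csv
import io

# Assumed spellings of the proposed escape; raw strings keep them as plain text.
space_form = r"\N{U+0041 U+0301}"
comma_form = r"\N{U+0041,U+0301}"

# Whitespace splitting cuts the space-separated escape in two.
fields = ("name " + space_form).split()

# A CSV reader does the same to the comma-separated escape.
row = next(csv.reader(io.StringIO("name," + comma_form)))

print(fields)
print(row)
```

In both cases a record that was meant to have two fields comes back with
three, and neither of the last two is a well-formed escape by itself.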
Now, for program literals this isn't going to matter because a string
will be converted to internal representation by the compiler, and the
program never sees that syntax. But what about applications like web
frameworks which often eval client-supplied strings? I hope we are
not going to recommend they eval them before validating them!<wink/>
> but my inclination is to call YAGNI on the additional complexity.
"Using 'complexity' to refer to this syntax isn't really valid though
- what it is, is 'complicated'."<wink/>
> Using "modal encoding" to refer to that change isn't really valid
> though
No, it's quite correct, at least in ISO-land. There, a modal encoding
is one which must maintain state across *code points*. The single-
code-point "\N" syntax needs to maintain state across *code units*,
but when it's done with a code *point*, it's done - there's no state
to worry about before starting to parse the next one. By your
definition, UTF-8 is modal, but that doesn't seem a very useful
categorization to me.
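
A small sketch of the distinction, using Python's bundled "utf-8" and
"iso2022_jp" codecs purely as illustrations (the sample characters are my
own choice). In UTF-8 the bytes of one code point decode the same way no
matter what preceded them. In ISO-2022-JP the same two bytes mean different
things depending on an escape sequence seen earlier in the stream.

```python
# UTF-8: each code point's bytes decode the same way in isolation.
text = "a\u00e9\u3042"
utf8_roundtrip = [ch.encode("utf-8").decode("utf-8") for ch in text]

# ISO-2022-JP: the bytes b'$"' depend on the mode set by an earlier escape.
# With no escape sequence the stream is in ASCII mode.
in_ascii_mode = b'$"'.decode("iso2022_jp")
# ESC $ B switches to JIS X 0208; ESC ( B switches back to ASCII.
in_jis_mode = b'\x1b$B$"\x1b(B'.decode("iso2022_jp")

print(utf8_roundtrip == list(text))
print(repr(in_ascii_mode), repr(in_jis_mode))
```

The state in the second case has to survive across code points, which is
the sense of "modal" I mean above.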