[Python-ideas] Support Unicode code point notation

Stephen J. Turnbull stephen at xemacs.org
Sun Jul 28 14:00:39 CEST 2013


Nick Coghlan writes:

 > It doesn't bother me that much personally, especially if it was a
 > general comma delimited capability that also worked with other code
 > point names,

I think it should bother you, though.  It's not a problem for Python
core developers, it's true.  Similarly, ISO 2022 was a great idea in
theory, and works fine for communication of text over streams.  The
problem is when you want to embed that stream in some higher-level
protocol.  So, for example, the original space-separated syntax breaks
one-argument split-string, while your comma-separated version breaks
CSV.  You could fix both of those by using no separator and simply
finishing the current code point on encountering "U+" or "}", but
I doubt anybody would find that variant appealing.
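
To make the separator problem concrete, here is a minimal sketch. Both
literal forms below are hypothetical (neither is real Python syntax), and
Python's no-argument str.split() and an unquoted CSV row are only stand-ins
for the "higher-level protocol" that has to carry the text.

```python
import csv
import io

# Hypothetical multi-code-point literals; neither form is real Python syntax.
# Raw strings keep the backslash so the text is carried around unevaluated.
space_form = r"\N{U+0048 U+0069}"
comma_form = r"\N{U+0048,U+0069}"

# A whitespace-delimited record that is meant to have two fields.
ws_fields = f"greeting {space_form}".split()

# A naive, unquoted CSV row that is meant to have two fields.
csv_fields = next(csv.reader(io.StringIO(f"greeting,{comma_form}\n")))

# The comma form survives whitespace splitting, so each separator
# only moves the breakage to a different protocol.
ws_ok = f"greeting {comma_form}".split()

print(ws_fields)   # the literal is cut at the space
print(csv_fields)  # the literal is cut at the comma
print(ws_ok)       # the literal stays in one piece
```

Quoting the CSV field would of course rescue the second case, but that is
exactly the kind of extra care the embedding protocol now has to take.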

Now, for program literals this isn't going to matter because a string
will be converted to internal representation by the compiler, and the
program never sees that syntax.  But what about applications like web
frameworks which often eval client-supplied strings?  I hope we are
not going to recommend they eval them before validating them!<wink/>

 > but my inclination is to call YAGNI on the additional complexity.

"Using 'complexity' to refer to this syntax isn't really valid though
- what it is, is 'complicated'."<wink/>

 > Using "modal encoding" to refer to that change isn't really valid
 > though

No, it's quite correct, at least in ISO-land.  There, a modal encoding
is one which must maintain state across *code points*.  The single-
code-point "\N" syntax needs to maintain state across *code units*,
but when it's done with a code *point*, it's done - there's no state
to worry about before starting to parse the next one.  By your
definition, UTF-8 is modal, but that doesn't seem a very useful
categorization to me.
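
A small sketch of the distinction, using CPython's "iso2022_jp" and "utf-8"
codecs. The slicing below assumes the codec brackets the character with a
three-byte escape sequence on each side (ESC $ B before, ESC ( B after);
that layout is an assumption about the encoder's output, not something the
discussion above specifies.

```python
kana = "\u3042"  # HIRAGANA LETTER A

# ISO-2022-JP: the two payload bytes only mean HIRAGANA LETTER A while the
# decoder is in the state selected by the preceding escape sequence.
iso = kana.encode("iso2022_jp")
payload = iso[3:-3]                       # assumed: strip ESC $ B and ESC ( B
iso_alone = payload.decode("iso2022_jp")  # decoded in the initial state

# UTF-8: state is needed only among the code units of one code point.
# Each code point's bytes decode correctly with no memory of the others.
sample = "a" + kana
utf8_parts = [ch.encode("utf-8") for ch in sample]
utf8_alone = "".join(part.decode("utf-8") for part in utf8_parts)

print(iso.decode("iso2022_jp") == kana)  # whole stream round-trips
print(iso_alone == kana)                 # payload alone does not
print(utf8_alone == sample)              # per-code-point decoding is fine
```

The payload bytes read as ordinary ASCII once the shift state is lost, which
is the cross-code-point state meant by "modal" here; the UTF-8 decoder has
nothing comparable to lose.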

