[Python-ideas] Support Unicode code point notation
Stephen J. Turnbull
stephen at xemacs.org
Sun Jul 28 11:05:00 CEST 2013
Steven D'Aprano writes:
> On 28/07/13 17:41, Stephen J. Turnbull wrote:
> > > (Sorry, I have forgotten who made that suggestion originally.) That
> > > could be extended to allow multiple space-separated code points:
> > >
> > > \N{U+xxxx U+yyyy U+zzzzz}
> > >
> > > or
> > >
> > > \N{U+xxxx yyyy zzzzz}
> >
> > This is a modal encoding, which has proved to be a really bad idea in
> > its past incarnations. I hope that extension is never added to
> > Python.
>
> Could you elaborate please? What do you mean "modal encoding", and
> what past incarnations are you referring to?
A "modal encoding" is one in which the same combination of code units
(here, ASCII characters) is interpreted differently depending on
arbitrarily distant context. One only has to look at certain web
pages or mail messages to see similar encodings (SGML numeric
character entities, quoted-printable encoding of text using non-Latin
character sets) abused to represent many lines of text. In such
(ab)uses, it's very easy to corrupt the whole stream accidentally by
losing one of the braces or by interpolating text encoded differently.
Sure, it's easy for humans to recognize what's going on, and recover,
when they encounter corrupted text interactively, but this is
obviously not a convention that's intended for interactive human use!
The main past incarnation is the ISO 2022 family of encodings (for
example ISO-2022-JP), which switch character sets with escape sequences.
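To make "modal" concrete, here is a minimal sketch using Python's built-in "iso2022_jp" codec. The sample text and the variable names are only illustrative, and the corruption is simulated by dropping the leading shift sequence. The point is that the payload bytes are unchanged, yet they decode to something entirely different once the earlier context is lost.

```python
# Sketch only: uses Python's built-in "iso2022_jp" codec to show modality.
text = "日本"
encoded = text.encode("iso2022_jp")

# The encoder emits ESC $ B to shift into JIS X 0208 before the payload.
SHIFT_TO_KANJI = b"\x1b$B"
assert encoded.startswith(SHIFT_TO_KANJI)

# Simulate corruption: lose the shift sequence, keep every payload byte.
damaged = encoded[len(SHIFT_TO_KANJI):]

intact = encoded.decode("iso2022_jp")
garbled = damaged.decode("iso2022_jp")

print(repr(intact))   # the original text
print(repr(garbled))  # the same payload bytes, now read as ASCII
```

No decoding error is raised for the damaged stream; it is simply read in the wrong mode, which is what makes this kind of corruption easy to miss in non-interactive use.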
I see no advantage in "readability" of "\N{U+xxxx U+yyyy U+zzzzz}" or
"\N{U+xxxx yyyy zzzzz}" over "\N{U+xxxx}\N{U+yyyy}\N{U+zzzzz}", and
very little space savings. Worst of all, it violates the basic
understanding that "\N{...}" is the name of one character or code point.
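As a rough illustration of the one-escape-one-code-point reading, the sketch below expands the proposed "\N{U+xxxx}" form with a regular expression. Note that this notation is not existing Python syntax; the pattern SINGLE and the helper expand_single are hypothetical names invented for this example, and raw strings are used so that Python itself does not try to interpret the escapes.

```python
import re

# Hypothetical pattern for the proposed single-code-point form \N{U+XXXX}.
SINGLE = re.compile(r"\\N\{U\+([0-9A-Fa-f]{4,6})\}")

def expand_single(text):
    # Replace each well-formed escape with exactly one code point.
    return SINGLE.sub(lambda m: chr(int(m.group(1), 16)), text)

source = r"\N{U+0041}\N{U+03B1}\N{U+1F600}"
expanded = expand_single(source)

# Invariant: the number of escapes equals the number of code points.
escape_count = len(SINGLE.findall(source))

# Simulate damage: the middle escape loses its closing brace.
damaged = r"\N{U+0041}\N{U+03B1\N{U+1F600}"
recovered = expand_single(damaged)

print(escape_count, len(expanded))
print(recovered)
```

With one code point per escape, the broken escape is left unexpanded while its neighbours still decode, so the damage stays local. A space-separated list inside a single pair of braces would not give that guarantee, since everything after the lost brace would depend on it.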