[Python-3000] Displaying strings containing unicode escapes

Stephen J. Turnbull stephen at xemacs.org
Wed Apr 30 07:39:34 CEST 2008


Jim Jewett writes:
 > I think "standard repertoire based on Unicode" may be confusing the issue.

By "standard repertoire" I mean that all Pythons will show the same
characters the same way, while "based on Unicode" is intended to mean
looking at TR#36 and TR#39 in picking the repertoires.

 > As I understand it, you're saying something like
 > 
 >     For strings, repr will delegate to display_string.

Er, I'm not familiar with such a function....  What I have in mind is
that for string display, repr will have a large, standard set of
characters that it sends directly to output, and a set that it
\u-escapes to avoid ambiguity.  These sets would be defined the same
way in every Python.
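
As a rough sketch of the idea (the membership test below uses
"assigned and not a control character" as a stand-in for the real
standard repertoire, which would come out of the TR#36/TR#39 data,
and escape_ambiguous is an invented name, not a proposed API):

    import unicodedata

    def escape_ambiguous(s):
        out = []
        for ch in s:
            if not unicodedata.category(ch).startswith('C'):
                out.append(ch)                  # stand-in repertoire: pass through
            elif ord(ch) <= 0xFFFF:
                out.append('\\u%04x' % ord(ch)) # everything else gets \u-escaped
            else:
                out.append('\\U%08x' % ord(ch)) # ...or \U beyond the BMP
        return ''.join(out)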

For people for whom the standard display would be painful (e.g.,
Cyrillic and Greek users), there would be an optional post-processor
(basically a codec) that would translate some \u-escapes back to
characters, and would also translate the conflicting characters
(i.e., ASCII letters in the case of Cyrillic and Greek) to
\u-escapes.
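
To make that concrete, here is a sketch of what such a post-processor
might look like for a Cyrillic user (the single U+0400..U+04FF block
stands in for "the user's script", bare Latin letters stand in for
"the conflicting characters", and cyrillic_display is an invented
name):

    import re

    CYRILLIC = range(0x0400, 0x0500)

    def cyrillic_display(s):
        # One pass over the escaped output: \uXXXX escapes that name
        # Cyrillic characters become the characters themselves, bare
        # Latin letters become escapes, everything else is untouched.
        def fix(m):
            if m.group(1):                         # a \uXXXX escape
                cp = int(m.group(1), 16)
                return chr(cp) if cp in CYRILLIC else m.group(0)
            return '\\u%04x' % ord(m.group(0))     # a bare Latin letter
        return re.sub(r'\\u([0-9a-fA-F]{4})|[A-Za-z]', fix, s)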

 >     Users can (and should) supply a display_string function
 > appropriate to their own system.

"Can", yes, but only on a "consenting adults" basis.  They should not
do so in most cases.

 >     The default display_string will display ASCII, and unicode-escape
 > everything else.

Definitely not.  The default should try to display anything that can
be displayed unambiguously.  If we don't do that, *nobody* but us
semi-lingual Americans will use the default, and there will be no
point in having a standard repertoire.
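
To put the contrast concretely (the output below is just an
illustration of the two rules, not taken from any particular build):

    # A short Russian greeting under an ASCII-only default:
    '\u041f\u0440\u0438\u0432\u0435\u0442'
    # ...and under a repertoire-based default:
    'Привет'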

For practical purposes, the only scripts I know of where there will be
real problems are Cyrillic and Greek, because they share glyphs with
the Latin alphabet, and by default many of their characters would be
escaped.  I'm sure there are other such scripts, of course; I don't
mean to minimize the problem.  (Some Japanese will undoubtedly
complain about their full-width "ASCII", but I have no sympathy for
that particular self-inflicted injury: those characters are already
deprecated in Unicode as compatibility characters.)

On the other hand, Unicode was careful to assemble a unified set of
Latin characters.  Although some, like the Angstrom symbol, do have
compatibility encodings, I don't think that's a major worry.  The vast
majority of Asian characters (loosely defined, including not only the
Han ideographs but also the radicals, Korean Hangul, the Japanese and
Chinese syllabaries, etc.) are going to be readable, too (for those
with appropriate fonts).


