[Python-3000] Displaying strings containing unicode escapes

Thu Apr 17 02:20:52 CEST 2008

I've reordered Guido's words.

Guido van Rossum writes:

 > For those of us with less capable IO devices, setting the error flag
 > for stdout and stderr to backslashreplace is probably the best
 > solution (and it solves more problems than just repr()).

True.  But it doesn't solve the ambiguity problem on capable displays.
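
To make the distinction concrete, here is a minimal sketch of what the
backslashreplace error handler does.  The helper name render() and the
sample string are mine, purely for illustration; the handler only acts
when the target encoding cannot represent a character, so a capable
(here UTF-8) stream passes the look-alike letter through untouched.

```python
import io

def render(text, encoding):
    """Return the bytes a stream with errors="backslashreplace" would emit."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding,
                              errors="backslashreplace", newline="\n")
    stream.write(text)
    stream.flush()
    return raw.getvalue()

sample = "\u0410KMOT"  # Cyrillic capital A followed by four ASCII letters

print(render(sample, "ascii"))   # the Cyrillic letter is escaped
print(render(sample, "utf-8"))   # the Cyrillic letter passes through as-is
```

On the ASCII stream the reader can see that something non-ASCII is
present; on the UTF-8 stream the output looks like plain "AKMOT".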

 > And just like in Python 3000 we're using UTF-8 as the default
 > source encoding and allowing Unicode letters in identifiers, I
 > think we should bite the bullet and allow repr() of a string to
 > pass through all characters that the Unicode standard considers
 > printable.

The problem is that this doesn't display strings and identifier names
unambiguously.  "AKMOT" could be all-ASCII, all-Cyrillic, all-Greek,
or a mixture of ASCII, Cyrillic, and Greek.  Odds are quite good that
there are other scripts that could be mixed in, too.  This kind of
mixing happens all the time in Japanese, where people mix half-width
ASCII and its full-width forms with abandon (especially when altering
digits in dates).  I could easily see a Russian using Cyrillic 'A' to
uppercase an ASCII 'a' in the same way.
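
A small illustration of the ambiguity, using only the standard
unicodedata module.  The helper scripts_of() is a hypothetical name of
mine, and taking the first word of the Unicode character name is a
rough stand-in for a real script lookup, good enough for these letters.

```python
import unicodedata

def scripts_of(text):
    """Rough script label per character: first word of its Unicode name."""
    return [unicodedata.name(ch).split()[0] for ch in text]

latin = "AKMOT"
cyrillic = "\u0410\u041a\u041c\u041e\u0422"
greek = "\u0391\u039a\u039c\u039f\u03a4"
mixed = "A\u041a\u039cO\u0422"   # Latin, Cyrillic, Greek, Latin, Cyrillic

for s in (latin, cyrillic, greek, mixed):
    # ascii() escapes everything non-ASCII, so the difference is visible
    print(ascii(s), scripts_of(s))

# The Japanese case: full-width digits are distinct code points too.
print(scripts_of("2\uff10\uff10\uff18"))
```

All four strings are different as far as Python is concerned, although
in many fonts they are hard or impossible to tell apart on screen.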

How about choosing a standard Python repertoire (based on the Unicode
standard, of course) that specifies which characters get a graphic
repr and which ones get \u-escaped, and adding a post-hook for repr
that is passed the string repr proposes to print out?  This hook would
always be the identity in Python-distributed stuff, of course, but on
the consenting adults principle applications and modules outside of
the stdlib could use it.  Would that be acceptable?
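
Roughly what I have in mind, as a sketch only.  Nothing here is an
existing Python API: repr_with_hook(), identity_hook() and
escape_outside_ascii() are hypothetical names, and the ASCII-only
repertoire used by the stricter hook is just one choice an application
might make.

```python
def identity_hook(text):
    """Default post-hook: leave repr's proposed output alone."""
    return text

def escape_outside_ascii(text):
    """Example application hook: \\u-escape anything beyond printable ASCII."""
    out = []
    for ch in text:
        if " " <= ch <= "~":
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            out.append("\\u%04x" % ord(ch))
        else:
            out.append("\\U%08x" % ord(ch))
    return "".join(out)

def repr_with_hook(obj, post_hook=identity_hook):
    """Hypothetical wrapper: compute repr, then give the hook the last word."""
    return post_hook(repr(obj))

s = "\u0410KMOT"  # Cyrillic capital A, then ASCII
print(repr_with_hook(s))                        # graphic repr, ambiguous
print(repr_with_hook(s, escape_outside_ascii))  # escaped, unambiguous
```

Since repr already doubles any literal backslash in the string, an
escape added by the hook cannot be confused with text that was in the
string to begin with.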

The standard repertoire would grandfather ASCII, I suppose, because
for the foreseeable future most identifiers are going to be ASCII, and
all Python implementations will contain a lot of ASCII identifiers and
strings indefinitely.