[Python-3000] UPDATED: PEP 3138- String representation in Python 3000

Sat May 24 12:49:34 CEST 2008

I updated PEP 3138 - String representation in Python 3000.
The Python wiki has also been updated. (http://wiki.python.org/moin/Python3kStringRepr)

I would appreciate your comments and help.

-----------------------------------------------

PEP: 3138

Title: String representation in Python 3000
Version: $Revision$
Last-Modified: $Date$
Author: Atsuo Ishimoto <ishimoto--at--gembook.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created:  05-May-2008
Post-History:

Abstract
========

This PEP proposes a new string representation form for Python 3000. In
Python prior to Python 3000, the ``repr()`` built-in function converts
arbitrary objects to printable ASCII strings for debugging and logging.
For Python 3000, a wider range of characters, based on the Unicode
standard, should be considered 'printable'.

Motivation
==========

The current ``repr()`` converts 8-bit strings to ASCII using the following
algorithm.

- Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.

- Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII
  characters(>=0x80) to '\\xXX'.

- Backslash-escape quote characters(apostrophe, ') and add the quote
  character at the beginning and the end.

For Unicode strings, the following additional conversions are done.

- Convert leading surrogate pair characters without trailing character
  (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.

- Convert 16-bit characters(>=0x100) to '\\uXXXX'.

- Convert 21-bit characters(>=0x10000) and surrogate pair characters to
  '\\U00xxxxxx'.

This algorithm converts any string to printable ASCII, and ``repr()`` is
used as a handy and safe way to print strings for debugging or for
logging. Although all non-ASCII characters are escaped, this does not
matter when most of the string's characters are ASCII. But for other
languages, such as Japanese where most characters in a string are not
ASCII, this is very inconvenient. Python 3000 has a lot of nice features
for non-Latin users such as non-ASCII identifiers, so it would be
helpful if Python could also progress in a similar way for printable
output.
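
As an illustration of the inconvenience, the ASCII-only form of a short
Japanese string can be reproduced in a present-day Python 3 interpreter
with the ``ascii()`` built-in, which escapes non-ASCII characters in the
same way as the algorithm above (apart from the ``u`` prefix that Python 2
adds to Unicode strings). The sample string is an arbitrary choice. ::

    sample = "日本語"  # arbitrary sample text, three characters

    # ascii() escapes every non-ASCII character, as the algorithm above does.
    escaped = ascii(sample)
    print(escaped)                    # '\u65e5\u672c\u8a9e'
    print(len(sample), len(escaped))  # the escaped form is much longer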

Some users might be concerned that such output will mess up their
console if they print binary data like images. But this is unlikely to
happen in practice because bytes and strings are different types in
Python 3000, so printing an image to the console won't mess it up.

This issue was once raised by Hye-Shik Chang [1]_, but the proposal was rejected.

Specification
=============

- Add the Python API ``int PY_UNICODE_ISPRINTABLE(Py_UNICODE ch)``.
  ``PY_UNICODE_ISPRINTABLE()`` returns 0 if ``repr()`` should escape the
  Unicode character ``ch``, 1 otherwise. Characters that should be escaped are:

  * Characters defined in the Unicode character database as "Other"(Cc,
    Cf, Cs, Co, Cn).

  * Characters defined in the Unicode character database as "Separator"
    (Zl, Zp, Zs) other than ASCII space(0x20).

- The algorithm to build ``repr()`` strings should be changed to:

  * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.

  * Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'.

  * Convert leading surrogate pair characters without trailing character
    (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.

  * Convert non-printable characters (PY_UNICODE_ISPRINTABLE() returns 0)
    to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.

  * Backslash-escape quote characters (apostrophe, ') and add the quote
    character at the beginning and the end.

- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by
  default.

- Set the Unicode error-handler for sys.stdout in the Python interactive
  session to 'backslashreplace' by default.

- Add ``'%a'`` string format operator. ``'%a'`` converts any Python
  object to a string using ``repr()`` and then hex-escapes all non-ASCII
  characters. The ``'%a'`` operator generates the same string as ``'%r'``
  in Python 2.

- Add ``ascii()`` builtin function. ``ascii()`` converts any Python
  object to a string using ``repr()`` and then hex-escapes all non-ASCII
  characters. ``ascii()`` generates the same string as ``repr()`` in
  Python 2.

- Add ``isprintable()`` method to the string type. ``str.isprintable()``
  returns False if ``repr()`` should escape any character in the string,
  True otherwise. The ``isprintable()`` method calls
  ``PY_UNICODE_ISPRINTABLE()`` internally.
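
The escaping steps listed above can be sketched in pure Python. This is
only an illustration under stated assumptions, not the reference
implementation: ``sketch_isprintable()`` and ``sketch_repr()`` are
hypothetical names, and the category lookup relies on the standard
``unicodedata`` module rather than on the proposed C-level API. ::

    import unicodedata

    def sketch_isprintable(ch):
        """Hypothetical stand-in for PY_UNICODE_ISPRINTABLE(): 1 or 0."""
        if ch == " ":
            return 1  # ASCII space is exempt from escaping
        category = unicodedata.category(ch)
        # "Other" (Cc, Cf, Cs, Co, Cn) and "Separator" (Zl, Zp, Zs)
        # are the categories that should be escaped.
        return 0 if category[0] in "CZ" else 1

    def sketch_repr(s):
        """Hypothetical rendering of the proposed repr() steps for a str."""
        simple = {"\r": "\\r", "\n": "\\n", "\t": "\\t",
                  "\\": "\\\\", "'": "\\'"}
        out = ["'"]
        for ch in s:
            code = ord(ch)
            if ch in simple:
                out.append(simple[ch])
            elif sketch_isprintable(ch):
                out.append(ch)                    # printable: keep as-is
            elif code < 0x100:
                out.append("\\x%02x" % code)
            elif code < 0x10000:
                out.append("\\u%04x" % code)      # includes lone surrogates
            else:
                out.append("\\U%08x" % code)
        out.append("'")
        return "".join(out)

The sketch always quotes with apostrophes, as the steps above describe; an
actual implementation may choose the quote character differently.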

Rationale
=========

The ``repr()`` in Python 3000 should be Unicode-based, not ASCII-based, just
like Python 3000 strings. Also, conversion should not be affected by the
locale setting, because the locale is not necessarily the same as the
output device's locale. For example, it is common for a daemon process
to be invoked in an ASCII setting but to write UTF-8 to its log files.
Also, web applications might want to report the error information in
more readable form based on the HTML page's encoding.

Characters not supported by the user's console are hex-escaped on printing,
by the Unicode encoder's error-handler. If the error-handler of the
output file is 'backslashreplace', such characters are hex-escaped
without raising UnicodeEncodeError. For example, if your default
encoding is ASCII, ``print('Hello ¢')`` will print 'Hello \\xa2'.
If your encoding is ISO-8859-1, 'Hello ¢' will be printed.

For non-interactive sessions, the default error-handler of sys.stdout
should be 'strict'. Other applications reading the output might not
understand hex-escaped characters, so unsupported characters should be
trapped when writing.
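
The effect of the error-handlers discussed above can be tried without
touching the real console by wrapping an in-memory buffer. This is a
minimal sketch; the helper name ``render()`` is hypothetical and the
encodings are the ones used in the example above. ::

    import io

    def render(text, encoding, errors):
        """Return the bytes a stream with this encoding/handler would emit."""
        buffer = io.BytesIO()
        stream = io.TextIOWrapper(buffer, encoding=encoding, errors=errors)
        stream.write(text)
        stream.flush()
        return buffer.getvalue()

    print(render("Hello ¢", "ascii", "backslashreplace"))  # b'Hello \\xa2'
    print(render("Hello ¢", "iso-8859-1", "strict"))       # b'Hello \xa2'

    # With 'strict', the unsupported character is trapped when writing.
    try:
        render("Hello ¢", "ascii", "strict")
    except UnicodeEncodeError as exc:
        print("strict handler raised:", exc.reason)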

Printable characters
--------------------

The Unicode standard doesn't define non-printable characters, so we must
create our own definition. Here we propose to define non-printable
characters as follows.

- Non-printable ASCII characters, as in Python 2.

- Broken surrogate pair characters.

- Characters defined in the Unicode character database as

  * Cc (Other, Control)
  * Cf (Other, Format)
  * Cs (Other, Surrogate)
  * Co (Other, Private Use)
  * Cn (Other, Not Assigned)
  * Zl (Separator, Line) ('\\u2028', LINE SEPARATOR)
  * Zp (Separator, Paragraph) ('\\u2029', PARAGRAPH SEPARATOR)
  * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
    this category should be escaped to avoid ambiguity.
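
The categories listed above can be inspected with the standard
``unicodedata`` module. The sample characters below are arbitrary
picks, one per category, chosen for illustration only. ::

    import unicodedata

    samples = {
        "\x00": "Cc",    # NULL
        "\u200b": "Cf",  # ZERO WIDTH SPACE
        "\ud800": "Cs",  # lone leading surrogate
        "\ue000": "Co",  # first private-use code point
        "\uffff": "Cn",  # noncharacter, never assigned
        "\u2028": "Zl",  # LINE SEPARATOR
        "\u2029": "Zp",  # PARAGRAPH SEPARATOR
        "\u3000": "Zs",  # IDEOGRAPHIC SPACE, easily mistaken for spaces
        " ": "Zs",       # ASCII space: same category, but not escaped
    }

    for ch, expected in samples.items():
        assert unicodedata.category(ch) == expected
        print("U+%04X %s" % (ord(ch), unicodedata.category(ch)))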

Alternate Solutions
-------------------

To help debugging in non-Latin languages without changing ``repr()``,
other suggestions were made.

- Supply a tool to print lists or dicts.

  Strings to be printed for debugging are not only contained by lists or
  dicts, but also in many other types of object. File objects contain a
  file name in Unicode, exception objects contain a message in Unicode,
  etc. These strings should be printed in readable form when repr()ed.
  It is unlikely to be possible to implement a tool to print all
  possible object types.

- Use sys.displayhook and sys.excepthook.

  For interactive sessions, we can write hooks to restore hex-escaped
  characters to the original characters. But these hooks are called only
  to display the result of evaluating an expression entered in an
  interactive Python session, and do not work for the print() function,
  for non-interactive sessions or for logging.debug("%r", ...), etc.

- Subclass sys.stdout and sys.stderr.

  It is difficult to implement a subclass to restore hex-escaped
  characters since there isn't enough information left by the time it's
  a string to undo the escaping correctly in all cases. For example,
  ``print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
  a file object has no way to tell the two cases apart.

- Make the encoding used by ``unicode_repr()`` adjustable, and make the
  current ``repr()`` the default.

  With an adjustable ``repr()``, the result of ``repr()`` is unpredictable,
  which would make it impossible to write correct code involving
  ``repr()``. And if the current ``repr()`` is the default, then the old
  convention remains intact and users may expect ASCII strings as the
  result of ``repr()``. Third-party applications or libraries could choke
  when a custom ``repr()`` function is used.
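
The ambiguity that defeats a ``sys.stdout`` or ``sys.stderr`` subclass
can be seen in a small experiment. The helper ``naive_unescape()`` is a
hypothetical stand-in for what such a subclass would have to do; it uses
the standard ``unicode_escape`` codec and assumes ASCII-only input. ::

    def naive_unescape(text):
        """Hypothetical filter: turn backslash escapes back into characters."""
        return text.encode("ascii").decode("unicode_escape")

    # Six characters the program really wants printed: \ u 0 0 4 1
    literal = "\\" + "u0041"

    # Six characters produced by hex-escaping one non-ASCII character.
    escaped = ascii("\u65e5")[1:-1]

    print(naive_unescape(escaped) == "\u65e5")  # True: correctly restored
    print(naive_unescape(literal))              # A: wrongly "restored"

Both inputs look alike once they are plain text, so the filter cannot
treat one differently from the other.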

Backwards Compatibility
=======================

Changing ``repr()`` may break some existing code, especially testing
code. Five of Python's regression tests fail with this modification. If
you need ``repr()`` strings without non-ASCII characters, as in Python 2,
you can use the following function. ::

    def repr_ascii(obj):
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

For logging or for debugging, the following code can raise UnicodeEncodeError. ::

    log = open("logfile", "w")
    log.write(repr(data))     # UnicodeEncodeError will be raised
                              # if data contains unsupported characters.

To avoid the exception, you can specify the error-handler explicitly. ::

    log = open("logfile", "w", errors="backslashreplace")
    log.write(repr(data))  # Unsupported characters will be escaped.

For a console with a Unicode-based encoding, for example en_US.utf8 or
de_DE.utf8, the backslashreplace trick doesn't work and no printable
characters are escaped. This causes a problem with similar-looking
characters in Western, Greek and Cyrillic languages. These languages use
similar (but different) alphabets (descended from a common ancestor) and
contain letters that look similar but have different character codes.
For example, it is hard to distinguish Latin 'a', 'e' and 'o' from
Cyrillic 'а', 'е' and 'о'. (The visual representation, of course, very
much depends on the fonts used, but usually these letters are almost
indistinguishable.) To avoid the problem, users can adjust the terminal
encoding to get a result suitable for their environment, or use
``repr_ascii()`` described above.
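
A short sketch of the look-alike problem, reusing the ``repr_ascii()``
function given earlier in this section. The character names come from
the standard ``unicodedata`` module; the choice of letters follows the
example above. ::

    import unicodedata

    def repr_ascii(obj):
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

    latin = "aeo"
    cyrillic = "\u0430\u0435\u043e"  # Cyrillic look-alikes of a, e, o

    print(latin == cyrillic)         # False, despite the similar glyphs
    print(repr_ascii(latin))         # 'aeo'
    print(repr_ascii(cyrillic))      # '\u0430\u0435\u043e'
    for ch in cyrillic:
        print(unicodedata.name(ch))  # e.g. CYRILLIC SMALL LETTER A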

Open Issues
===========

- Is the ``ascii()`` function necessary, or is documentation enough? If
  it is necessary, should ``ascii()`` belong to the builtin namespace?

Rejected Proposals
==================

- Add encoding and errors arguments to the builtin print() function,
  with defaults of sys.getfilesystemencoding() and 'backslashreplace'.

  This is complicated to implement and, in general, does not seem to be a
  good idea. [2]_

- Use character names to escape characters, instead of hex character
  codes. For example, ``repr('\u03b1')`` can be converted to
  ``"\N{GREEK SMALL LETTER ALPHA}"``.

  Using character names is verbose compared to hex-escapes. E.g.,
  ``repr("\ufbf9")`` would be converted to ``"\N{ARABIC LIGATURE UIGHUR
  KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``.
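
The difference in length between the two escape styles can be measured
with the standard ``unicodedata`` module. The helper ``name_escape()``
is a hypothetical illustration of the rejected scheme, not an existing
function. ::

    import unicodedata

    def name_escape(ch):
        """Hypothetical name-based escape of a single character."""
        return "\\N{%s}" % unicodedata.name(ch)

    for ch in ("\u03b1", "\ufbf9"):
        by_hex = "\\u%04x" % ord(ch)
        by_name = name_escape(ch)
        # Hex escapes are always six characters; names vary widely.
        print(by_hex, len(by_hex), len(by_name))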

Reference Implementation
========================

http://bugs.python.org/issue2630

References
==========

.. [1] Multibyte string on string\::string_print
        (http://bugs.python.org/issue479898)

.. [2] [Python-3000] Displaying strings containing unicode escapes
        (http://mail.python.org/pipermail/python-3000/2008-April/013366.html)

Copyright
=========

This document has been placed in the public domain.