[Python-3000] PEP 3138 - String representation in Python 3000

Tue May 6 08:09:17 CEST 2008

Hello! Well done! Thank you!

On Tue, May 06, 2008 at 09:56:24AM +0900, Atsuo Ishimoto wrote:
> I've written a PEP for new string representation in Python 3000.
> 
> Patch is updated at http://bugs.python.org/issue2630, and Guido
> updated a patch to Rietveld:
> http://codereview.appspot.com/767 .
> 
> I would appreciate your comments and help.
> 
> -----------------------------------------------
> PEP: 3138
> Title: String representation in Python 3000
> Version: $Revision$
> Last-Modified: $Date$
> Author: Atsuo Ishimoto <ishimoto--at--gembook.org>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 05-May-2008
> Post-History:
> 
> Abstract
> ========
> 
> This PEP proposes new string representation form for Python 3000. In
> Python prior to Python 3000, the repr() built-in function converts
> arbitrary objects to printable ASCII strings for debugging and logging.
> For Python 3000, a wider range of characters, based on the Unicode
> standard, should be considered 'printable'.
> 
> 
> Motivation
> ==========
> 
> The current repr() converts 8-bit strings to ASCII using following
> algorithm.
> 
> - Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
> 
> - Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII
>   characters(>=0x80) to '\\xXX'.
> 
> - Backslash-escape quote characters(' or ")

   Currently Python doesn't escape double-quote ("), it only escapes
apostrophe (').

> and add quote character at

   "and add the quote character (apostrophe, ') at"

>   head and tail.

   I think they are "the beginning" and "the end" of the string.

> For Unicode strings, the following additional conversions are done.
> 
> - Convert leading surrogate pair characters without trailing character
>   (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
> 
> - Convert 16-bit characters(>=0x100) to '\\uXXXX'.
> 
> - Convert 21-bit characters(>=0x10000) and surrogate pair characters to
>   '\\U00xxxxxx'.
> 
> This algorithm converts any string to printable ASCII, and repr() is
> used as handy and safe way to print strings for debugging or for
> logging. Although all non-ASCII characters are escaped, this does not
> matter when most of the string's characters are ASCII. But for other
> languages, such as Japanese where most characters in a string are not
> ASCII, this is very inconvenient. Python 3000 has a lot of nice features
> for non-Latin users such as non-ASCII identifiers, so it would be
> helpful if Python could also progress in a similar way for printable
> output.
> 
> Some users might be concerned that such output will mess up their
> console if they print binary data like images. But this is unlikely to
> happen in practice because bytes and strings are different types in
> Python 3000, so printing an image to the console won't mess it up.
> 
> This issue was once discussed by Hye-Shik Chang [1]_ , but was rejected.
> 
> 
> Specification
> =============
> 
> - The algorithm to build repr() strings should be changed to:
> 
>   * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
> 
>   * Convert other non-printable ASCII characters(0x00-0x1f, 0x7f) to
>     '\\xXX'.
> 
>   * Convert leading surrogate pair characters without trailing character
>     (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
> 
>   * Convert Unicode whitespace other than ASCII space('\\x20') and
>     control characters (categories Z* and C* in the Unicode database)
>     to 'xXX', '\\uXXXX' or '\\U00xxxxxx'.
> 
> - Set the Unicode error-handler for sys.stdout and sys.stderr to
>   'backslashreplace' by default.
> 
> 
> Rationale
> =========
> 
> The repr() in Python 3000 should be Unicode not ASCII based, just like
> Python 3000 strings. Also, conversion should not be affected by the
> locale setting, because the locale is not necessarily the same as the
> output device's locale. For example, it is common for a daemon process
> to be invoked in an ASCII setting, but writes UTF-8 to its log files.

   Not only to log files. HTTP daemons, e.g., run with one locale but
answer to all kinds of clients.

> Characters not supported by user's console are hex-escaped on printing,
> by the Unicode encoders' error-handler. If the error-handler of the
> output file is 'backslashreplace', such characters are hex-escaped
> without raising UnicodeEncodeError. For example, if your default
> encoding is ASCII, ``print('�')`` will prints '\\xa2'. If your encoding
> is ISO-8859-1, '' will be printed.
> 
> 
> Printable characters
> --------------------
> 
> The Unicode standard doesn't define Non-printable characters, so we must
> create our own definition. Here we propose to define Non-printable
> characters as follows.
> 
> - Non-printable ASCII characters as Python 2.
> 
> - Broken surrogate pair characters.
> 
> - Characters defined in the Unicode character database as
> 
>   * Cc (Other, Control)
>   * Cf (Other, Format)
>   * Cs (Other, Surrogate)
>   * Co (Other, Private Use)
>   * Cn (Other, Not Assigned)
>   * Zl Separator, Line ('\\u2028', LINE SEPARATOR)
>   * Zp Separator, Paragraph ('\\u2029', PARAGRAPH SEPARATOR)
>   * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
>     this category should be escaped to avoid ambiguity.
> 
> 
> Alternate Solutions
> -------------------
> 
> To help debugging in non-Latin languages without changing repr(), other
> suggestion were made.
> 
> - Supply a tool to print lists or dicts.
> 
>  Strings to be printed for debugging are not only contained by lists or
>  dicts, but also in many other types of object. File objects contain a
>  file name in Unicode, exception objects contain a message in Unicode,
>  etc. These strings should be printed in readable form when repr()ed.
>  It is unlikely to be possible to implement a tool to print all
>  possible object types.
> 
> - Use sys.displayhook and sys.excepthook.
> 
>  For interactive sessions, we can write hooks to restore hex escaped
>  characters to the original characters. But these hooks are called only
>  when the result of evaluating an expression entered in an interactive
>  Python session, and doesn't work for the print() function or for
>  non-interactive sessions.

   Or for logging.debug("%r", ...)

> - Subclass sys.stdout and sys.stderr.
> 
>  It is difficult to implement a subclass to restore hex-escaped
>  characters since there isn't enough information left by the time it's
>  a string to undo the escaping correctly in all cases. For example, ``
>  print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
>  there is no chance to tell file objects apart.
> 
> - Make the encoding used by unicode_repr() adjustable.
> 
>  There is no benefit preserving the current repr() behavior to make
>  application/library authors aware of non-ASCII repr(). And selecting
>  an encoding on printing is more flexible than having a global setting.
> 
> 
> Open Issues
> ===========
> 
> - A lot of people use UTF-8 for their encoding, for example, en_US.utf8
>   and de_DE.utf8. In such cases, the backslashescape trick doesn't work.

   Also there is a problem of similarly drawing characters in Western,
Greek and Cyrillic languages. These languages use similar (but different)
alphabets (descended from the common ancestor) and contain letters that
look similar but has different character codes. For example, it is hard to
distinguish Latin 'a', 'e' and 'o' from Cyrillic 'а', 'е' and 'о'. (The
visual representation, of course, very much depends on the fonts used but
usually these letters are almost indistinguishable.)

> Backwards Compatibility
> =======================
> 
> Changing repr() may break some existing codes, especially testing code.
> Five of Python's regression test fail with this modification. If you
> need repr() strings without non-ASCII character as Python 2, you can use
> following function.
> 
> ::
> 
>  def repr_ascii(obj):
>      return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
> 
> 
> Reference Implementation
> ========================
> 
> http://bugs.python.org/issue2630
> 
> 
> References
> ==========
> 
> .. [1] Multibyte string on string::string_print
>        (http://bugs.python.org/issue479898)
> 
> 
> Copyright
> =========
> 
> This document has been placed in the public domain.
> _______________________________________________
> Python-3000 mailing list
> Python-3000 at python.org
> http://mail.python.org/mailman/listinfo/python-3000
> Unsubscribe: http://mail.python.org/mailman/options/python-3000/phd%40phd.pp.ru

Oleg.
-- 
     Oleg Broytmann            http://phd.pp.ru/            phd at phd.pp.ru
           Programmers don't die, they just GOSUB without RETURN.