[Python-checkins] r63902 - peps/trunk/pep-3138.txt

Tue Jun 3 00:19:34 CEST 2008

Author: guido.van.rossum
Date: Tue Jun  3 00:19:25 2008
New Revision: 63902

Log:
New version from Atsuo.


Modified:
   peps/trunk/pep-3138.txt

Modified: peps/trunk/pep-3138.txt
==============================================================================

--- peps/trunk/pep-3138.txt	(original)
+++ peps/trunk/pep-3138.txt	Tue Jun  3 00:19:25 2008
@@ -9,11 +9,12 @@
 Created: 05-May-2008
 Post-History:
 
+
 Abstract
 ========
 
-This PEP proposes new string representation form for Python 3000. In
-Python prior to Python 3000, the repr() built-in function converts
+This PEP proposes a new string representation form for Python 3000. In
+Python prior to Python 3000, the repr() built-in function converted
 arbitrary objects to printable ASCII strings for debugging and logging.
 For Python 3000, a wider range of characters, based on the Unicode
 standard, should be considered 'printable'.
@@ -28,30 +29,39 @@
 - Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
 
 - Convert other non-printable characters(0x00-0x1f, 0x7f) and non-ASCII
-  characters(>=0x80) to '\\xXX'.
+ characters(>=0x80) to '\\xXX'.
 
-- Backslash-escape quote characters(' or ") and add quote character at
-  head and tail.
+- Backslash-escape quote characters (apostrophe, ') and add the quote
+ character at the beginning and the end.
 
 For Unicode strings, the following additional conversions are done.
 
 - Convert leading surrogate pair characters without trailing character
-  (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
+ (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
 
 - Convert 16-bit characters(>=0x100) to '\\uXXXX'.
 
 - Convert 21-bit characters(>=0x10000) and surrogate pair characters to
-  '\\U00xxxxxx'.
+ '\\U00xxxxxx'.
 
 This algorithm converts any string to printable ASCII, and repr() is
-used as handy and safe way to print strings for debugging or for
+used as a handy and safe way to print strings for debugging or for
 logging. Although all non-ASCII characters are escaped, this does not
 matter when most of the string's characters are ASCII. But for other
 languages, such as Japanese where most characters in a string are not
-ASCII, this is very inconvenient. Python 3000 has a lot of nice features
-for non-Latin users such as non-ASCII identifiers, so it would be
-helpful if Python could also progress in a similar way for printable
-output.
+ASCII, this is very inconvenient.
+
+We can use ``print(aJapaneseString)`` to get a readable string, but we
+don't have a similar workaround for printing strings from collections
+such as lists or tuples. ``print(listOfJapaneseStrings)`` uses repr() to
+build the string to be printed, so the resulting strings are always
+hex-escaped. Or when ``open(japaneseFilename)`` raises an exception, the
+error message is something like ``IOError: [Errno 2] No such file or
+directory: '\u65e5\u672c\u8a9e'``, which isn't helpful.
+
+Python 3000 has a lot of nice features for non-Latin users such as
+non-ASCII identifiers, so it would be helpful if Python could also
+progress in a similar way for printable output.
 
 Some users might be concerned that such output will mess up their
 console if they print binary data like images. But this is unlikely to
@@ -64,22 +74,53 @@
 Specification
 =============
 
+- Add a new function to the Python C API ``int PY_UNICODE_ISPRINTABLE
+ (Py_UNICODE ch)``. This function returns 0 if repr() should escape the
+ Unicode character ``ch``; otherwise it returns 1. Characters that should
+ be escaped are defined in the Unicode character database as:
+
+ * Cc (Other, Control)
+ * Cf (Other, Format)
+ * Cs (Other, Surrogate)
+ * Co (Other, Private Use)
+ * Cn (Other, Not Assigned)
+ * Zl (Separator, Line), refers to LINE SEPARATOR ('\\u2028').
+ * Zp (Separator, Paragraph), refers to PARAGRAPH SEPARATOR ('\\u2029').
+ * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
+   this category should be escaped to avoid ambiguity.
+
 - The algorithm to build repr() strings should be changed to:
 
-  * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
+ * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
+
+ * Convert non-printable ASCII characters(0x00-0x1f, 0x7f) to '\\xXX'.
+
+ * Convert leading surrogate pair characters without trailing character
+   (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
+
+ * Convert non-printable characters(PY_UNICODE_ISPRINTABLE() returns 0)
+   to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
 
-  * Convert other non-printable ASCII characters(0x00-0x1f, 0x7f) to
-    '\\xXX'.
+ * Backslash-escape quote characters (apostrophe, 0x27) and add quote
+   character at the beginning and the end.
 
-  * Convert leading surrogate pair characters without trailing character
-    (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
+- Set the Unicode error-handler for sys.stderr to 'backslashreplace' by
+ default.
 
-  * Convert Unicode whitespace other than ASCII space('\\x20'), and
-    control characters (categories Z* and C* in the Unicode database),
-    to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
+- Add a ``'%a'`` string format operator. ``'%a'`` converts any Python
+ object to a string using repr() and then hex-escapes all non-ASCII
+ characters. The ``'%a'`` format operator generates the same string as
+ ``'%r'`` in Python 2.
 
-- Set the Unicode error-handler for sys.stdout and sys.stderr to
-  'backslashreplace' by default.
+- Add a new built-in function, ``ascii()``. This function converts any
+ Python object to a string using repr() and then hex-escapes all
+ non-ASCII characters. ``ascii()`` generates the same string as
+ ``repr()`` in Python 2.
+
+- Add an ``isprintable()`` method to the string type. ``str.isprintable()``
+ returns False if repr() should escape any character in the string;
+ otherwise returns True. The ``isprintable()`` method calls the
+ ``PY_UNICODE_ISPRINTABLE()`` function internally.
 
 
 Rationale
@@ -90,44 +131,29 @@
 locale setting, because the locale is not necessarily the same as the
 output device's locale. For example, it is common for a daemon process
 to be invoked in an ASCII setting, but writes UTF-8 to its log files.
+Also, web applications might want to report the error information in
+a more readable form based on the HTML page's encoding.
 
-Characters not supported by user's console are hex-escaped on printing,
-by the Unicode encoders' error-handler. If the error-handler of the
-output file is 'backslashreplace', such characters are hex-escaped
-without raising UnicodeEncodeError. For example, if your default
-encoding is ASCII, ``print('¢')`` will prints '\\xa2'. If your encoding
-is ISO-8859-1, '¢' will be printed.
-
-
-Printable characters
---------------------
-
-The Unicode standard doesn't define Non-printable characters, so we must
-create our own definition. Here we propose to define Non-printable
-characters as follows.
-
-- Non-printable ASCII characters as Python 2.
-
-- Broken surrogate pair characters.
-
-- Characters defined in the Unicode character database as
-
-  * Cc (Other, Control)
-  * Cf (Other, Format)
-  * Cs (Other, Surrogate)
-  * Co (Other, Private Use)
-  * Cn (Other, Not Assigned)
-  * Zl Separator, Line ('\\u2028', LINE SEPARATOR)
-  * Zp Separator, Paragraph ('\\u2029', PARAGRAPH SEPARATOR)
-  * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
-    this category should be escaped to avoid ambiguity.
-
+Characters not supported by the user's console could be hex-escaped on
+printing, by the Unicode encoder's error-handler. If the error-handler
+of the output file is 'backslashreplace', such characters are
+hex-escaped without raising UnicodeEncodeError. For example, if your
+default encoding is ASCII, ``print('Hello ¢')`` will print 'Hello \\xa2'.
+If your encoding is ISO-8859-1, 'Hello ¢' will be printed.
+
+The default error-handler of sys.stdout is 'strict'. Other applications
+reading the output might not understand hex-escaped characters, so
+unsupported characters should be trapped when writing. If you need to
+escape unsupported characters, you should change the error-handler
+explicitly. For sys.stderr, the default error-handler is set to
+'backslashreplace', so printing exceptions or error messages won't
+fail.
 
 Alternate Solutions
 -------------------
 
 To help debugging in non-Latin languages without changing repr(), other
-suggestion were made.
+suggestions were made.
 
 - Supply a tool to print lists or dicts.
 
@@ -142,9 +168,9 @@
 
  For interactive sessions, we can write hooks to restore hex escaped
  characters to the original characters. But these hooks are called only
- when the result of evaluating an expression entered in an interactive
- Python session, and doesn't work for the print() function or for
- non-interactive sessions.
+ when printing the result of evaluating an expression entered in an
+ interactive Python session, and don't work for the print() function,
+ for non-interactive sessions or for logging.debug("%r", ...), etc.
 
 - Subclass sys.stdout and sys.stderr.
 
@@ -154,34 +180,91 @@
  print("\\"+"u0041")`` should be printed as '\\u0041', not 'A'. But
  there is no chance to tell file objects apart.
 
-- Make the encoding used by unicode_repr() adjustable.
+- Make the encoding used by unicode_repr() adjustable, and make the
+ existing repr() the default.
+
+ With adjustable repr(), the result of using repr() is unpredictable
+ and would make it impossible to write correct code involving repr().
+ And if the current repr() is the default, then the old convention
+ remains intact and users may expect ASCII strings as the result of
+ repr(). Third party applications or libraries could be confused when
+ a custom repr() function is used.
 
- There is no benefit preserving the current repr() behavior to make
- application/library authors aware of non-ASCII repr(). And selecting
- an encoding on printing is more flexible than having a global setting.
+
+Backwards Compatibility
+=======================
+
+Changing repr() may break some existing code, especially testing code.
+Five of Python's regression tests fail with this modification. If you
+need repr() strings without non-ASCII characters, as in Python 2, you
+can use the following function. ::
+
+   def repr_ascii(obj):
+       return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
+
+For logging or for debugging, the following code can raise
+UnicodeEncodeError. ::
+
+   log = open("logfile", "w")
+   log.write(repr(data))     # UnicodeEncodeError will be raised
+                             # if data contains unsupported characters.
+
+To avoid exceptions being raised, you can explicitly specify the error-
+handler. ::
+
+   log = open("logfile", "w", errors="backslashreplace")
+   log.write(repr(data))  # Unsupported characters will be escaped.
+
+
+For a console that uses a Unicode-based encoding, for example,
+en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn't work and
+no printable characters are escaped. This causes a problem with
+similar-looking characters in Western, Greek and Cyrillic languages.
+These languages use similar (but different) alphabets (descended from
+a common ancestor) and contain letters that look similar but have
+different character codes. For example, it is hard to distinguish Latin
+'a', 'e' and 'o' from Cyrillic '\u0430', '\u0435' and '\u043e'. (The visual
+representation, of course, very much depends on the fonts used but
+usually these letters are almost indistinguishable.) To avoid the
+problem, the user can adjust the terminal encoding to get a result
+suitable for their environment.
 
 
 Open Issues
 ===========
 
-- A lot of people use UTF-8 for their encoding, for example, en_US.utf8
-  and de_DE.utf8. In such cases, the backslashescape trick doesn't work.
+- Is the ``ascii()`` function necessary, or is it sufficient to document
+ how to do it? If necessary, should ``ascii()`` belong to the builtin
+ namespace?
 
 
-Backwards Compatibility
-=======================
+Rejected Proposals
+==================
 
-Changing repr() may break some existing codes, especially testing code.
-Five of Python's regression test fail with this modification. If you
-need repr() strings without non-ASCII character as Python 2, you can use
-following function.
+- Add encoding and errors arguments to the builtin print() function,
+ with defaults of sys.getfilesystemencoding() and 'backslashreplace'.
+
+ Complicated to implement, and in general, this is not seen as a good
+ idea. [2]_
+
+- Use character names to escape characters, instead of hex character
+ codes. For example, ``repr('\u03b1')`` can be converted to
+ ``"\N{GREEK SMALL LETTER ALPHA}"``.
+
+ Using character names can be very verbose compared to hex-escape.
+ e.g., ``repr("\ufbf9")`` is converted to ``"\N{ARABIC LIGATURE UIGHUR
+ KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA ISOLATED FORM}"``.
 
-::
+- Default error-handler of sys.stdout should be 'backslashreplace'.
 
- def repr_ascii(obj):
-     return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
+ Stuff written to stdout might be consumed by another program that
+ might misinterpret the \ escapes. For interactive sessions, it is
+ possible to make the 'backslashreplace' error-handler the default, but
+ this may add confusion of the kind "it works in interactive mode but
+ not when redirecting to a file".
 
 
 Reference Implementation
 ========================
 
@@ -194,6 +277,8 @@
 .. [1] Multibyte string on string::string_print
        (http://bugs.python.org/issue479898)
 
+.. [2] [Python-3000] Displaying strings containing unicode escapes
+       (http://mail.python.org/pipermail/python-3000/2008-April/013366.html)
 
 Copyright
 =========
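
Note on the patched text above: the escaping rule listed in the
Specification hunk and the ``repr_ascii()`` helper from the Backwards
Compatibility hunk can be sketched in Python 3 roughly as follows. This
is only an illustration, not the reference implementation. The name
``is_printable_sketch`` is a hypothetical stand-in for
``PY_UNICODE_ISPRINTABLE()``, and it assumes the standard ``unicodedata``
module reports the same general categories the PEP refers to.

```python
import unicodedata

# General categories the PEP lists as "should be escaped".
# Zs is escaped too, except for the ASCII space handled below.
ESCAPED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}


def is_printable_sketch(ch):
    """Hypothetical stand-in for PY_UNICODE_ISPRINTABLE(): return True
    if repr() would leave the single character ``ch`` unescaped."""
    if ch == " ":
        # ASCII space is the one Zs character that is not escaped.
        return True
    return unicodedata.category(ch) not in ESCAPED_CATEGORIES


def repr_ascii(obj):
    """Helper quoted in the PEP: repr() with non-ASCII hex-escaped."""
    return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")


if __name__ == "__main__":
    # Print the category and the sketch's verdict for a few characters.
    for ch in ["a", " ", "\u65e5", "\n", "\u00a0", "\u2028"]:
        print(ascii(ch), unicodedata.category(ch), is_printable_sketch(ch))
    print(repr_ascii("\u65e5\u672c\u8a9e"))
```

The sketch classifies one character at a time; a string-level check in
the spirit of ``str.isprintable()`` would be
``all(is_printable_sketch(c) for c in s)``.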