[Python-3000] PEP 3138 - String representation in Python 3000

Tue May 6 13:45:53 CEST 2008

On 2008-05-06 02:56, Atsuo Ishimoto wrote:
> I've written a PEP for new string representation in Python 3000.
> 
> Patch is updated at http://bugs.python.org/issue2630, and Guido
> updated a patch to Rietveld:
> http://codereview.appspot.com/767 .
> 
> I would appreciate your comments and help.
 >...
> Specification
> =============
> 
> - The algorithm to build repr() strings should be changed to:
> 
>   * Convert CR, LF, TAB and '\\' to '\\r', '\\n', '\\t', '\\\\'.
> 
>   * Convert other non-printable ASCII characters(0x00-0x1f, 0x7f) to
>     '\\xXX'.
> 
>   * Convert leading surrogate pair characters without trailing character
>     (0xd800-0xdbff, but not followed by 0xdc00-0xdfff) to '\\uXXXX'.
> 
>   * Convert Unicode whitespace other than ASCII space('\\x20') and
>     control characters (categories Z* and C* in the Unicode database)
>     to '\\xXX', '\\uXXXX' or '\\U00xxxxxx'.
> 
> - Set the Unicode error-handler for sys.stdout and sys.stderr to
>   'backslashreplace' by default.

For sys.stderr it may make sense to override any error reporting
because of encoding problems. -0 on that.

For sys.stdout this doesn't make sense at all, since it hides encoding
errors for all applications using sys.stdout as piping mechanism.
-1 on that.

Both are really way beyond the scope of the PEP and I don't
really see the need for them. They also don't cover the cases
where you write the repr() to a log file, some stream or syslog.

I'd be +1 on making the error handling of sys.stdout and sys.stderr
user adjustable.
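
To illustrate the difference between the two handlers, here is a minimal
sketch. It uses an in-memory stream instead of the real sys.stdout, and the
helper name write_with_handler is made up for this example. With 'strict' the
encoding problem is reported; with 'backslashreplace' it is silently turned
into an escape sequence, which is what a program reading the pipe would then
receive.

```python
import io


def write_with_handler(text, encoding="ascii", errors="strict"):
    """Write `text` through a text stream and return the bytes that came out.

    Hypothetical helper: it stands in for sys.stdout attached to a pipe.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# 'backslashreplace' hides the problem: the consumer gets an escape sequence.
print(write_with_handler("caf\u00e9", errors="backslashreplace"))

# 'strict' surfaces the problem to the application.
try:
    write_with_handler("caf\u00e9", errors="strict")
except UnicodeEncodeError as exc:
    print("encoding error reported:", exc.reason)
```

An application that wants the lenient behaviour could opt in by wrapping its
own stream this way, rather than having it imposed as the default.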

> Printable characters
> --------------------
> 
> The Unicode standard doesn't define Non-printable characters, so we must
> create our own definition. Here we propose to define Non-printable
> characters as follows.
> 
> - Non-printable ASCII characters as Python 2.
> 
> - Broken surrogate pair characters.
> 
> - Characters defined in the Unicode character database as
> 
>   * Cc (Other, Control)
>   * Cf (Other, Format)
>   * Cs (Other, Surrogate)
>   * Co (Other, Private Use)
>   * Cn (Other, Not Assigned)
>   * Zl Separator, Line ('\\u2028', LINE SEPARATOR)
>   * Zp Separator, Paragraph ('\\u2029', PARAGRAPH SEPARATOR)
>   * Zs (Separator, Space) other than ASCII space('\\x20'). Characters in
>     this category should be escaped to avoid ambiguity.

This is all very nice, but if that means that the whole Unicode
database has to be loaded every time the interpreter starts up
as you indicated on the ticket, then I'm firmly -1 against that.

We've taken great care *not* to do this in Py2.x by moving
the database to a module that's imported only when needed.
It would be really silly to do this now, just to get some
Unicode repr() processed.

BTW, I'm sure it's possible to break down the above into a set of
ranges and switch cases that are easy to test without having to
look up code points in the database. Even if you do end up using
the database, it should only be imported if the repr() really
does need to look up code points outside the Latin-1 range.
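
A rough sketch of what I mean, in Python rather than C. The function name
is_nonprintable and the constant are invented for this example; the Latin-1
ranges are my own reading of the category list quoted above (Cc covers
0x00-0x1f and 0x7f-0x9f, Zs adds 0xa0, Cf adds 0xad). The database module is
only imported once a code point outside Latin-1 shows up.

```python
# Categories the quoted proposal wants escaped (ASCII space is the exception).
_ESCAPED_CATEGORIES = frozenset(
    {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}
)


def is_nonprintable(ch):
    """Return True if `ch` would be escaped under the proposed definition."""
    cp = ord(ch)
    if cp < 0x100:
        # Latin-1 fast path: plain range tests, no database lookup.
        return cp < 0x20 or 0x7F <= cp <= 0xA0 or cp == 0xAD
    # Slow path: import the database only when it is really needed.
    import unicodedata
    return unicodedata.category(ch) in _ESCAPED_CATEGORIES


print(is_nonprintable("\t"))      # control character
print(is_nonprintable(" "))       # ASCII space stays printable
print(is_nonprintable("\u2028"))  # LINE SEPARATOR, needs the database
```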

> Alternate Solutions
> -------------------
> 
> To help debugging in non-Latin languages without changing repr(), other
> suggestions were made.
> ...
> - Make the encoding used by unicode_repr() adjustable.
> 
>  There is no benefit preserving the current repr() behavior to make
>  application/library authors aware of non-ASCII repr(). And selecting
>  an encoding on printing is more flexible than having a global setting.

I'm not sure what you are saying here.

I proposed to make the Unicode repr() output a regular encoding
that's being implemented by a codec. You could then easily
change the encoding to whatever you need for your application
or console.
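
As a rough illustration only (this is not the patch, and console_repr is a
made-up name), the effect can be approximated with the existing codec
machinery: encode with the target encoding and let 'backslashreplace' escape
whatever that encoding cannot represent. This sketch assumes the escaping of
control characters is handled separately.

```python
def console_repr(text, encoding):
    """Keep what `encoding` can represent and escape everything else.

    Hypothetical helper; it leaves control characters alone.
    """
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


sample = "a\u00e9\u3042"
for name in ("ascii", "latin-1", "utf-8"):
    # The same string comes out differently depending on the chosen encoding.
    print(name, console_repr(sample, name))
```

With "ascii" every non-ASCII character is escaped, with "latin-1" only the
Hiragana letter is, and with "utf-8" nothing is.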

-- 
Marc-Andre Lemburg
eGenix.com
