[Python-3000] PEP 3138 - String representation in Python 3000

Thu May 8 18:52:02 CEST 2008

On 2008-05-06 15:55, Atsuo Ishimoto wrote:
> (I changed subject)
> 
> Thank you for your comment.
> 
> On Tue, May 6, 2008 at 8:45 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> 
>>  For sys.stdout this doesn't make sense at all, since it hides encoding
>>  errors for all applications using sys.stdout as piping mechanism.
>>  -1 on that.
> 
> You can raise UnicodeEncodeError for encoding errors if you want, by
> setting sys.stdout's error-handler to `strict`.

No, that's not a good idea. I don't want to change every single
affected application just to make sure that they don't write
corrupt data to stdout.

>>  Both are really way beyond the scope of the PEP and I don't
>>  really see the need for them.
> 
> Even though this PEP was rejected, 

You mean PEP 3138 was rejected ??

> I'll still propose changing the
> default error-handler for sys.stdout and for sys.stderr to
> 'backslashreplace'. For Python 2, the 'strict' error-handler is acceptable
> because most text data is stored in 8-bit strings, but for Py3K, raising
> exceptions when the printed text contains a character not supported by
> the console is annoying.

Well, "annoying" is not good enough for such a big change :-)

Please also consider the different situations you are addressing:

  * console output (ie. printing)
  * stdout file output (ie. piping)
  * interactive session use (ie. running print at the Python prompt)

The backslashreplace idea may have some merits in interactive
Python sessions or IDLE, but it hides encoding errors in all
other situations.
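
To make the trade-off concrete, here is a minimal sketch of the two
error handlers under discussion. It assumes an ASCII-only output
stream, modelled with io.TextIOWrapper as a stand-in for sys.stdout;
the sample text and the helper name write_with are made up for
illustration.

```python
import io

text = "price: 10\u20ac"  # EURO SIGN cannot be encoded as ASCII


def write_with(errors):
    """Write text to an ASCII stream using the given error handler."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# 'strict' reports the problem to the writing application.
try:
    write_with("strict")
    strict_result = "no error"
except UnicodeEncodeError:
    strict_result = "UnicodeEncodeError"

# 'backslashreplace' succeeds silently; the reader of the pipe
# receives an escape sequence instead of the original character.
replaced = write_with("backslashreplace")

print(strict_result)
print(replaced)
```

With 'backslashreplace' the write never fails, so a program on the
other end of a pipe cannot tell that the data was altered.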

>>  They also don't cover the cases
>>  where you write the repr() to a log file, some stream or syslog.
> 
> Sure. I missed some cases, such as cgitb module or logging module.
> I'll investigate them later. If you have another candidate, please let
> me know.

You have to address the general use cases, not just specific
implementations in the Python stdlib - those can easily be changed,
but doing the same in all the existing code out there that wants
to get ported to Py3k is a different issue.

I'm not against changing the repr() of Unicode objects, but
please make sure that this change does not break debugging
Python applications, whether you're debugging an app using
'print' statements, piping repr() through a socket to a remote
debugger or writing information to a log file. The important
factor to take into account is the other end that will receive
the data.

BTW: One problem that your PEP doesn't address, which I mentioned
on the ticket:

By putting all printable chars into the repr() you lose the
ability to actually see the number of code points you have
in a Unicode string.

A Unicode-aware editor, shell or pager
will display the data as glyphs and not as code points, ie.
glyphs expressed using combining code points will appear
as one "character" to the user - even though the Unicode object
contains multiple code points. As a result, the length and
any indexes you might use in the debugging session will not
match what the user sees in his shell window.
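
A small illustration of the mismatch, using a string chosen for this
example: "e" followed by a combining acute accent displays as one
glyph but contains two code points.

```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
s = "e\u0301"

# The precomposed form of the same text is a single code point.
composed = unicodedata.normalize("NFC", s)

# Listing the code points explicitly shows what is really stored.
code_points = [f"U+{ord(c):04X}" for c in s]

print(len(s))         # 2
print(len(composed))  # 1
print(code_points)    # ['U+0065', 'U+0301']
```

Both strings typically render identically in a Unicode-aware
terminal, so s[1] refers to something the user cannot see as a
separate character.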

>>> - Characters defined in the Unicode character database as
> [snip]
>>  This is all very nice, but if that means that the whole Unicode
>>  database has to be loaded every time the interpreter starts up
>>  as you indicated on the ticket, then I'm firmly -1 against that.
> 
> I changed the patch to add a flag to the _PyUnicode_TypeRecords table,
> so the Unicode database is not loaded at startup.

Thanks.

Please name the property Py_UNICODE_ISPRINTABLE. Py_UNICODE_ISHEXESCAPED
isn't all that intuitive.

And also add your definition from the PEP to unicodectype.c - since
this is not a Unicode standard.

I'd also appreciate it if you could make that property available
as a Unicode method, e.g. .isprintable().

This addition is good on its own.
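
For reference, current Python 3 releases do provide str.isprintable().
The sketch below only shows how such a method behaves on a few sample
characters picked for illustration; it makes no claim about the exact
definition in the patch discussed here.

```python
# Sample characters and whether str.isprintable() accepts them.
samples = ["a", "\u3042", " ", "\n", "\u200b"]

results = {ch: ch.isprintable() for ch in samples}

for ch, printable in results.items():
    # Show the code point rather than the character itself, so the
    # output stays readable on any console.
    print(f"U+{ord(ch):04X} printable={printable}")
```

Letters such as "a" and HIRAGANA LETTER A (U+3042) and the ASCII
space count as printable, while the newline control character and
ZERO WIDTH SPACE (U+200B) do not.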

>>  I proposed to make the Unicode repr() output a regular encoding
>>  that's being implemented by a codec. You could then easily
>>  change the encoding to whatever you need for your application
>>  or console.
> 
> I think a global setting is not flexible enough. And I see no benefit to
> a customizable repr() except to stay compatible with Python 2, but I
> think it is easy to migrate existing code to Py3k.

That's what I don't see in your PEP.

How can things easily be changed so that it's possible to get the
Py2.x style hex escaping back into Py3k without having to change
all repr() calls and %r format markers for Unicode objects ?

I can see your point with it being easier to read e.g. German,
Japanese or Korean data, but it still has to be possible to
use repr() for proper debugging which allows the user to
actually see what is stored in a Unicode object in terms of
code points.
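
As a point of comparison, current Python 3 releases offer two ways to
get an escaped, code-point-level view of a string: the ascii()
builtin and the "unicode_escape" codec. This is only a sketch with a
made-up sample string. Both require an explicit call, so neither
answers the question of how to avoid touching existing repr() calls
and %r markers.

```python
s = "Gr\u00fc\u00dfe"  # German sample text with two non-ASCII letters

# ascii() returns a repr()-style string with non-ASCII escaped.
escaped = ascii(s)

# The unicode_escape codec produces similar escapes, as bytes and
# without the surrounding quotes.
via_codec = s.encode("unicode_escape")

print(escaped)    # 'Gr\xfc\xdfe'
print(via_codec)  # b'Gr\\xfc\\xdfe'
```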

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
