[Python-3000] PEP 3138: String representation in Python 3000

M.-A. Lemburg mal at egenix.com
Wed May 14 18:18:39 CEST 2008


Atsuo,

you are not really addressing my arguments in your reply.

My main concern is that repr(unicode) as well as '%r' is used
a lot in logging and debugging of applications.

In the 2.x series of Python, the output of repr() has always been
plain ASCII. It does not require any special encoding and does not
run into problems when mixed with other encodings used in the log
file, on the console, or wherever the output of repr() is sent.
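
For illustration, this is roughly what a 2.x interactive session gives
you today; no matter which code points the string contains, the repr()
output itself stays plain ASCII (the sample data is made up):

>>> u = u'G\xfcnter \u65e5\u672c'
>>> print repr(u)
u'G\xfcnter \u65e5\u672c'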

You are now proposing to break this convention by allowing
all printable code points to appear in the repr() output.
Depending on where you send that output and on the contents
of the PyUnicode object, this will likely result in exceptions
in the .write() method of the stream object.
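
As a minimal sketch of that failure mode (the file name and the data
are made up, and the string literal stands in for what the proposed
repr() would return):

import codecs

log = codecs.open('my.log', 'w', encoding='ascii')
log.write(u"value=u'G\xfcnter \u65e5\u672c'\n")
# -> UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' ...

Today the same write succeeds, because repr() only ever hands the
stream plain ASCII.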

Just adjusting sys.stdout and sys.stderr to prevent them from
falling over is not enough (and is indeed not within the scope
of the PEP, since those changes are *major* and not warranted
for just getting your Unicode repr() to work). repr() is very
often written to log files and those would all have to be
changed as well.

Now, as I've said before, I can see your point about wanting
to be able to read the Unicode code points, even if you use
repr() rather than the more straightforward .encode()
approach. However, when suggesting such changes, you always
have to see the other side as well:

  - Are there alternative ways to get the "problem" fixed ?
  - Is the added convenience worth breaking existing conventions ?
  - Is it worth breaking existing applications ?

I've suggested making the repr() output configurable to address
the convenience aspect of your proposal. You could then set the
output encoding to e.g. "unicode-printable" and get your preferred
output. The default could remain set to the current all-ASCII output.
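
To make that concrete, a purely hypothetical interface for such a
setting could look like this (sys.setunicodereprencoding() does not
exist anywhere; it is only meant to illustrate the suggestion):

import sys

# hypothetical API -- not part of any Python release or of PEP 3138
sys.setunicodereprencoding('ascii')               # keep today's output
sys.setunicodereprencoding('unicode-printable')   # opt-in readable repr()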

Hardwiring the encoding is not a good idea, especially since there
are lots of alternatives for getting readable output from
PyUnicode objects today, without any changes to the interpreter.

E.g.

print '%s' % u.encode('utf-8')

or

print '%s' % u.encode('shift-jis')

or

logfile = open('my.log', 'w', encoding='unicode-printable')
logfile.write(u)

or

def unicode_repr(u):
    return u.encode('unicode-printable')
print '%s' % unicode_repr(u)

There are many ways to solve your problem.
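
The last two examples assume the "unicode-printable" codec I'm
suggesting below; it doesn't exist today, but a rough pure-Python
sketch could look like this (simplified: the category test follows the
"printable" definition discussed for the PEP, i.e. everything outside
the Other and Separator categories plus the plain space, and UTF-8 is
just one possible byte encoding for the output):

import codecs, unicodedata

# Categories treated as non-printable: "Other" (C*) and "Separator"
# (Z*), except for the plain ASCII space.
_NONPRINTABLE = ('Cc', 'Cf', 'Cs', 'Co', 'Cn', 'Zl', 'Zp', 'Zs')

def _escape(u):
    chars = []
    for ch in u:
        if ch != u' ' and unicodedata.category(ch) in _NONPRINTABLE:
            if ord(ch) > 0xffff:
                chars.append(u'\\U%08x' % ord(ch))
            else:
                chars.append(u'\\u%04x' % ord(ch))
        else:
            chars.append(ch)
    return u''.join(chars)

class _Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        # keep printable code points, hex-escape the rest, emit UTF-8
        return _escape(input).encode('utf-8'), len(input)
    def decode(self, input, errors='strict'):
        return input.decode('utf-8'), len(input)

class _StreamWriter(_Codec, codecs.StreamWriter):
    pass

class _StreamReader(_Codec, codecs.StreamReader):
    pass

def _search(name):
    if name in ('unicode-printable', 'unicode_printable'):
        c = _Codec()
        return (c.encode, c.decode, _StreamReader, _StreamWriter)
    return None

codecs.register(_search)

With that registered, u.encode('unicode-printable') and
codecs.open('my.log', 'w', encoding='unicode-printable') become usable.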

In summary, I am:

  -1 on hardwiring the unicode repr() output to a non-ASCII
     encoding

  +1 on adding the PyUnicode_ISPRINTABLE() API

  +1 on adding a unicode-printable codec which implements
     your suggested encoding, so that you can use it for e.g.
     log files or as sys.stdout encoding

  +0 on making unicode repr() encoding adjustable

Regards,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 14 2008)
 >>> Python/Zope Consulting and Support ...        http://www.egenix.com/
 >>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
 >>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

On 2008-05-09 19:23, Atsuo Ishimoto wrote:
> On Fri, May 9, 2008 at 1:52 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>>>>  For sys.stdout this doesn't make sense at all, since it hides encoding
>>>>  errors for all applications using sys.stdout as piping mechanism.
>>>>  -1 on that.
>>> You can raise UnicodeEncodeError for encoding errors if you want, by
>>> setting sys.stdout's error handler to `strict`.
>> No, that's not a good idea. I don't want to change every single
>> affected application just to make sure that they don't write
>> corrupt data to stdout.
> 
> The changes you would need to make to your applications are so small
> that I don't think this is a valid argument, and the number of
> applications you would need to change is rather small.
> What you call "corrupt data" is just hex-escaped characters from a
> foreign language. In most cases, printing such a string (or writing it
> to a file) does no harm, so I think raising an exception by default is
> overkill. Java doesn't raise an exception on encoding errors, it just
> prints `?`; .NET languages such as C# also print '?'. Perl prints a
> hex-escaped string, as proposed in this PEP.
> 
>>> Even though this PEP was rejected,
>> You mean PEP 3138 was rejected ??
> 
> Er, I should have written "Even if this PEP was ...", perhaps.
> 
>> Well, "annoying" is not good enough for such a big change :-)
> 
> So? The annoyance of Perl was enough reason to change the entire language, for me :-)
> 
>> The backslashreplace idea may have some merits in interactive
>> Python sessions or IDLE, but it hides encoding errors in all
>> other situations.
> 
> Encoding errors are not hidden; they are represented by hex-escaped
> strings. We get much more information about the string being printed
> than we would from a traceback.
> 
>> I'm not against changing the repr() of Unicode objects, but
>> please make sure that this change does not break debugging of
>> Python applications, whether you're debugging an app using
>> 'print' statements, piping repr() through a socket to a remote
>> debugger, or writing information to a log file. The important
>> factor to take into account is the other end that will receive
>> the data.
> 
> I think your request is too vague to be acted on. This proposal
> improves the currently broken debugging for me, and I see no loss of
> information for debugging. But the "other end" varies too much to say
> anything definite about it.
> 
>> BTW: One problem that your PEP doesn't address, which I mentioned
>> on the ticket:
>>
>> By putting all printable chars into the repr() you lose the
>> ability to actually see the number of code points you have
>> in a Unicode string.
>>
> 
> With the current repr(), I cannot get any information other than the
> number of code points, and that is not what I want to learn by
> printing repr(). For the length of the string, I'll just do
> print(len(s)).
> 
>> Please name the property Py_UNICODE_ISPRINTABLE. Py_UNICODE_ISHEXESCAPED
>> isn't all that intuitive.
> 
> The name `Py_UNICODE_ISPRINTABLE` came to my mind first, but I was
> not sure that `printable` is the accurate word. I'm okay with
> Py_UNICODE_ISPRINTABLE, but I'd like to hear opinions. If no one
> objects to Py_UNICODE_ISPRINTABLE, I'll go for it.
> 
>> How can things easily be changed so that it's possible to get the
>> Py2.x style hex escaping back into Py3k without having to change
>> all repr() calls and %r format markers for Unicode objects ?
> 
> I didn't intend to imply "without having to change". Perhaps
> "migrate" was the wrong word and "port" would be better.
> 
> Most repr() calls and %r formats are unlikely to need any changes.
> They only need to change if pure ASCII is required even though your
> locale is capable of printing the strings.
> 
>> I can see your point with it being easier to read e.g. German,
>> Japanese or Korean data, but it still has to be possible to
>> use repr() for proper debugging which allows the user to
>> actually see what is stored in a Unicode object in terms of
>> code points.
> 
> You can still see the code points easily; the function I wrote in the
> PEP to convert such strings to the Python 2 repr() form is a good
> example. But I believe the ordinary use case prefers readable strings
> over code points.


