[Python-3000] PEP 3138 - String representation in Python 3000

Fri May 9 19:23:07 CEST 2008

On Fri, May 9, 2008 at 1:52 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>>>  For sys.stdout this doesn't make sense at all, since it hides encoding
>>>  errors for all applications using sys.stdout as piping mechanism.
>>>  -1 on that.
>>
>> You can raise UnicodeEncodeError for encoding errors if you want, by
>> setting sys.stdout's error-handler to `strict`.
>
> No, that's not a good idea. I don't want to change every single
> affected application just to make sure that they don't write
> corrupt data to stdout.

The changes you need to make to your applications will be so small
that I don't think this is a valid argument, and the number of
applications you need to change will be rather small as well.
What you call "corrupt data" is just hex-escaped characters of a
foreign language. In most cases, printing such a string (or writing
it to a file) does no harm, so I think raising an exception by
default is overkill. Java doesn't raise an exception on encoding
errors, but just prints `?`. .NET languages such as C# also print `?`.
Perl prints a hex-escaped string, as proposed in this PEP.
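
To make the three behaviours compared above concrete, here is a minimal sketch using the error handlers of str.encode() in present-day Python 3. The sample string and the helper name `encode_with` are only illustrative assumptions, not something taken from the PEP.

```python
text = "abc\u3042"  # 'a', 'b', 'c' followed by HIRAGANA LETTER A

def encode_with(handler):
    """Encode `text` to ASCII using the given error handler (hypothetical helper)."""
    try:
        return text.encode("ascii", errors=handler)
    except UnicodeEncodeError as exc:
        # 'strict' ends up here: nothing is written at all
        return "raised " + type(exc).__name__

results = {h: encode_with(h) for h in ("strict", "replace", "backslashreplace")}
for handler, result in results.items():
    print(handler, "->", result)
```

With `replace` the character is reduced to `?` and cannot be identified afterwards; with `backslashreplace` the code point is still visible in the output.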

>> Even though this PEP was rejected,
>
> You mean PEP 3138 was rejected ??

Er, I should have written "Even if this PEP was ...", perhaps.

> Well, "annoying" is not good enough for such a big change :-)

So? Annoyance with Perl was reason enough for me to change languages entirely :-)

> The backslashreplace idea may have some merits in interactive
> Python sessions or IDLE, but it hides encoding errors in all
> other situations.

Encoding errors are not hidden; they are represented by hex-escaped
strings. That gives us much more information about the string being
printed than a traceback does.
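
As a rough illustration of this point at the stream level, the sketch below writes the same text to an in-memory ASCII stream under both handlers. It uses io.TextIOWrapper as a stand-in for sys.stdout; the helper name `write_to_ascii_stream` is hypothetical.

```python
import io

def write_to_ascii_stream(text, errors):
    """Write `text` to an in-memory ASCII text stream and return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# backslashreplace: the output survives and the offending code point is visible
escaped = write_to_ascii_stream("caf\u00e9", "backslashreplace")
print(escaped)

# strict: the write fails and nothing reaches the other end
try:
    write_to_ascii_stream("caf\u00e9", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True
print(strict_raised)
```

An application that prefers the exception can still opt in by choosing `strict` for its own stream.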

> I'm not against changing the repr() of Unicode objects, but
> please make sure that this change does not break debugging
> Python applications. Whether you're debugging an app using
> 'print' statements, piping repr() through a socket to a remote
> debugger or writing information to a log file. The important
> factor to take into account is the other end that will receive
> the data.

I think your request is too vague to act on. This proposal improves
the currently broken debugging experience for me, and I see no
information lost for debugging. But the "other end" may vary too much
to say anything general about it.

> BTW: One problem that your PEP doesn't address, which I mentioned
> on the ticket:
>
> By putting all printable chars into the repr() you lose the
> ability to actually see the number of code points you have
> in a Unicode string.
>

With the current repr(), I cannot get any information other than the
number of code points. That is not what I want to know when printing
repr(). For the length of the string, I'll just do print(len(s)).
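
A small sketch of the two views being discussed, written for present-day Python 3, where ascii() gives the escaped form. The combining-accent sample is my own choice for illustration: it displays as one glyph but holds two code points.

```python
s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

readable = repr(s)   # readable form; may render as a single accented letter
escaped = ascii(s)   # ASCII-only form with the non-ASCII code point escaped
count = len(s)       # number of code points, independent of how s is displayed

print(escaped)
print(count)
```

The count comes from len(s) either way, so the readable repr() does not take that information away; it only stops being visible in the printed form.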

>
> Please name the property Py_UNICODE_ISPRINTABLE. Py_UNICODE_ISHEXESCAPED
> isn't all that intuitive.

The name `Py_UNICODE_ISPRINTABLE` came to my mind at first, but I was
not sure that `printable` is an accurate word. I'm okay with
Py_UNICODE_ISPRINTABLE, but I'd like to hear opinions. If no one
objects to Py_UNICODE_ISPRINTABLE, I'll go for it.
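
For a feel of what such a property distinguishes, here is a minimal sketch using str.isprintable() as found in released Python 3. Treating it as the Python-level counterpart of the C property discussed here is an assumption on my part, and the sample characters are only illustrative.

```python
samples = {
    "latin letter": "a",
    "hiragana": "\u3042",
    "ascii space": " ",
    "newline": "\n",
    "zero width space": "\u200b",
}

# True means repr() would show the character as-is rather than escaping it
printable = {name: ch.isprintable() for name, ch in samples.items()}
for name, flag in printable.items():
    print(name, flag)

# A non-printable character is escaped by repr()
zwsp_repr = repr("\u200b")
print(zwsp_repr)
```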

>
> How can things easily be changed so that it's possible to get the
> Py2.x style hex escaping back into Py3k without having to change
> all repr() calls and %r format markers for Unicode objects ?

I didn't intend to imply "without having to change". Perhaps
"migrate" was the wrong word and "port" would be better.

As for repr() calls and %r format markers, most of them are unlikely
to need changing. They need to be changed only if pure ASCII output is
required even when your locale is capable of printing the strings.

> I can see your point with it being easier to read e.g. German,
> Japanese or Korean data, but it still has to be possible to
> use repr() for proper debugging which allows the user to
> actually see what is stored in a Unicode object in terms of
> code points.

You can see code points easily; the function I wrote in the PEP to
convert such strings to the Python 2 style repr() is a good example.
But I believe ordinary use cases prefer a readable string over code
points.
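
A conversion of that kind can be sketched as below. This is a minimal version written for present-day Python 3 and not necessarily identical to the function in the PEP; the name `repr_ascii` and the sample string are assumptions for illustration.

```python
def repr_ascii(obj):
    """Return repr(obj) with every non-ASCII character backslash-escaped."""
    return repr(obj).encode("ascii", "backslashreplace").decode("ascii")

s = "\u3042\u3044"  # two hiragana letters
escaped = repr_ascii(s)
print(escaped)

# In released Python 3 the built-in ascii() gives the same result for this input
same_as_builtin = (escaped == ascii(s))
print(same_as_builtin)
```

Code that needs pure ASCII output can call such a helper where it used to call repr(), while everything else keeps the readable form.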