[Python-Dev] PEP 460 reboot

Stephen J. Turnbull stephen at xemacs.org
Thu Jan 16 05:39:30 CET 2014


Nick Coghlan writes:

 > Yes, I'm currently thinking the appropriate approach to the docs
 > will be to remove the current "these have most of the str methods
 > too" paragraph for binary sequences and instead create three
 > completely explicit lists of methods:

 >   - provided, works with arbitrary data

 >   - provided, assumes the use of an ASCII compatible data format

I'm not sure what that means.  If you mean that, in the format string
for .format() and %-formatting, bytes 0-127 must always have their
ASCII character semantics while bytes 128-255 are unrestricted, then
yes, that is the pragmatic restriction.  Is there anything else?

The implications of this should be made clear, though: multibyte Asian
encodings whose trailing bytes can fall in the ASCII range (Shift JIS,
Big5, GBK) cannot be safely used in format strings for format(),
GB18030 isn't safe in %-formatting either, and the value returned by
these operations should be assumed to be non-ASCII-compatible unless
proven otherwise (so it should not be reused as a format string).
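
A minimal sketch of the format() hazard, assuming the shift_jis codec
and an arbitrarily chosen scan range (kana plus the basic CJK
ideographs).  It only searches for characters whose encoded form
contains the byte for "{" or "}"; it does not exercise the proposed
bytes formatting itself, and the helper name is made up for the
example.

```python
# Find characters whose Shift JIS encoding contains a brace byte.
# The ranges scanned are an arbitrary choice for illustration.
BRACES = (0x7B, 0x7D)  # b"{" and b"}"

def chars_with_brace_bytes(encoding="shift_jis"):
    hits = []
    for cp in list(range(0x3040, 0x3100)) + list(range(0x4E00, 0xA000)):
        ch = chr(cp)
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding
        # Only multibyte sequences matter; check the trailing bytes.
        if len(raw) > 1 and any(b in BRACES for b in raw[1:]):
            hits.append((ch, raw))
    return hits

hits = chars_with_brace_bytes()
print(len(hits), hits[:3])
```

Any character in the result would look like the start or end of a
replacement field if such text appeared in a bytes format string.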

I think you also need

  - provided, assumes pure ASCII-encoded text

since as far as I know the only strictly ASCII-compatible binary
formats are ISO 2022-compatible encodings and UTF-8, i.e., text.  The
bytes versions of the case-checking and case-converting operations do
not handle characters represented with bytes in the range 128-255, so
they have extremely dubious semantics unless the data is pure ASCII.
This is also true of most of the is* methods.
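
A short illustration of the point about case operations, assuming
Latin-1 encoded data (the sample word is chosen only for the example).
The bytes methods map only the ASCII letters, while the str methods
are Unicode-aware:

```python
# Latin-1 bytes for "café": the byte 0xE9 is "é" in that encoding.
text = "caf\u00e9"
data = text.encode("latin-1")

upper_bytes = data.upper()    # only a-z are mapped
upper_text = text.upper()     # Unicode-aware case mapping

print(upper_bytes)            # b'CAF\xe9'
print(upper_text == "CAF\u00c9")   # True
print(b"\xe9".isalpha())      # False
print("\u00e9".isalpha())     # True
```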

Note that .center() and .strip() have pretty dubious semantics for
arbitrary "ASCII-compatible" data:

>>> b"abc\r\n".center(15)
b'     abc\r\n     '

>>> " \xA0abc\xA0 ".strip()
'abc'
>>> b" \xA0abc\xA0 ".strip()
b'\xa0abc\xa0'

Of course the case of .center() is purely a programmer error, and I
don't have a use case where it's problematic in practice.  But it's
sort of unpleasant.

Although I have internalized Guido's point that what's important is
that there be no implicit conversions between bytes and str, I still
worry that this slew of subtle semantic differences when moving str
methods wholesale to bytes is a bug magnet.

I have an especially bad feeling about str-into-bytes interpolation.
If people want that, they should use a type like asciistr that
provides more or less firm guarantees that the content is pure ASCII.
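
As a rough sketch of what such a guarantee could look like, here is a
hypothetical str subclass that refuses non-ASCII content at
construction time.  The class name AsciiStr and its to_bytes() method
are invented for this example and are not the API of any existing
asciistr implementation.

```python
class AsciiStr(str):
    """Hypothetical str subclass that rejects non-ASCII content."""

    def __new__(cls, value):
        text = str(value)
        text.encode("ascii")  # raises UnicodeEncodeError if not pure ASCII
        return super().__new__(cls, text)

    def to_bytes(self):
        # Safe by construction: the content was validated in __new__.
        return self.encode("ascii")

# Explicit, checked conversion before combining with bytes.
header = b"Content-Length: " + AsciiStr(1024).to_bytes()

try:
    AsciiStr("caf\u00e9")
except UnicodeEncodeError:
    rejected = True
else:
    rejected = False
```

With a type like this the conversion to bytes stays explicit, and
non-ASCII text fails early instead of leaking into binary data.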

 >   - not provided

 > PEP 461 would add a fourth category, of being provided, but with
 > more restricted semantics.

I haven't looked closely at PEP 461 yet, and I'm not sure I'm going to
have time this week.

