[Python-Dev] PEP 460 reboot
Stephen J. Turnbull
stephen at xemacs.org
Thu Jan 16 05:39:30 CET 2014
Nick Coghlan writes:
> Yes, I'm currently thinking the appropriate approach to the docs
> will be to remove the current "these have most of the str methods
> too" paragraph for binary sequences and instead create three
> completely explicit lists of methods:
> - provided, works with arbitrary data
> - provided, assumes the use of an ASCII compatible data format
I'm not sure what that means. If you mean that in the format string
for .format() and %-formatting, bytes 0-127 must always have ASCII
coded character semantics with bytes 128-255 unrestricted, indeed,
that is the pragmatic restriction. Is there anything else?
The implications of this should be made clear, though: funky Asian
encodings cannot be used safely in format strings for format();
GB18030 isn't safe in %-formatting either; and the value returned by
these operations should be assumed to be non-ASCII-compatible unless
proven otherwise (so no iterated formatting).
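To illustrate the hazard for format(), here is a minimal sketch, assuming Python 3 and its bundled shift_jis codec. The helper name braces_in_encoding is hypothetical. It only shows that some multi-byte characters contain the byte values of '{' and '}' as trail bytes; it does not call any bytes formatting API, and it does not test the GB18030 claim.

```python
# Byte values of the ASCII braces that format() treats as field markers.
BRACES = b"{}"

def braces_in_encoding(chars, encoding):
    """Hypothetical helper: map each char to its encoded bytes when those
    bytes contain 0x7B ('{') or 0x7D ('}')."""
    hits = {}
    for ch in chars:
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # character is not representable in this encoding
        if any(b in BRACES for b in raw):
            hits[ch] = raw
    return hits

# Katakana block, U+30A1 through U+30F6.
katakana = [chr(cp) for cp in range(0x30A1, 0x30F7)]

sjis_hits = braces_in_encoding(katakana, "shift_jis")
utf8_hits = braces_in_encoding(katakana, "utf-8")

for ch, raw in sjis_hits.items():
    print(ch, raw)
print("UTF-8 hits:", utf8_hits)
```

UTF-8 produces no hits because every byte of a multi-byte UTF-8 sequence is in the range 128-255, which is what makes it ASCII-compatible in the sense discussed above.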
I think you also need
- provided, assumes pure ASCII-encoded text
since as far as I know the only strictly ASCII-compatible binary
formats are ISO 2022-compatible encodings and UTF-8, i.e., text. The
characters represented with bytes in the range 128-255 are not handled
by the bytes versions of the case-checking and case-converting
operations, which therefore have extremely dubious semantics unless
the data is pure ASCII. This is also true of most of the is*
operations.
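A small sketch of the difference, assuming Python 3. The choice of "é" and of the Latin-1 and UTF-8 encodings is only for illustration.

```python
text = "é"  # U+00E9, a cased letter outside ASCII

latin1 = text.encode("latin-1")  # a single byte in the range 128-255
utf8 = text.encode("utf-8")      # two bytes, both in the range 128-255

# str knows the character is a letter and has an uppercase form.
str_upper = text.upper()
str_is_alpha = text.isalpha()

# The bytes methods only touch ASCII letters, so these bytes pass through.
latin1_upper = latin1.upper()
utf8_upper = utf8.upper()
bytes_is_alpha = latin1.isalpha() or utf8.isalpha()

# Mixed data is converted only in part.
mixed_upper = "café".encode("latin-1").upper()

print(str_upper, latin1_upper, utf8_upper, mixed_upper)
print(str_is_alpha, bytes_is_alpha)
```

The last case is the bug magnet: the result looks converted, but the non-ASCII letter is silently left in lower case.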
Note that .center and .strip have pretty dubious semantics for
arbitrary "ASCII-compatible" data:
b' abc\r\n '
>>> " \xA0abc\xA0 ".strip()
'abc'
>>> b" \xA0abc\xA0 ".strip()
b'\xa0abc\xa0'
Of course the case of .center() is purely a programmer error, and I
don't have a use case where it's problematic in practice. But it's
sort of unpleasant.
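A sketch of the .center() case, assuming the programmer error in question is centering a line that still carries its terminator. The widths are arbitrary.

```python
line = b"abc\r\n"

# Centering the raw line pads on both sides of the terminator,
# so fill bytes end up after the CRLF.
centered = line.center(9)

# Stripping the terminator first and re-adding it avoids that.
fixed = line.rstrip(b"\r\n").center(7) + b"\r\n"

print(centered)
print(fixed)
```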
Although I have internalized Guido's point that what's important is
that there be no implicit conversions between bytes and str, I still
worry that this slew of subtle semantic differences when moving str
methods wholesale to bytes is a bug magnet.
I have an especially bad feeling about str-into-bytes interpolation.
If people want that, they should use a type like asciistr that
provides more or less firm guarantees that the content is pure ASCII.
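A rough sketch of the kind of guarantee meant here, assuming Python 3.5 or later, where %-formatting on bytes is available. The function interpolate_ascii is hypothetical and is not the API of asciistr; it only shows a strict ASCII check happening before any str content reaches the bytes result.

```python
def interpolate_ascii(template: bytes, *values: str) -> bytes:
    """Hypothetical helper: interpolate str values into a bytes template,
    refusing any value that is not pure ASCII."""
    # encode("ascii") raises UnicodeEncodeError on any non-ASCII character,
    # so nothing is converted implicitly or lossily.
    encoded = tuple(v.encode("ascii") for v in values)
    return template % encoded

ok = interpolate_ascii(b"Host: %b\r\n", "example.org")

try:
    interpolate_ascii(b"Host: %b\r\n", "ex\xe4mple.org")
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(ok, rejected)
```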
> - not provided
> PEP 461 would add a fourth category, of being provided, but with
> more restricted semantics.
I haven't looked closely at PEP 461 yet, and I'm not sure I'm going to
have time this week.