[Python-Dev] PEP 460: allowing %d and %f and mojibake

Scott Dial scott+python-dev at scottdial.com
Mon Jan 13 04:26:24 CET 2014


On 2014-01-11 22:09, Nick Coghlan wrote:
> For Python 2 folks trying to grok where the "bright line" is in terms of
> the Python 3 text model: if your proposal includes *any* kind of
> implicit serialisation of non binary data to binary, it is going to be
> rejected as an addition to the core bytes type. If it avoids crossing
> that line (as the buffer-API-only version of PEP 460 does), then we can
> talk.

To take such a hard-line stance, I would expect you to author a PEP to
strip the ASCII conveniences from the bytes and bytearray types.
Otherwise, I find it a bit schizophrenic to argue that methods like
lower, upper, title, etc. don't implicitly assume an encoding:

>>> a = "scott".encode('utf-16')
>>> b = a.title()
>>> c = b.decode('utf-16')
>>> c
'SCOTT'

So, clearly title() not only depends on the bytes being encoded in a
superset of ASCII, it depends on the bytes being a sequence of ASCII
characters, which looks an awful lot like an operation on an implicitly
encoded string.
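
The same effect can be reproduced as a small script. This is only a
sketch: the variable names are illustrative, and 'utf-16-le' is used
instead of 'utf-16' purely so the result does not depend on the BOM or
on the platform's byte order.

```python
# bytes.title() treats every byte as an ASCII character, whatever
# encoding the bytes actually came from.
ascii_titled = "scott".encode("ascii").title()
utf16_titled = "scott".encode("utf-16-le").title()

# In UTF-16 each ASCII letter is followed by a NUL byte. title() sees the
# NUL as an uncased byte, i.e. a word boundary, so every letter starts a
# new "word" and gets capitalised.
ascii_result = ascii_titled.decode("ascii")
utf16_result = utf16_titled.decode("utf-16-le")

print(ascii_result)  # Scott
print(utf16_result)  # SCOTT
```

With ASCII-encoded input the method behaves as a text operation would;
with UTF-16 input it silently gives a different answer.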

>>> b"文字化け"
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

There is an implicit serialization right there. My terminal is utf8 (and
the same would hold if my source encoding were utf8), so why would that
not be:

b'\xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91'
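
For comparison, a minimal sketch of what Python 3 does accept: the
encoding has to be spelled out explicitly. The variable names are
illustrative, and compile() is used here only so the rejected literal
can be shown in a script without aborting it.

```python
text = "文字化け"

# Explicit encoding: these are the bytes a utf8 terminal or source file
# would have held for the literal above.
utf8_bytes = text.encode("utf-8")
expected = b'\xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91'
print(utf8_bytes == expected)  # True

# The bytes literal itself is refused at compile time, because only
# ASCII characters are allowed between the quotes.
literal_rejected = False
try:
    compile('b"文字化け"', "<example>", "eval")
except SyntaxError:
    literal_rejected = True
print(literal_rejected)  # True
```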

I sympathize with Ethan that the bytes and bytearray types already seem
to concede that bytes is the type you want to use for 7-bit ASCII
manipulations. If that is not what we want, then we are not doing a good
job communicating that to developers with the API. At the outset, the
bytes literal itself seems to be an attractive nuisance, as it gives a
nod to using bytes for ASCII character sequences (a.k.a. ASCII strings).

Regards,
-Scott

-- 
Scott Dial
scott at scottdial.com

