[Python-Dev] PEP 460: allowing %d and %f and mojibake

Ethan Furman ethan at stoneleaf.us
Sun Jan 12 20:14:50 CET 2014


On 01/12/2014 11:00 AM, Paul Moore wrote:
>
> And yet I still don't follow what you *want*. Unless it's that b'%d' %
> (12,) must work and give b'12', and nothing else is acceptable.

Nothing else is ideal.  I'll go that route if I have to.  I understand that in the real world you go with what works, but in the development stage you fight for the ideal.  :)


> My reading of Nick's refusal is that %d takes a value which is
> semantically a number, converts it into a base-10 representation
> (which is semantically a *string*, not a sequence of bytes[1]) and
> then *encodes* that string into a series of bytes using the ASCII
> encoding. That is *two* semantic transformations, and one (the ASCII
> encoding) is *implicit*. Specifically, it's implicit because (a) the
> normal reading of %d is "produce the base-10 representation of a
> number", and a base-10 representation is a *string*, and (b) because
> nowhere has ASCII been mentioned (why not UTF16? that would be
> entirely plausible for a wchar-based environment like Windows). And a
> core principle of the bytes/text separation in Python 3 is that
> encoding should never happen implicitly.
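
To make the two transformations described above concrete, here is a minimal sketch that performs each step explicitly. The helper name format_int_as_bytes is hypothetical and is not part of PEP 460 or the standard library; it only illustrates that the choice of encoding changes the resulting bytes.

```python
def format_int_as_bytes(value, encoding="ascii"):
    """Hypothetical helper: spell out both steps hidden inside b'%d'."""
    text = str(value)              # step 1: number -> base-10 text (a str)
    return text.encode(encoding)   # step 2: text -> bytes, encoding made explicit


ascii_bytes = format_int_as_bytes(12)               # b'12'
utf16_bytes = format_int_as_bytes(12, "utf-16-le")  # b'1\x002\x00'
```

The same number yields different byte sequences depending on the encoding, which is why an unstated ASCII default counts as an implicit encoding step.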

That could be.  And yet the bytes type already has several concessions to ASCII encoding.
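
A few of those concessions, sketched with standard bytes behaviour in Python 3 (the variable names are only for illustration):

```python
header = b"Content-Length: 12"

# Case conversion treats only the ASCII letters as letters.
upper = header.upper()              # b'CONTENT-LENGTH: 12'
untouched = b"\xe9".upper()         # b'\xe9', a non-ASCII byte is left alone

# Classification methods use the ASCII digit range.
is_digit = b"12".isdigit()          # True

# int() parses ASCII digits directly from bytes.
as_int = int(b"12")                 # 12

# repr() shows printable ASCII bytes as characters.
shown = repr(b"\x31\x32")           # "b'12'"
```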


> By the way, I should point out that I would never have understood
> *any* of the ideas involved in this thread before Python 3 forced me
> to think about Unicode and the distinction between text and bytes. And
> yet, I now find myself, in my (non-Python) work environment, being the
> local expert whenever applications screw up text encodings. So I, for
> one, am very grateful for Python 3's clear separation of bytes and
> text. (And if I sometimes come across as over-dogmatic, I apologise -
> put it down to the enthusiasm of the recent convert :-))

No worries.  I was forced to learn the difference when I wrote my dbf module for 2.5.  Took longer than I'd like to admit to realize that ASCII was an encoding.  :/


> [1] If you cannot see that there's no essential reason why the base-10
> representation '123' should correspond to the bytes b'\x31\x32\x33'
> then you are probably not old enough to have started programming on
> EBCDIC-based computers :-)

I can see it.  :)  But bytes already acknowledges an ASCII bias.  ;)  And even EBCDIC machines speak ASCII when talking telnet.
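
For anyone who has not met EBCDIC, a small sketch of the footnote's point, assuming Python's cp037 codec as a representative EBCDIC code page (there are many variants):

```python
text = "123"

ascii_bytes = text.encode("ascii")    # b'123', i.e. b'\x31\x32\x33'
ebcdic_bytes = text.encode("cp037")   # b'\xf1\xf2\xf3'

# Decoding with the matching codec recovers the same text either way.
round_trip = ebcdic_bytes.decode("cp037")
```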

--
~Ethan~

