[Python-Dev] PEP 460: allowing %d and %f and mojibake

Mon Jan 13 17:30:21 CET 2014

On 01/13/2014 02:48 AM, Stephen J. Turnbull wrote:
> Ethan Furman writes:
>
>> The part that you don't seem to acknowledge (sorry if I missed it)
>> is that there are str-like methods already on bytes.
>
> I haven't expressed myself well, but I don't much care about that.

You don't care that there are str-like methods on bytes?  Whether you do or not, they are there, and they impact how 
people think about bytes and what is (and what should be) allowed.
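
To make that concrete, here is a small sketch of a few of the str-like methods that bytes already has in Python 3.  The header value is made up purely for illustration; nothing here comes from a real protocol trace.

```python
# A made-up ASCII header line, held as bytes (illustrative data only).
header = b"Content-Type: text/plain"

# partition, lower, split and startswith all exist on bytes,
# mirroring the str methods of the same names.
name, _, value = header.partition(b": ")
lowered = name.lower()
parts = value.split(b"/")
is_content = header.startswith(b"Content")

print(lowered, parts, is_content)
```

The arguments have to be bytes as well; mixing in a str (for example header.split("/")) raises TypeError.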

> It's what Knuth would classify as a seminumerical method.

I do not see how that's relevant.  What matters is not how we can manipulate the data (everything is reduced to numbers 
at some point), but what the data represents.

[snip]

>>>> *My* definition is not ambiguous at all.  If this particular part
>>>> of the byte stream is defined to contain ASCII-encoded text, then I
>>>> can use the bytes text methods to work with it.
>>>
>>> But how is Python supposed to know that?
>>
>> Python doesn't need to.
>
> ... because you know it.  But the ideal of object-oriented programming
> (and duck-typing) is that you shouldn't need to; the object should
> know how to produce appropriate behavior itself.

The ideal, sure.  But if you're stuck using a list to hold the data for your higher-order recursive function, are you
going to expect the list data type to "know" which pops and inserts are allowed and which are not?  Of course not.  And
you'd probably build a proper class on top of the list so those things could be checked.  Now imagine that the list type
didn't offer insert and pop, and you had to use slice replacement -- what a pain that would be!

[snip]

>>> But under your definition, you need to make the decision, or
>>> explicitly code the decision, on the basis of context.
>>
>> Exactly so.  I even have to do that in Py2.
>
> "Even."  This is exactly where PBP and EIBTI part company, I think.
> EIBTI thinks it's a bad idea to pass around bytes that are implicitly
> some other type

bytes are /always/ implicitly some other type.  They are basically raw data.  They are given meaning by how we interpret 
them.

[snip]

> Even though "ethan" is perfectly good ASCII-encoded text (as well as
> the integer 435,744,694,638 on a big-endian machine with 5-byte words),
> and you have no way of knowing whether it was user data (CP1251) or a
> metadata keyword (ASCII) or the US national debt in 1967 dollars
> (integer) when b'ethan' shows up in a trace?

Context is everything.  If b'ethan' shows up in a trace I would have to examine the surrounding code to see how those 
bytes were being used.
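
As a rough illustration of the same five bytes under three readings, the sketch below decodes b'ethan' as ASCII, as cp1251, and as a big-endian integer.  The variable names are mine; the integer matches the figure quoted above.

```python
raw = b"ethan"

# Reading 1: ASCII-encoded text (say, a metadata keyword).
as_ascii = raw.decode("ascii")

# Reading 2: cp1251-encoded text; cp1251 agrees with ASCII for
# bytes below 0x80, so the result is the same string here.
as_cp1251 = raw.decode("cp1251")

# Reading 3: one unsigned big-endian integer, five bytes wide.
as_int = int.from_bytes(raw, "big")

print(as_ascii, as_cp1251, as_int)
```

Nothing in the bytes object itself says which of these readings is the intended one; that comes from the surrounding code.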

>> And if there were methods that worked directly on a cp1251-encoded
>> byte stream I would not have any problem using them on
>> cp1251-encoded text.)
>
> I was afraid of that: all of those methods (except the case methods)
> will work fine on a cp1251-encoded text.

Really?  Huh.  They wouldn't work fine with the Spanish alphabet.  I should've used that for my example.  :/

> And because they only know
> that the string is bytes, the case methods will silently corrupt your
> "text" as soon as they get a chance.

Inevitably there are methods that will "work" even if given the wrong data type, while others will either corrupt or 
blow up if not given exactly what they expect.  You tell me that some ASCII methods will work okay on cp1251 text, and 
others will not.  So I'm not going to use any of them on cp1251 as that is not what they are intended for.

> That bothers me, even if it
> doesn't bother you.  Purity again, if you like.  (But you'd take a
> safe .upper if you got it for free, no?)

Well, there is no such thing as free.  ;)  And there already is a safe .upper -- str.upper.  If I didn't know that my
bytes were ASCII, but did know they were text, I wouldn't use the ASCII methods; I'd convert to str and work there.
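
A minimal sketch of the difference, assuming Python 3 (where bytes.upper is documented to change only ASCII letters) and an example string I made up:

```python
# Made-up sample: Cyrillic plus ASCII, encoded as cp1251.
text = "привет abc"
raw = text.encode("cp1251")

# bytes.upper touches only the ASCII letters a-z; the cp1251
# Cyrillic bytes (all >= 0x80) pass through unchanged.
via_bytes = raw.upper()

# Decoding first lets str.upper handle every letter, after which
# the result can be encoded back to cp1251.
via_str = raw.decode("cp1251").upper().encode("cp1251")

assert via_bytes.decode("cp1251") == "привет ABC"
assert via_str.decode("cp1251") == "ПРИВЕТ ABC"
```

So under these assumptions the bytes method does not mangle the Cyrillic, but it uppercases only part of the text and gives no warning that it did so.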

--
~Ethan~