[Python-Dev] PEP 460: allowing %d and %f and mojibake

Ethan Furman ethan at stoneleaf.us
Mon Jan 13 05:00:33 CET 2014

On 01/12/2014 07:02 PM, Stephen J. Turnbull wrote:

[snip most of very eloquent reply]

Thank you, Stephen, for remaining calm despite my somewhat heated response.

A few comments in-line.

I now better understand your viewpoint about text always being unicode strings; I just happen to disagree.

As some consolation, I hope, I will be very vocal about using str unless bytes is necessary.  Any application that uses 
text should be using str for it, and using bytes only where necessary on the back-end.

> Ethan Furman writes:
>> In only one case did I use the word "text" loosely,
> [...] Bytes are *never* Python 3 text in my terminology [...] "ASCII-encoded text"
> as you call it [...] and want to manipulate using str-like methods on bytes

The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes.  
While the actual implementation of isupper (your example from below) may be done using integer methods, it only makes 
semantic sense if interpreted as ASCII-encoded text.
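
A minimal sketch of the point above, using only the built-in bytes methods.  The sample values are made up for 
illustration; the methods treat the bytes for ASCII letters as letters and leave every other byte alone.

```python
# bytes already carries str-like methods that assume ASCII letters.
data = b"ethan"
upper = data.upper()            # the bytes for a-z become the bytes for A-Z
is_upper = upper.isupper()      # only meaningful if the bytes are ASCII text

# Bytes outside the ASCII letter ranges are not touched.
raw = bytes([0xE9, 0x61])       # 0xE9 is not an ASCII letter, 0x61 is ASCII 'a'
raw_upper = raw.upper()         # only the 0x61 byte changes
```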

> is *exactly* the Python 2 model of text.  But you deny that the
> effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python
> 2's bytes/character confusion, don't you?

Given that the default (and only) text type in Py3 is str, which is unicode, I don't think any confusion will be as 
severe, but I acknowledge that there could be some.
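
For readers who want to try the interpolation under discussion: at the time of this thread it was only a proposal, but 
%-formatting on bytes was later added in Python 3.5 (PEP 461).  The sketch below assumes Python 3.5 or later and shows 
that the result stays bytes and that a str argument to %s is rejected rather than silently encoded.

```python
# Assumes Python 3.5+, where bytes %-formatting exists (PEP 461).
as_int = b"%d" % (12,)          # numeric codes produce ASCII digits
as_float = b"%.2f" % (1.5,)     # %f is supported as well

# A str is never implicitly encoded into the bytes result.
try:
    b"%s" % ("text",)
    str_rejected = False
except TypeError:
    str_rejected = True
```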

> I hardly think Nick is *lying*, any more than you are.  AFAICT, you're
> *both* wrong.

LOL, well, at least I'm in good company, then!  :)

>> I think some of the misunderstanding (which you also seem to suffer
>> from) is that we (or at least I) /ever/ want a unicode string back
>> from bytes interpolation.  I don't!
> Please tell me why you think I suffer from that misunderstanding.

I no longer recall, but whatever misapprehension I was suffering from you have alleviated.  (That sentence would make my 
daughter proud!  English major. ;)

> But did you get that I'm worried that programmers in Omaha will use
> that same functionality to communicate American English (for which it
> is basically sufficient, and which also requires ASCII when bytes are
> used for communication)?

Yes, I get that.  Hopefully their friends and neighbors will slap them with fishes if they do.

>> *My* definition is not ambiguous at all.  If this particular part
>> of the byte stream is defined to contain ASCII-encoded text, then I
>> can use the bytes text methods to work with it.
> But how is Python supposed to know that?

Python doesn't need to.  bytes is a low-level object -- it could contain music, movies, dbf data, pdf data, or my 
mother's cheesecake recipe (properly encoded, of course).  Python can't protect me from treating a music file as if it 
were a movie file, or even just writing proper music info at the wrong place in the music file;  all that is up to me, 
as the programmer, to get right, and to understand what is needed.

> But under your definition, you need to make the decision, or
> explicitly code the decision, on the basis of context.

Exactly so.  I even have to do that in Py2.

>> If that particular configuration of bytes is because it's
>> ASCII-encoded text, then sure.
> Once again, you are advocating precisely the Python 2 model of text.

Not exactly, because what I get back is bytes, which cannot directly be mixed with unicode (str) as it was in Py2.  I 
think this is a key difference.
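
A small sketch of that difference in Python 3.  The helper name try_concat is hypothetical and exists only to capture 
the exception for illustration.

```python
def try_concat(left, right):
    """Return left + right, or the exception's class name if Python refuses."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__

# Python 3 refuses to mix bytes and str; there is no implicit coercion.
mixed = try_concat(b"age: ", "43")

# The programmer has to decode (or encode) explicitly at the boundary.
explicit = b"age: ".decode("ascii") + "43"
```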

>> To use, for example, bytes.upper on data that wasn't
>> ASCII-encoded text (even if it happened to look like it was) would
>> be the height of stupidity.  Please don't include me in such
>> accusations.
> I have no idea why you think I think anybody would be that stupid.
> That never occurred to me.  It's precisely "magic numbers" that happen
> to look like English words when interpreted as ASCII coded characters
> that I don't want manipulated by str-like methods that interpret text
> (such as full-featured format or %).

This confuses me somewhat.  It's okay to use b'ethan'.upper(), which only makes semantic sense as ASCII-encoded text, 
but b'age: %d' % 43 isn't?  (Aside, I'm perfectly comfortable with "ASCII-encoded text" because if you took 
u'ethan'.encode('ascii') you would get b'ethan'.  If it was some other encoding, such as cp1251, I would call that 
particular byte stream "cp1251-encoded text".  And if there were methods that worked directly on a cp1251-encoded byte 
stream I would not have any problem using them on cp1251-encoded text.)
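
To make the aside concrete, here is a sketch with made-up sample strings.  It shows that encoding ASCII text yields the 
matching bytes literal, and that the existing bytes methods are ASCII-only: on cp1251-encoded Cyrillic text upper() 
changes nothing, so the case conversion has to go through str.

```python
# Encoding ASCII text gives exactly the bytes literal one would write by hand.
ascii_bytes = "ethan".encode("ascii")

# cp1251-encoded Cyrillic: every byte is above 0x7F, so none is an ASCII letter.
cyr = "этан".encode("cp1251")
cyr_upper = cyr.upper()          # bytes.upper() leaves these bytes unchanged

# Upper-casing cp1251-encoded text means decoding, converting, and re-encoding.
decoded_upper = cyr.decode("cp1251").upper().encode("cp1251")
```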

> What Nick
> means by a "boundary type" is a type that works seamlessly with the
> types on each side of the boundary as a helper in the conversion.  So
> when you use a struct to pack a bool, an int, and a date into a bytes,
> the struct is the boundary type.  And if there's a helper type to work
> with bytes and/or str simultaneously, that's a boundary type, eg,
> asciistr.  But bytes itself is not a boundary type, it's just a type
> with no internal structure, not even characters.

Hmmm.  I'll have to think about this.

Okay, I've thought somewhat.  Under the definition above would it be fair to say that Db3Table (a class in my dbf 
module) is a boundary type?  It sits between the actual file and the program, and transforms bytes into actual Python types.
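
As an illustration of the struct case in the quoted definition, here is a sketch that packs a bool, an int, and a date 
into bytes and back.  The record layout (little-endian, date stored as year/month/day) and the names RECORD, 
pack_record, and unpack_record are assumptions made for this example; they are not taken from the dbf module.

```python
import struct
from datetime import date

# Hypothetical layout: bool, 32-bit int, then the date as year (H), month (B), day (B).
RECORD = struct.Struct("<?iHBB")

def pack_record(flag, number, day):
    """Convert Python values into a fixed-size bytes record."""
    return RECORD.pack(flag, number, day.year, day.month, day.day)

def unpack_record(raw):
    """Convert a bytes record back into Python values."""
    flag, number, year, month, dom = RECORD.unpack(raw)
    return flag, number, date(year, month, dom)

packed = pack_record(True, 43, date(2014, 1, 13))
restored = unpack_record(packed)
```

Here the Struct object, not the bytes it produces, carries the knowledge of what each byte means.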

