[Python-Dev] PEP 460 reboot

Terry Reedy tjreedy at udel.edu
Tue Jan 14 22:55:44 CET 2014


Let me answer you both since the issues are related.

On 1/14/2014 7:46 AM, Nick Coghlan wrote:

>> Guido van Rossum writes:
>>   > And that is precisely my point. When you're using a format string,

Bytes interpolation uses a bytes format, or a byte string if you will, 
but it should not be thought of as a character or text string. Certain 
bytes (123 and 125) delimit a replacement field. The bytes in between 
define, in my version, a format-spec after being ascii-decoded to text 
for input to 3.x format(). The decoding and subsequent encoding would 
not be needed if 2.7 format(ob, byte-spec) were available.

>>   > all of the format string (not just the part between { and }) had
>>   > better use ASCII or an ASCII superset.

I am not even sure what you mean here. The bytes outside of a 123 ... 
125 replacement field are simply copied to the output bytes. There is 
no encoding or interpretation involved.
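
To make that concrete, here is a rough sketch of the kind of thing I 
mean by byteformat (illustrative only, not the actual code: a real 
version needs escaping for literal brace bytes and a better error 
than the ValueError that .index() raises):

    def byteformat(fmt, args):
        # Sketch: scan for the metabytes 123 (b'{') and 125 (b'}'),
        # treat the bytes between them as an ascii format-spec, and
        # copy every other byte through untouched.
        out = bytearray()
        it = iter(args)
        i = 0
        while i < len(fmt):
            if fmt[i] == 123:                      # opens a field
                j = fmt.index(125, i)              # closes it
                spec = fmt[i+1:j].decode('ascii')  # spec bytes -> text
                ob = next(it)
                if isinstance(ob, (bytes, bytearray)) and not spec:
                    out += ob                      # raw bytes pass through
                else:
                    out += format(ob, spec).encode('ascii')
                i = j + 1
            else:
                out.append(fmt[i])                 # copied, uninterpreted
                i += 1
        return bytes(out)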

It is true that the uninterpreted bytes had best not contain a byte 
pattern mistakenly recognized as a replacement field. I plan to refine 
the regular expression byte pattern used in byteformat to sharply 
reduce the possibility of such errors. When such errors happen anyway, 
an exception should be raised, and I plan to expand the error message 
to give more diagnostic information.

>> And this (rightly) constrains the output to an ASCII superset as well.

What does this mean? I suspect I disagree. The bytes interpolated into 
the output bytes can be any bytes.

>> Except that if you interpolate something like Shift JIS,

Bytes interpolation interpolates bytes, not encodings. A 
self-identifying byte stream starts with bytes in a known encoding that 
specifies the encoding of the rest of the stream. Neither part need be 
encoded text. (Would that something like this were standard for 
encoded text streams, as well as for serialized images.)
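
For instance (a toy layout of my own, not any particular protocol), a 
writer and a reader can agree on an ascii header naming the encoding 
of the payload, while the code that splices and splits the stream 
never treats either part as text:

    payload = 'こんにちは'.encode('shift_jis')     # arbitrary bytes here
    stream = b'charset=shift_jis\n' + payload      # header + payload
    header, _, rest = stream.partition(b'\n')
    encoding = header.split(b'=')[1].decode('ascii')
    assert rest.decode(encoding) == 'こんにちは'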

 >> [snip]

> Right, that's the danger I was worried about, but the problem is that
> there's at least *some* minimum level of ASCII compatibility that
> needs to be assumed in order to define an interpolation format at all
> (this is the point I originally missed).

I would put this slightly differently. To process bytes, we may define 
certain bytes as metabytes with a special meaning. We may choose the 
bytes that happen to be the ascii encoding of certain characters. But 
once the special numbers are chosen, they are numbers, not characters.
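
Concretely, the same five bytes can be spelled either way; the 
character spelling is only a convenience:

    assert b'{'[0] == 123 and b'}'[0] == 125
    assert bytes([123, 48, 50, 120, 125]) == b'{02x}'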

The problem of metabytes having both a normal and special meaning is 
similar to the problem of metacharacters having both a normal and 
special meaning.

> For printf-style formatting,
> it's % along with the various formatting characters and other syntax
> (like digits, parentheses, variable names and "."), with the format
> method it's braces, brackets, colons, variable names, etc.

It is the bytes corresponding to these characters. This is true also of 
the metabytes in an re module bytes pattern.
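
For example, a bytes pattern built from the same two metabytes matches 
on raw byte values, with no decoding anywhere:

    import re
    pat = re.compile(rb'\{([^{}]*)\}')                     # 123 ... 125
    m = pat.search(bytes([1, 2, 123, 52, 100, 125, 200]))  # has b'{4d}'
    assert m.group(1) == b'4d'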

> The mini-language parser has to assume an encoding
> in order to interpret the format string,

This is where I disagree with you and Guido. Bytes processing is done 
with numbers 0 <= n <= 255, not characters. The fact that ascii 
characters can, for convenience, be used in bytes literals to indicate 
the corresponding ascii codes does not change this. A bytes parser looks 
for certain special numbers. Other numbers need not be given any 
interpretation and need not represent encoded characters.
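
With the sketch above (still hypothetical), bytes such as 0 or 200 
that are not meaningful in any particular text encoding simply pass 
through:

    assert (byteformat(bytes([0, 123, 125, 200]), (b'\xff',))
            == bytes([0, 255, 200]))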

 > and that's *all* done assuming an ASCII compatible format string

Since any bytes can be regarded as an ascii-compatible latin-1 
encoded string, that seems like a vacuous assumption. In any case, I do 
not see any particular assumption in the following, other than the 
choice of replacement field delimiters.

>>> list(byteformat(bytes([1, 2, 10, 123, 125, 200]),
...                 (bytes([50, 100, 150]),)))
[1, 2, 10, 50, 100, 150, 200]

 > (which must make life interesting if you try to use an
> ASCII incompatible coding cookie for your source code - I'm actually
> not sure what the full implications of that *are* for bytes literals
> in Python 3).

An interesting and important question. The Python 2 manual says that 
the coding cookie applies only to comments and string literals. To me, 
this suggests that any encoding can be used. I am not sure how and when 
the encoding is applied. It suggests that the sequence of bytes 
resulting from a string literal is not determined solely by the 
sequence of characters comprising the literal, but also depends on the 
coding cookie.

The Python 3 manual says that the coding cookie applies to the whole 
source file. To me, this says that the subset of unicode chars included 
in the encoding *must* include the ascii characters. It also suggests 
to me that the encoding must be ascii-compatible, in order to read the 
encoding name in the ascii-text coding cookie (unless there is a 
fallback to the system encoding).

In any case, a 3.x source file is decoded to unicode. When the sequence 
of unicode chars comprising a bytes literal is interpreted, the 
resulting sequence of bytes depends only on the literal and not the file 
encoding. So list(b'{}'), for instance, should always be [123, 125] in 
3.x. My comments above about byte processing assume that this is so.
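
One way to check this (illustrative only): compile the same 
bytes-literal source under two different coding cookies and confirm 
that the constant does not change:

    src = "# -*- coding: {} -*-\nDATA = b'{{}}'\n"
    for enc in ('utf-8', 'latin-1'):
        ns = {}
        exec(compile(src.format(enc).encode(enc), '<test>', 'exec'), ns)
        assert list(ns['DATA']) == [123, 125]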

-- 
Terry Jan Reedy


