[Python-Dev] PEP 460 reboot

Tue Jan 14 06:03:51 CET 2014

On Mon, Jan 13, 2014 at 6:25 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> On 1/13/2014 4:32 PM, Guido van Rossum wrote:
>
>> I will doggedly keep posting to this thread rather than creating more
>> threads.
>
> Please permit me to doggedly keep pointing you toward the possible solution
> I posted on the tracker last October.

You're talking about http://bugs.python.org/issue3982 right?

>> But formatb() feels absurd to me. PEP 460 has neither a precise
>> specification nor any actual examples, so I can't tell whether the
>
> Two days ago, I reposted byteformat() here on pydev with a precise text
> specification added to the code, and with an expanded test example. I have
> just added another example based on your question below.

That new example hasn't made it to my inbox yet, and I don't see
anything very recent in that issue either. But I don't think it
matters.

>> intention is that the format string can *only* contain {...} sequences
>> or whether it can also contain "regular" characters. Translating to
>> formatb(), my question comes down to the legality of the following
>> example:
>>
>>    b'Hello, {}'.formatb(name)  # Where name is some bytes object
>>
>> If this is allowed, it reintroduces the ASCII bias (since the
>> substring 'Hello' is clearly ASCII).
>
> Since byteformat() uses re to find {<format-spec>} replacement fields, it
> only has such ascii bias as re has, which I believe is not much, if any. As
> far as re and byteformat are concerned, everything outside of the {...}
> fields is uninterpreted bytes. As far as bytes.join is concerned, both
> joiner and joined are uninterpreted bytes.
>
>>>> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
> b'\x00\x01\x02abcdef'
>
> re.split produces [b'\x00', b'', b'\x02', b'', b'def']. The only ascii bias
> is the one already present in the representation of bytes, and the fact that
> Python code must have an ascii-compatible encoding.

I don't think it's that easy. Just searching for '{' is enough to
break in surprising ways unless the format string is encoded in an
ASCII superset. I can think of two easy examples to illustrate this
(they're similar to the example I posted here before about the
essential ASCII-ness of %c).

First, let's consider EBCDIC. The '{' character in ASCII is hex 7B
(decimal 123). I looked it up (http://en.wikipedia.org/wiki/EBCDIC)
and that is the '#' character in EBCDIC. Surprised yet?
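
A quick way to check this, assuming Python's cp037 codec as a
representative EBCDIC code page (other EBCDIC variants may differ in
detail; the template text is made up for illustration):

```python
EBCDIC = 'cp037'  # assumption: one common EBCDIC code page

# Byte 0x7B is '{' in ASCII but '#' in this EBCDIC code page.
hash_char = b'\x7b'.decode(EBCDIC)

# A hypothetical template that contains both '#' and a real '{}' field.
template = 'Issue #{}'.encode(EBCDIC)

# An ASCII-minded search for b'{' lands on the '#', not on the brace.
false_hit = template.find(b'{')
real_brace = template.find('{'.encode(EBCDIC))
```

Here false_hit points at the '#' and real_brace at the position after
it, so a byte-level search for b'{' finds the wrong character and
misses the actual field.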

Next, let's consider UTF-16. This encoding uses two bytes per
character (except for surrogates), so any character whose top half or
bottom half happens to be 7B hex will cause an incorrect hit for your
regular expression. Ouch.
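
A small sketch of the false hits, assuming little-endian UTF-16 without
a BOM and two characters picked only because one half of each code unit
is 7B hex:

```python
import re

# U+017B has 7B as its low byte; U+7B00 has 7B as its high byte.
# Neither character is '{'.
text = '\u017b\u7b00'
data = text.encode('utf-16-le')

# Byte offsets where a naive bytes regex "finds" an opening brace.
hits = [m.start() for m in re.finditer(rb'\{', data)]
```

The text contains no '{' at all, yet the byte-level regex reports one
match per character.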

Of course, nobody in their right mind would use a format string
containing UTF-16 or EBCDIC. And that is precisely my point. When
you're using a format string, all of the format string (not just the
part between { and }) had better use ASCII or an ASCII superset. And
this (rightly) constrains the output to an ASCII superset as well.

> The advantage of
> byteformat(b'\x00{}\x02{}def', (b'\x01', b'abc',))
> over directly writing
> b''.join([b'\x00', b'\x01', b'\x02', b'abc', b'def'])
> is that one does not have to manually split the presumably constant template
> into chunks and interleave them with the presumably variable chunks.

Yes. And that's a great feature when the output is a known encoding
that's an ASCII superset. But a terrible idea when the encoding is
unconstrained.

> Here is the example that I used for testing, including non-blank format
> specs.
>
> bformat = b"bytes: {}; bytearray: {:}; unicode: {:s}; int: {:5d}; float: {:7.2f}; end"
> objects = (b'abc', bytearray(b'def'), u'ghi', 123, 12.3)
> result = byteformat(bformat, objects)
>>>>
> b'bytes: abc; bytearray: def; unicode: ghi; int:   123; float:   12.30; end'

No surprises here. And in fact I think this is the desired outcome.

> The additional advantage here is the automatic encoding of formatted strings
> to bytes. As posted, byteformat() uses the str.encode defaults
> (encoding='utf-8', errors='strict'). But as I said in the post, these could
> become parameters to the function that are passed on to str.encode.

As long as that encoding is an ASCII superset this might be useful.
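
To illustrate the caveat with a deliberately bad choice (UTF-16 is used
here only as an example of an encoding that is not an ASCII superset;
the variable names are made up):

```python
prefix = b'int: '  # literal part of an ASCII template

# The same formatted field, encoded two different ways.
field_utf8 = format(123, '5d').encode('utf-8')
field_utf16 = format(123, '5d').encode('utf-16-le')

consistent = prefix + field_utf8   # one encoding throughout
mixed = prefix + field_utf16       # ASCII prefix followed by UTF-16 units
```

The mixed result interleaves NUL bytes with the digits, so it is
neither ASCII-compatible text nor valid UTF-16 as a whole.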

> The design reuses re.split, bytes.join, format, and the format
> specification. By re-using the format-spec as is, the only new thing to
> learn is that blank specs correspond to bytes instead of strings. This is
> easier to design, implement, and learn than if the format-spec is limited to
> disallow some things (after much bike-shedding over what to eliminate ;-).
>
> I would appreciate your comment on this proposal.

It seems to be a bit weak on the bytes encoding -- I would like to see
an explicit format code for those (your code looks a little clever in
this area). Others will probably object that it makes it too easy to
encode text by default, although I'm not sure it matters, given that
the behavior is quite different from Python 2's broken treatment of
interpolating Unicode in an 8-bit format string. All in all it mostly
looks like a sane spec though.

-- 
--Guido van Rossum (python.org/~guido)