[Python-Dev] PEP 460 reboot

Nick Coghlan ncoghlan at gmail.com
Tue Jan 14 06:25:29 CET 2014


On 14 January 2014 15:03, Guido van Rossum <guido at python.org> wrote:
> I don't think it's that easy. Just searching for '{' is enough to
> break in surprising ways unless the format string is encoded in an
> ASCII superset. I can think of two easy examples to illustrate this
> (they're similar to the example I posted here before about the
> essential ASCII-ness of %c).
>
> First, let's consider EBCDIC. The '{' character in ASCII is hex 7B
> (decimal 123). I looked it up (http://en.wikipedia.org/wiki/EBCDIC)
> and that is the '#' character in EBCDIC. Surprised yet?
>
> Next, let's consider UTF-16. This encoding uses two bytes per
> character (except for surrogates), so any character whose top half or
> bottom half happens to be 7B hex will cause an incorrect hit for your
> regular expression. Ouch.
>
> Of course, nobody in their right mind would use a format string
> containing UTF-16 or EBCDIC. And that is precisely my point. When
> you're using a format string, all of the format string (not just the
> part between { and }) had better use ASCII or an ASCII superset. And
> this (rightly) constrains the output to an ASCII superset as well.

In case it got lost amongst the various threads, this was the argument
that finally convinced me that interpolation *inherently* assumes an
ASCII-compatible encoding: the assumption of ASCII compatibility is
embedded in the design of the formatting syntax for both printf-style
formatting and the format methods. That places interpolation support
squarely in the same category as all the other bytes methods that
inherently assume ASCII, so it remains consistent with the Python 3
text model.
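
A minimal sketch of Guido's two examples, for anyone who wants to see
the false hits directly. It assumes the cp037 codec (one EBCDIC code
page shipped with CPython) as the stand-in for "EBCDIC", and U+7B00 as
an arbitrary code point picked only because one of its UTF-16 bytes is
0x7B. The helper name contains_brace_byte is made up for illustration
and is not part of any proposed API.

```python
# '{' is byte 0x7B in ASCII and in every ASCII superset.
LBRACE = b"{"


def contains_brace_byte(data: bytes) -> bool:
    """Naive metacharacter scan: is byte 0x7B present anywhere in data?"""
    return LBRACE in data


# Assumption: cp037 stands in for "EBCDIC" here.
ebcdic_hash = "#".encode("cp037")    # b'\x7b', the same byte as ASCII '{'
ebcdic_brace = "{".encode("cp037")   # b'\xc0', not 0x7B at all

# Assumption: U+7B00 is chosen only because its low-order UTF-16 unit
# byte layout contains 0x7B; the text itself contains no '{'.
utf16_text = "\u7b00".encode("utf-16-le")   # b'\x00\x7b'

# ASCII supersets behave as the scan expects.
utf8_template = "caf\u00e9 {}".encode("utf-8")
latin1_template = "caf\u00e9 {}".encode("latin-1")

print(contains_brace_byte(ebcdic_hash))      # True: false hit on '#'
print(contains_brace_byte(ebcdic_brace))     # False: real '{' is missed
print(contains_brace_byte(utf16_text))       # True: false hit inside a character
print(contains_brace_byte(utf8_template))    # True: genuine '{'
print(contains_brace_byte(latin1_template))  # True: genuine '{'
```

The scan only gives the right answer for the last two templates, which
are exactly the ASCII-superset cases.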

Originally I was thinking that the ASCII assumption applied only if
one of the passed-in *values* needed to be implicitly encoded as
ASCII, without accounting for the fact that the parser itself assumes
ASCII compatibility when searching for formatting metacharacters. Once
Guido pointed out that oversight on my part, my objections collapsed,
since this observation makes it clear that there's *no* coherent way
to offer a pure binary interpolation API - the only general-purpose
combination mechanism for segments of binary data that can avoid
making assumptions about the encodings of metacharacters is simple
concatenation.
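
As a rough illustration of that last point, the sketch below combines
segments in three unrelated encodings using only concatenation. The
segment contents and variable names are invented for the example, and
cp037 is again assumed as the EBCDIC stand-in. Nothing is scanned for
metacharacters, so every byte passes through untouched.

```python
# Three segments with unrelated (or no) text encodings.
header = "size=".encode("cp037")            # EBCDIC text (assumed code page)
payload = bytes([0x7B, 0x00, 0xFF, 0x7D])   # raw binary that happens to hold 0x7B / 0x7D
trailer = "\u7b00".encode("utf-16-le")      # UTF-16 text containing byte 0x7B

# Concatenation never inspects the bytes, so no encoding is assumed.
combined = header + payload + trailer

# b"".join does the same job for a sequence of segments.
joined = b"".join([header, payload, trailer])

# Each segment can be recovered byte-for-byte from known lengths.
h_end = len(header)
p_end = h_end + len(payload)
recovered = (combined[:h_end], combined[h_end:p_end], combined[p_end:])

print(combined == joined)                        # True
print(recovered == (header, payload, trailer))   # True
```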

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

