[Python-Dev] PEP 461 updates

Fri Jan 17 06:36:11 CET 2014

On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
> Meta enough that I'll take Guido out of the CC.
> 
> Nick Coghlan writes:
> 
>  > There are plenty of data formats (like SMTP and HTTP) that are
>  > constrained to be ASCII compatible,
> 
> "ASCII compatible" is a technical term in encodings, which means
> "bytes in the range 0-127 always have ASCII coded character semantics,
> do what you like with bytes in the range 128-255."[1]

Examples, and counter-examples, may help. Let me see if I have got this 
right: an ASCII-compatible encoding may be an ASCII-superset like 
Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars 
are encoded to the same bytes as ASCII, and non-ASCII chars are not. A 
counter-example would be UTF-16, or some of the Asian encodings like 
Big5. Am I right so far?
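
To make that concrete, here is a rough sketch of how I would test my
understanding. The helper name looks_ascii_compatible and the sample
strings are my own inventions, and it only checks the characters it is
given, so it can show an encoding is *not* ASCII-compatible but cannot
prove that one is.

```python
def looks_ascii_compatible(sample, codec):
    """Sample-based check, not a proof: each ASCII character must encode
    to its own single byte, and no non-ASCII character may produce a
    byte below 128."""
    for ch in sample:
        encoded = ch.encode(codec)
        if ord(ch) < 128:
            if encoded != bytes([ord(ch)]):
                return False
        elif any(b < 128 for b in encoded):
            return False
    return True

# "T\u00e9" is "T" plus e-acute; "T\u529f" is "T" plus a CJK character
# whose Big5 trail byte, as far as I know, falls in the ASCII range.
results = {
    "latin-1": looks_ascii_compatible("T\u00e9", "latin-1"),
    "utf-8": looks_ascii_compatible("T\u00e9", "utf-8"),
    "utf-16-le": looks_ascii_compatible("T\u00e9", "utf-16-le"),
    "big5": looks_ascii_compatible("T\u529f", "big5"),
}
print(results)
```

If I have this right, Latin-1 and UTF-8 pass, UTF-16 fails because "T"
becomes two bytes, and Big5 fails because a multi-byte character can
contain a byte below 128.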

But Nick isn't talking about an encoding, he's talking about a data 
format. I think that an ASCII-compatible format means one where (in at 
least *some* parts of the data) bytes between 0 and 127 have the same 
meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII 
character "T". This doesn't mean that every byte 84 means "T", only that 
some of them do -- hopefully in well-defined sections of the data. Below, 
you introduce the term "ASCII segments" for these.

> Worse, it's clearly confusing in this discussion.  Let's stop using
> this term to mean
> 
>     the data format has elements that are defined to contain only
>     bytes with ASCII coded character semantics
> 
> (which is the relevant restriction AFAICS -- I don't know of any
> ASCII-compatible formats where the bytes 128-255 are used for any
> purpose other than encoding non-ASCII characters).  OTOH, if it *is*
> an ASCII-compatible text encoding, the semantics are dubious if the
> bytes versions of many of these methods/operations are used.
> 
> A documentation suggestion: It's easy enough to rewrite
> 
>  > constrained to be ASCII compatible, either globally, or locally in
>  > the parts being manipulated by an application (such as a file
>  > header). ASCII incompatible segments may be present, but in ways
>  > that allow the data processing to handle them correctly.
> 
> as 
> 
>     containing 'well-defined segments constrained to be (strictly)
>     ASCII-encoded' (aka ASCII segments).
> 
> And then you can say 
> 
>     <specified bytes methods> are designed for use *only* on bytes
>     that are ASCII segments; use on other data is likely to cause
>     hard-to-diagnose corruption.

An example: if you have the byte b'\x63', calling upper() on that will 
return b'\x43'. That is only meaningful if the byte is intended as the 
ASCII character "c".
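
Here is a small sketch of how that goes wrong. The one-byte "count"
field is my own invention for illustration; the point is only that the
same byte value can be text in one place and a number in another.

```python
import struct

# As text: byte 0x63 is ASCII "c", and upper() does what you expect.
as_text = b"\x63"
assert as_text.upper() == b"C"

# As a number: the same byte packed as an unsigned 8-bit count of 99.
count_field = struct.pack("B", 99)
assert count_field == as_text

# upper() cannot tell the difference, so the count silently changes.
corrupted = struct.unpack("B", count_field.upper())[0]
print(corrupted)

# Bytes in the range 128-255 are left untouched by upper().
high_byte_unchanged = b"\xe9".upper() == b"\xe9"
```

No exception is raised anywhere, which is why this sort of corruption
would be hard to diagnose.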

> Footnotes: 
> [1]  "ASCII coded character semantics" is of course mildly ambiguous
> due to considerations like EOL conventions.  But "you know what I'm
> talking about".

I think I know what you're talking about, but I won't know for sure 
unless I explain it back to you.

-- 
Steven