[Python-Dev] PEP 461 updates
Stephen J. Turnbull
stephen at xemacs.org
Fri Jan 17 10:59:30 CET 2014
Steven D'Aprano writes:
> On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:
> > "ASCII compatible" is a technical term in encodings, which means
> > "bytes in the range 0-127 always have ASCII coded character semantics,
> > do what you like with bytes in the range 128-255."[1]
>
> Examples, and counter-examples, may help. Let me see if I have got this
> right: an ASCII-compatible encoding may be an ASCII-superset like
> Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars
> are encoded to the same bytes as ASCII, and non-ASCII chars are not. A
> counter-example would be UTF-16, or some of the Asian encodings like
> Big5. Am I right so far?
All correct.
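
A quick way to see the difference, as a rough sketch only: encode text
that contains no ASCII characters at all and look for bytes below 128 in
the result. The helper name is made up for illustration, and the sample
characters are my own choice, not anything from the PEP.

```python
def ascii_range_bytes(text, codec):
    """Return the bytes < 128 produced by encoding `text` with `codec`.

    Hypothetical helper: `text` is assumed to contain no ASCII characters,
    so an ASCII-compatible encoding should produce no such bytes.
    """
    return bytes(b for b in text.encode(codec) if b < 128)


# ASCII-compatible: non-ASCII characters never yield bytes in 0-127.
latin1_leak = ascii_range_bytes("\u00e9", "latin-1")
utf8_leak = ascii_range_bytes("\u00e9\u529f", "utf-8")

# Not ASCII-compatible: bytes in 0-127 appear inside non-ASCII characters.
utf16_leak = ascii_range_bytes("\u00e9", "utf-16-le")
big5_leak = ascii_range_bytes("\u529f", "big5")

for name, leak in [("latin-1", latin1_leak), ("utf-8", utf8_leak),
                   ("utf-16-le", utf16_leak), ("big5", big5_leak)]:
    print(name, "leaks ASCII-range bytes:", bool(leak))
```

In UTF-16 and Big5, a byte that looks like an ASCII character may really
be half of some other character, which is why they fail the definition.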
> But Nick isn't talking about an encoding, he's talking about a data
> format. I think that an ASCII-compatible format means one where (in at
> least *some* parts of the data) bytes between 0 and 127 have the same
> meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII
> character "T". This doesn't mean that every byte 84 means "T", only that
> some of them do -- hopefully well-defined sections of the data. Below,
> you introduce the term "ASCII segments" for these.
Yes, except that I believe Nick, as well as the "file-and-wire guys",
strengthen "hopefully well-defined" to just "well-defined".
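
To make "well-defined" concrete, here is a minimal sketch with an
invented format (ASCII header, blank line, opaque binary payload). The
format and the variable names are assumptions for illustration, not any
particular protocol. The point is that the format itself says where the
ASCII segment ends, so the text-like bytes methods are applied only there.

```python
# Hypothetical message: an ASCII header, a blank line, then binary payload.
message = b"content-type: application/octet-stream\r\n\r\n\x63\x00\xff\x54"

# The format defines the boundary, so the split needs no guessing.
header, _, payload = message.partition(b"\r\n\r\n")

# The header is an ASCII segment: strip() and upper() are meaningful here.
name, _, value = header.partition(b":")
name = name.strip().upper()
value = value.strip()

# The payload is not an ASCII segment, so it is passed through untouched,
# even though it happens to contain the bytes 0x63 ("c") and 0x54 ("T").
print(name, value, payload)
```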
> > <specified bytes methods> are designed for use *only* on bytes
> > that are ASCII segments; use on other data is likely to cause
> > hard-to-diagnose corruption.
>
> An example: if you have the byte b'\x63', calling upper() on that will
> return b'\x43'. That is only meaningful if the byte is intended as the
> ASCII character "c".
Good example.
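
Spelled out as a small sketch, with a counter-case added: the same byte
value inside a packed integer is silently changed by upper(). The struct
layout below is just an assumed example of non-ASCII data.

```python
import struct

# Inside an ASCII segment, upper() does what you expect.
assert b"\x63".upper() == b"\x43"          # b"c" -> b"C"

# Outside one, the same bytes may be part of something else entirely.
# Here 0x6363 is a little-endian 16-bit integer (an assumed layout).
original = 0x6363
packed = struct.pack("<H", original)        # b"cc" by coincidence
damaged = struct.unpack("<H", packed.upper())[0]

print(original, "->", damaged)              # the integer has changed
```

No exception is raised anywhere, which is what makes this kind of
corruption hard to diagnose.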