[Python-Dev] PEP 461 updates

Stephen J. Turnbull stephen at xemacs.org
Fri Jan 17 10:59:30 CET 2014


Steven D'Aprano writes:
 > On Fri, Jan 17, 2014 at 11:19:44AM +0900, Stephen J. Turnbull wrote:

 > > "ASCII compatible" is a technical term in encodings, which means
 > > "bytes in the range 0-127 always have ASCII coded character semantics,
 > > do what you like with bytes in the range 128-255."[1]
 > 
 > Examples, and counter-examples, may help. Let me see if I have got this 
 > right: an ASCII-compatible encoding may be an ASCII-superset like 
 > Latin-1, or a variable-width encoding like UTF-8 where the ASCII chars 
 > are encoded to the same bytes as ASCII, and non-ASCII chars are not. A 
 > counter-example would be UTF-16, or some of the Asian encodings like 
 > Big5. Am I right so far?

All correct.
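
A minimal Python 3 sketch of those examples and counter-examples. The
sample strings are chosen only for illustration; the codec names are the
standard Python ones.

```python
# Sample text chosen for illustration; any pure-ASCII string works.
text = "Type"
ascii_bytes = text.encode("ascii")

# ASCII-compatible: ASCII characters encode to the same bytes as in ASCII.
results = {codec: text.encode(codec) == ascii_bytes
           for codec in ("latin-1", "utf-8", "utf-16-le")}
print(results)  # {'latin-1': True, 'utf-8': True, 'utf-16-le': False}

# In UTF-8, every byte of a non-ASCII character is in the range 128-255.
# "\u00e9" is LATIN SMALL LETTER E WITH ACUTE.
utf8_high_only = all(b >= 0x80 for b in "\u00e9".encode("utf-8"))

# In Big5, the second byte of a two-byte character can fall in 0-127,
# so such a byte does not always carry its ASCII meaning.
# "\u529f" is a common CJK character used here as the sample.
big5_has_low_byte = any(b < 0x80 for b in "\u529f".encode("big5"))
print(utf8_high_only, big5_has_low_byte)
```

Big5 does encode ASCII characters as themselves; it fails the definition
only because bytes in the range 0-127 can also turn up inside a
multi-byte character.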

 > But Nick isn't talking about an encoding, he's talking about a data 
 > format. I think that an ASCII-compatible format means one where (in at 
 > least *some* parts of the data) bytes between 0 and 127 have the same 
 > meaning as in ASCII, e.g. byte 84 is to be interpreted as ASCII 
 > character "T". This doesn't mean that every byte 84 means "T", only that 
 > some of them do -- hopefully well-defined sections of the data. Below, 
 > you introduce the term "ASCII segments" for these.

Yes, except that I believe Nick, as well as the "file-and-wire guys",
strengthen "hopefully well-defined" to just "well-defined".
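
As a rough sketch of a format with well-defined ASCII segments, consider
a made-up message with an ASCII header, a blank line, and a binary
payload. The layout and the field name are assumptions for illustration,
not any particular protocol.

```python
# Hypothetical message: ASCII header, CRLF CRLF separator, binary payload.
message = b"Content-Length: 4\r\n\r\n\x63\x00\xff\x54"

# The format says where the ASCII segment ends, so split there first.
header, sep, payload = message.partition(b"\r\n\r\n")

# ASCII-assuming methods are fine on the header segment.
name, _, value = header.partition(b":")
field = name.strip().lower()
length = int(value.strip().decode("ascii"))

# The same method applied to the whole message rewrites payload byte
# 0x54 to 0x74, although that byte was never meant as the letter "T".
damaged_payload = message.lower()[-length:]

print(field, length, payload, damaged_payload)
```

Nothing in the payload bytes says whether 0x54 is a "T"; only the format
definition does.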

 > >     <specified bytes methods> are designed for use *only* on bytes
 > >     that are ASCII segments; use on other data is likely to cause
 > >     hard-to-diagnose corruption.
 > 
 > An example: if you have the byte b'\x63', calling upper() on that will 
 > return b'\x43'. That is only meaningful if the byte is intended as the 
 > ASCII character "c".

Good example.
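
The same point as a short Python 3 sketch. The second half assumes, for
illustration only, that the byte is the low byte of a little-endian
16-bit integer rather than a character.

```python
import struct

# As an ASCII character: b"c" becomes b"C", which is what upper() is for.
as_text = b"\x63".upper()

# As binary data: the integer 99 packed as a little-endian unsigned short
# happens to contain the byte 0x63.
raw = struct.pack("<H", 99)
(value_before,) = struct.unpack("<H", raw)

# upper() cannot tell the difference and silently changes the number.
(value_after,) = struct.unpack("<H", raw.upper())

print(as_text, value_before, value_after)  # b'C' 99 67
```

No exception is raised in the second case, which is why such corruption
is hard to diagnose.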

