[Python-ideas] Re: prefix/suffix for bytes (was: New explicit methods to trim strings)

11 Mar 2020

On Wed, Mar 11, 2020 at 9:05 PM Steven D'Aprano <steve@pearwood.info> wrote:
...
In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice
properties:
* Every ASCII character encodes to a single byte, so text which
  only contains ASCII values encodes to precisely the same set
  of bytes under UTF-8 as under ASCII.
* No Unicode character, except for the Unicode NUL '\0', encodes
  to a sequence containing a null byte.
These properties are not an accident -- they were carefully designed
that way.
The second of those is actually part of an even stronger guarantee: No
Unicode character except for an ASCII character encodes to a sequence
containing a byte less than 128. In other words, the ASCII characters
U+0000 to U+007F perfectly correspond to the byte values 0x00 to 0x7F,
and *no other UTF-8 sequence* will ever contain one of those byte
values.
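
A quick way to check this empirically (a sketch rather than a proof; the
helper name low_byte_offenders is made up for illustration) is to encode
every non-ASCII code point and look for any byte below 0x80:

```python
def low_byte_offenders(start, stop):
    """Return code points in [start, stop) whose UTF-8 form has a byte < 0x80."""
    offenders = []
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            offenders.append(cp)
    return offenders


# U+0000..U+007F encode to the identical single byte value.
ascii_is_identity = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80)
)

# Everything above U+007F should encode using only bytes 0x80 and up.
offenders = low_byte_offenders(0x80, 0x110000)
print(ascii_is_identity, offenders)
```

The second check walks the whole code point range, so it takes a moment
to run; narrowing the range is fine for a spot check.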

This makes parsing an ASCII-only file format easy. You don't have to
worry about, for instance, finding a byte value 0x3C unless it
represents "<". (Though if you're splitting on a more generic boundary
like "whitespace", which includes non-ASCII characters, you'll need to
cope with more than just single bytes. But for something like HTML,
this is safe.)
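
As an illustration (a minimal sketch; find_ascii_byte and the sample
string are invented for this example), scanning UTF-8 bytes for 0x3C
finds only real "<" characters, whereas splitting on whitespace at the
bytes level misses non-ASCII whitespace:

```python
def find_ascii_byte(data, marker):
    """Return the offsets of every occurrence of a single ASCII byte."""
    if len(marker) != 1 or marker[0] >= 0x80:
        raise ValueError("marker must be a single ASCII byte")
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + 1)
    return offsets


text = "<p>caf\u00e9 \u2264 \u20ac5</p>"
data = text.encode("utf-8")
offsets = find_ascii_byte(data, b"<")

# Every hit is a real "<", so slicing there never splits a character.
prefix = data[:offsets[-1]].decode("utf-8")

# Whitespace is different: bytes.split() only knows ASCII whitespace,
# while str.split() also splits on characters such as U+2003 EM SPACE.
spaced = "a\u2003b"
byte_fields = spaced.encode("utf-8").split()
text_fields = spaced.split()
```

Here the bytes-level split leaves the EM SPACE inside a single field,
which is the "more generic boundary" caveat in practice.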

Many other ASCII-compatible encodings make the same guarantee, although
a lot of them do this by having only 128 non-ASCII characters available.
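
Latin-1 (ISO-8859-1) is one such single-byte encoding. The following
sketch, with variable names chosen only for illustration, shows that its
non-ASCII repertoire is exactly the 128 byte values 0x80 to 0xFF:

```python
# All byte values above the ASCII range.
high_bytes = bytes(range(0x80, 0x100))

# Latin-1 maps each of them to exactly one character.
non_ascii_chars = high_bytes.decode("latin-1")
count = len(non_ascii_chars)

# Each of those characters encodes back to a single byte >= 0x80,
# so an ASCII byte value can only ever mean the ASCII character.
all_single_high = all(
    len(ch.encode("latin-1")) == 1 and ch.encode("latin-1")[0] >= 0x80
    for ch in non_ascii_chars
)
print(count, all_single_high)
```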

ChrisA

Chris Angelico