On Wed, Mar 11, 2020 at 9:05 PM Steven D'Aprano <steve@pearwood.info> wrote:
In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice properties:
* Every ASCII character encodes to a single byte, so text which only contains ASCII values encodes to precisely the same set of bytes under UTF-8 as under ASCII.
* No Unicode character, except for the Unicode NUL '\0', encodes to a sequence containing a null byte.
These properties are not an accident -- they were carefully designed that way.
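Both properties are easy to check by brute force. Here is a minimal sketch in Python; the names (same_bytes, has_null_byte, offenders) are only for illustration, and surrogates are skipped because they cannot be encoded as UTF-8:

```python
# Property 1: ASCII-only text encodes to the same bytes under ASCII and UTF-8.
ascii_text = "Hello, world!"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")


# Property 2: only U+0000 encodes to a sequence containing a null byte.
def has_null_byte(code_point):
    """Return True if the UTF-8 encoding of code_point contains a 0x00 byte."""
    return 0 in chr(code_point).encode("utf-8")


# Walk every code point except the surrogate range U+D800 to U+DFFF.
offenders = [
    cp
    for cp in range(0x110000)
    if not 0xD800 <= cp <= 0xDFFF and has_null_byte(cp)
]

print(same_bytes)   # True
print(offenders)    # [0]
```

The loop covers a little over a million code points, so it takes a moment to run.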
The second of those is actually part of an even stronger guarantee: No Unicode character except for an ASCII character encodes to a sequence containing a byte less than 128. In other words, the ASCII characters U+0000 to U+007F perfectly correspond to the byte values 0x00 to 0x7F, and *no other UTF-8 sequence* will ever contain one of those byte values.

This makes parsing an ASCII-only file format easy. You don't have to worry about, for instance, finding a byte value 0x3C unless it represents "<". (Though if you're taking a more generic boundary like "whitespace", you'll need to cope with more than just single bytes, since Unicode has non-ASCII whitespace characters too. But for something like HTML, this is safe.)

Other ASCII-compatible encodings make the same guarantees, although a lot of them manage it only by being single-byte encodings, which leaves just 128 non-ASCII characters available.

ChrisA
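To illustrate the stronger guarantee, here is a rough sketch in Python. The sample string and the variable names are made up for the example. It splits UTF-8 data on the raw byte 0x3C without decoding first, then checks that every non-ASCII code point encodes to bytes of 0x80 or above. The last part shows, for contrast, that UTF-16 (which is not ASCII-compatible) offers no such promise.

```python
# Split UTF-8 bytes on the raw "<" byte (0x3C) without decoding first.
sample = "naïve <b>日本語</b> café".encode("utf-8")
pieces = [part.decode("utf-8") for part in sample.split(b"<")]
print(pieces)  # ['naïve ', 'b>日本語', '/b> café']

# Every non-ASCII code point (surrogates excluded) encodes to bytes >= 0x80,
# so a byte below 0x80 in UTF-8 data is always the ASCII character itself.
all_high = all(
    min(chr(cp).encode("utf-8")) >= 0x80
    for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF
)
print(all_high)  # True

# Contrast: in UTF-16, U+3C00 contains the byte 0x3C even though it is not "<".
utf16 = chr(0x3C00).encode("utf-16-le")
print(0x3C in utf16)  # True
```

Each piece decodes cleanly because the split can never land in the middle of a multi-byte sequence.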