On Mar 10, 2020, at 08:01, David Mertz <mertz@gnosis.cx> wrote:

Most real-world UNIX systems only support ASCII-compatible encodings. There's no reason not to solve the problem on such systems by using os.fsdecode().

Huh?!

Is my Ubuntu derivative not "real world"?

666-tmp % uname -a
Linux popkdm 5.3.0-7629-generic #31~1581628825~19.10~f90b7d5-Ubuntu SMP Fri Feb 14 19:56:45 UTC  x86_64 x86_64 x86_64 GNU/Linux
667-tmp % touch ✗—Not-ASCII
668-tmp % ls ✗*
✗—Not-ASCII

Technically, your Ubuntu derivative is not a real-world UNIX system, because it’s not a UNIX system. Only a handful of Linux distros bother to get certified, because it’s not worth the cost unless you need to sell to some corporate or government department that has a regulation requiring UNIX.

And practically, I’m pretty sure that’s UTF-8, which is ASCII-compatible: every byte from 0-127 always means the same thing as it does in ASCII. This means you can, e.g., do path.split(os.pathsep.encode('ascii')) and know you’re getting the right behavior. The same thing works for Latin-1 and friends, the IBM code pages in the “extended ASCII” group, and so on. Those are the kinds of things Random was presumably talking about, because they are commonly used in real-world UNIX systems.
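To make that concrete, here is a minimal sketch of the bytes-level split. The helper names (split_raw, same_as_text_split) and the sample paths are made up for illustration, and the separator is hard-coded to ":" (the POSIX value of os.pathsep) so it behaves the same on any platform:

```python
# ":" is os.pathsep on POSIX; hard-coded so the sketch behaves the same anywhere.
SEP = ":"

def split_raw(raw):
    """Split an encoded, separator-joined list without decoding it first."""
    return raw.split(SEP.encode("ascii"))

def same_as_text_split(parts, encoding):
    """Check that splitting the bytes agrees with splitting the text."""
    raw = SEP.join(parts).encode(encoding)
    return [p.decode(encoding) for p in split_raw(raw)] == list(parts)

# UTF-8: multi-byte characters never contain a byte below 0x80.
print(same_as_text_split(["/usr/bin", "/home/café/bin", "/opt/✗"], "utf-8"))
# Latin-1: single-byte, with 0-127 identical to ASCII.
print(same_as_text_split(["/usr/bin", "/home/café/bin"], "latin-1"))
```

Both calls should print True, since in either encoding the byte for ":" can only ever mean ":".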

There are also things that are not ASCII-compatible but are close. For example, in Shift-JIS, a couple of low bytes have a different meaning than in ASCII, and many of them can also appear as the second byte of a 2-byte character. But ASCII NUL and slash still always mean NUL and slash, so you can use it for your Linux filesystems. (Although you will have a lot of trouble in the shell, because your backslash escape is now a yen escape, and 64 other characters have the same byte invisibly as their second byte.)
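A quick sketch of the Shift-JIS case, using Python's shift_jis codec and a made-up path. The two characters chosen here are ones whose second byte is 0x5C, the byte ASCII uses for backslash:

```python
text = "/home/ソ/表.txt"            # no backslash anywhere in the text
raw = text.encode("shift_jis")

# Splitting the raw bytes on slash is safe: 0x2F is never a second byte.
slash_parts = [p.decode("shift_jis") for p in raw.split(b"/")]
print(slash_parts)                  # ['', 'home', 'ソ', '表.txt']

# The backslash byte, however, shows up inside both 2-byte characters.
backslash_bytes = raw.count(b"\\")
print(text.count("\\"), backslash_bytes)   # 0 2
```

So code that splits on slash at the bytes level keeps working, while anything that treats 0x5C as an escape character (like the shell) will cut those characters in half.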

Things that are not even that ASCII-compatible include UTF-16, EBCDIC code pages, 80s Atari encoding, etc.; they are not commonly used in real-world UNIX systems. Which I think was Random’s point.
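For contrast, a rough check of why these are a different category. The helper name encodes_ascii_unchanged is made up, and it only tests a necessary condition (ASCII characters encode to their own byte), not full ASCII compatibility; cp037 is used here as one example of an EBCDIC code page:

```python
def encodes_ascii_unchanged(encoding):
    """True if every ASCII character encodes to its own single byte.

    Necessary but not sufficient for ASCII compatibility: it does not
    check whether low bytes also appear inside multi-byte characters.
    """
    return all(chr(i).encode(encoding) == bytes([i]) for i in range(128))

for name in ("utf-8", "latin-1", "utf-16-le", "cp037"):
    print(name, encodes_ascii_unchanged(name))

# UTF-16 puts a NUL byte after every ASCII character, and NUL cannot
# appear in a POSIX filename.
utf16 = "a/b".encode("utf-16-le")
print(b"\x00" in utf16)             # True

# In EBCDIC the slash is not even the same byte as in ASCII.
ebcdic_slash = "/".encode("cp037")
print(ebcdic_slash == b"/")         # False
```

The first two encodings pass the check and the last two fail it.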