[Python-ideas] Re: prefix/suffix for bytes (was: New explicit methods to trim strings)

March 10, 2020

      On Mar 10, 2020, at 13:18, Christopher Barker <pythonchb@gmail.com> wrote:
...
Getting a bit OT, but I *think* this is the story:
I've heard it argued, by folks that want to write Python software that uses bytes for filenames, that:
A file path on a *nix system can be any string of bytes, except two special values:
b'\x00'   : null
b'\x2f'    : slash
(consistent with this SO post, among many other sources: https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-...)
So any encoding will work, as long as those two values mean the right thing. Practically, null is always null, so that leaves the slash.
No; there are plenty of encodings where a 0 byte doesn’t always mean NUL. And in fact, that’s exactly the problem with UTF-16: every ASCII character in UTF-16 is the same byte preceded or followed (depending on endianness) by a 0 byte. And you aren’t allowed to have arbitrary 0 bytes like that in your paths.
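A minimal sketch of the zero-byte problem, using only Python's built-in codecs. The helper name `nul_count` and the sample path `/tmp/a` are made up for illustration; the explicit `utf-16-le` and `utf-16-be` codecs are used so that no BOM is added and the result does not depend on the machine's byte order.

```python
def nul_count(text: str, codec: str) -> int:
    """Count the 0x00 bytes in the encoded form of `text`."""
    return text.encode(codec).count(0)


sample = "/tmp/a"  # hypothetical path, six ASCII characters

for codec in ("utf-8", "utf-16-le", "utf-16-be"):
    # UTF-8 leaves ASCII untouched; both UTF-16 variants pad every
    # ASCII character with a zero byte, before or after it.
    print(codec, sample.encode(codec), nul_count(sample, codec))
```

None of the characters in the sample is NUL, yet each of them contributes a zero byte under either UTF-16 byte order.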
...
So any encoding that uses b'\x2f' for the slash would work.
Even besides the zero problem, it has to not only always use 0x2f for slash, but also never use 0x2f for anything else. This was a problem for many earlier East Asian encodings, where a slash is 0x2f, but some kanji character is also 0x93 0x2f, or some kana character is 0x2f after a mode shift, etc. In such cases, every 0x2f byte gets treated as a path separator, even the ones that don’t mean slash.
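As one concrete case of the mode-shift variant, here is a sketch using Python's `iso2022_jp` codec. That codec is my choice of example, not one named above, and `stray_slashes` is a hypothetical helper. It counts 0x2f bytes in the encoded form that do not come from a `/` character in the text.

```python
def stray_slashes(text: str, codec: str) -> int:
    """Count 0x2f bytes in the encoded text that are not a '/' character."""
    encoded = text.encode(codec)
    return encoded.count(b"/") - text.count("/")


name = "\u30af"  # KATAKANA LETTER KU, a one-character file name

encoded = name.encode("iso2022_jp")
print(encoded)                            # escape sequences around a two-byte kana
print(stray_slashes(name, "iso2022_jp"))  # one 0x2f byte that is not a slash
print(encoded.split(b"/"))                # what a byte-level path splitter would see
print(stray_slashes(name, "utf-8"))       # UTF-8 has no such byte here
```

Anything that splits the byte string on 0x2f would treat this single-character name as two path components.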

There are encodings that are not ASCII compatible that nevertheless guarantee that 0x00 always means NUL and vice versa, and that 0x2f always means slash and vice versa, like Shift-JIS. Many of them will cause problems in the shell, file manager GUIs, etc., but that’s a different part of the specification (and Unix already allows you to have non-printable, etc. characters in file names, so that problem is there even with ASCII). Many of them also aren’t usable for pathnames on other platforms (e.g., Shift-JIS does guarantee that 0x2f always means slash, but 0x5c doesn’t always mean backslash; it means yen or the second half of various kanji, so you don’t want to use it for byte paths on Windows). But for Unix pathnames, they are usable. But again, UTF-16 is not one of them.
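One rough way to check these guarantees is to encode every character in the Basic Multilingual Plane and see which ones produce a given byte. This is only a sketch: it assumes Python's `shift_jis` codec is a fair stand-in for the encoding, it ignores characters outside the BMP, and `chars_producing` is a hypothetical helper.

```python
def chars_producing(byte: int, codec: str) -> list:
    """Return every BMP character whose encoding in `codec` contains `byte`."""
    hits = []
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded
        ch = chr(cp)
        try:
            encoded = ch.encode(codec)
        except UnicodeEncodeError:
            continue  # character not representable in this codec
        if byte in encoded:
            hits.append(ch)
    return hits


nul_sources = chars_producing(0x00, "shift_jis")
slash_sources = chars_producing(0x2F, "shift_jis")
backslash_sources = chars_producing(0x5C, "shift_jis")

print(nul_sources)             # only NUL itself
print(slash_sources)           # only the slash itself
print(len(backslash_sources))  # many characters, not just the backslash
```

Under that codec, 0x00 and 0x2f each come from exactly one character, while 0x5c also turns up as the second byte of characters such as U+30BD.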
...
Which seems to include, for instance, UTF-16:
In [31]: "/".encode('utf-16')
Out[31]: b'\xff\xfe/\x00'
In this case, you will get very lucky—or, maybe better, unlucky. This is illegal, but in practice no API can detect that it’s illegal, because all of the POSIX and libc functions and most third-party functions just take a null-terminated string, meaning they silently truncate right after the first Latin-1 character, and your string is exactly one Latin-1 character long.
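To see what a C function would actually receive, here is a small sketch that uses `ctypes` to read a byte string the way a NUL-terminated `char *` is read. `as_c_string` is a hypothetical helper, and the four bytes are the ones from the example above, written out literally so the result does not depend on the machine's byte order.

```python
import ctypes


def as_c_string(data: bytes) -> bytes:
    """Return the bytes a C function would see before the first NUL."""
    # create_string_buffer copies the data; .value stops at the first NUL.
    return ctypes.create_string_buffer(data).value


utf16_slash = b"\xff\xfe/\x00"          # BOM, then '/', then its zero byte
print(as_c_string(utf16_slash))         # the trailing zero byte acts as the terminator

utf16_path = "/tmp".encode("utf-16-le")  # hypothetical longer path, no BOM
print(utf16_path)
print(as_c_string(utf16_path))           # everything after the first character is lost
```

The first case happens to survive because the only zero byte is at the end; the second shows the silent truncation for anything longer.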

Andrew Barnert