Christopher, I'm not sure how much of the following you already know, so excuse me in advance if I'm covering old ground for you. But hopefully it will be helpful for someone!

On Tue, Mar 10, 2020 at 01:18:22PM -0700, Christopher Barker wrote:
Getting a bit OT, but I *think* this is the story:
I've heard it argued, by folks that want to write Python software that uses bytes for filenames, that:
A file path on a *nix system can be any string of bytes, except two special values:
b'\x00' : null
b'\x2f' : slash
To be precise: file *names* cannot contain null bytes or slashes. Paths can contain slashes, which represent directory separators. To be even more precise: technically this depends on the file system, not the OS. There are a tiny handful of file systems that support NULs and/or slashes in file names, although whether you could actually operate on those files in practice is another story. In practice, the prohibition on NUL and slash is baked so deeply into the Unix world at all levels (OS, shells, applications) that even if you had a file system that supported them, I doubt you would be able to use those characters in file names.
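As a rough sketch of that rule (not how any kernel or file system actually implements it), a single file name given as bytes could be checked like this. The helper name `is_valid_component` is hypothetical, and rejecting the empty name is an extra assumption made for illustration:

```python
def is_valid_component(name: bytes) -> bool:
    """Return True if `name` could be a single Unix file name.

    Hypothetical helper for illustration only: it checks the two
    bytes that are forbidden essentially everywhere (NUL and slash)
    and, as an extra assumption, rejects the empty name.
    """
    return bool(name) and b'\x00' not in name and b'/' not in name


print(is_valid_component(b'spam'))                     # True
print(is_valid_component(b'tmp/spam'))                 # False: the slash separates components
print(is_valid_component('spam'.encode('utf-16le')))   # False: embedded NUL bytes
print(is_valid_component('Mετăl'.encode('utf-8')))     # True
```

Note that the check never needs to know which encoding produced the bytes.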
So any encoding will work, as long as those two values mean the right thing. Practically, null is always null, so that leaves the slash. So any encoding that uses b'\x2f' for the slash would work. Which seems to include, for instance, UTF-16:
In [31]: "/".encode('utf-16')
Out[31]: b'\xff\xfe/\x00'
You probably don't want the UTF-16 BOM (byte-order-mark) in the name. So you want "/".encode('utf-16le') or perhaps 'utf-16be'. But either way, that won't work, because the path contains a NUL byte.

Let's be concrete. Both of these are fine:

    >>> open('/tmp/spam', 'w')
    <_io.TextIOWrapper name='/tmp/spam' mode='w' encoding='UTF-8'>
    >>> open(b'/tmp/spam', 'w')
    <_io.TextIOWrapper name=b'/tmp/spam' mode='w' encoding='UTF-8'>

Both of those represent the same file, because:

    >>> '/tmp/spam'.encode('utf-8') == b'/tmp/spam'
    True

However, if I use UTF-16, it fails because the file name and path contain NUL bytes:

    >>> path = '/tmp/spam'.encode('utf-16be')
    >>> print(path)
    b'\x00/\x00t\x00m\x00p\x00/\x00s\x00p\x00a\x00m'
    >>> f = open(path, 'w')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: embedded null byte

Here are some fun and games with Unicode... There's no way to get the file b'/tmp/spam' from a UTF-16 string since it has an odd number of bytes, but I can get the file '/tmp/ spam' (note the leading space!) like this:

    name = ('\N{CJK UNIFIED IDEOGRAPH-742F}'
            '\N{CJK UNIFIED IDEOGRAPH-706D}'
            '\N{NARROW NO-BREAK SPACE}'
            '\N{CJK UNIFIED IDEOGRAPH-7073}'
            '\N{CJK UNIFIED IDEOGRAPH-6D61}')
    open(name.encode('utf-16le'), 'w')

That will open the file b'/tmp/ spam'.

If that doesn't make you pine for the simpler days when the whole computing world used nothing but American English and liked it, then nothing will :-)

[...]
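One way to find such a string, sketched here with a hypothetical helper `utf16le_name_for`, is to run the trick backwards and decode the target path as UTF-16-LE. Each pair of bytes becomes one code unit, which is also why a path with an odd number of bytes has no UTF-16 equivalent:

```python
def utf16le_name_for(path: bytes) -> str:
    """Return the text whose UTF-16-LE encoding is exactly `path`.

    Hypothetical helper for illustration. Raises UnicodeDecodeError
    when no such text exists, e.g. for an odd number of bytes.
    """
    return path.decode('utf-16le')


name = utf16le_name_for(b'/tmp/ spam')
print([f'U+{ord(c):04X}' for c in name])
# ['U+742F', 'U+706D', 'U+202F', 'U+7073', 'U+6D61']
print(name.encode('utf-16le') == b'/tmp/ spam')   # True

try:
    utf16le_name_for(b'/tmp/spam')   # 9 bytes: the last byte has no partner
except UnicodeDecodeError as exc:
    print('no UTF-16 string encodes to this path:', exc.reason)
```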
In practice, I suspect that every *nix system uses encoding(s) that are ASCII-compatible for the first 128 values, or more precisely that encode the slash character as a single byte with the value 0x2f. And as long as that's the case (and that value doesn't show up anywhere else) then software can know nothing about the encoding, and still do two things:
* pass it around
* split on the slash.
In practice, modern Unix shells and GUIs use UTF-8. UTF-8 has two nice properties:

* Every ASCII character encodes to a single byte, so text which only contains ASCII values encodes to precisely the same set of bytes under UTF-8 as under ASCII.

* No Unicode character, except for the Unicode NUL '\0', encodes to a sequence containing a null byte.

These properties are not an accident -- they were carefully designed that way.

In practice, the Unix OS doesn't care what encoding the user interface uses. It deals in bytes, and that's all it cares about. Only the interface cares about the encoding, because if you name a file "Mετăl" in the shell, you expect to see the same file name in the open file dialog of all your GUI applications. (And vice versa.) So all the various shells, GUIs etc. have to agree to use the same encoding or users get mad.
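Both properties can be checked by brute force. The following is only a sketch (the function name `utf8_is_slash_and_nul_safe` is made up for this example); it also checks the related point that the slash byte 0x2f never turns up inside the encoding of any other character, which is what makes splitting a UTF-8 path on b'/' safe:

```python
def utf8_is_slash_and_nul_safe(limit: int = 0x110000) -> bool:
    """Check that ASCII encodes to itself, and that only U+0000 and
    U+002F ever produce the bytes 0x00 and 0x2f in UTF-8."""
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in strict UTF-8
        encoded = chr(cp).encode('utf-8')
        if cp < 0x80 and encoded != bytes([cp]):
            return False  # ASCII must map to the identical single byte
        if cp != 0x00 and b'\x00' in encoded:
            return False  # a NUL byte inside some other character
        if cp != 0x2F and b'/' in encoded:
            return False  # a slash byte inside some other character
    return True


print(utf8_is_slash_and_nul_safe())   # True

# Splitting on the slash works without decoding anything.
path = '/tmp/Mετăl'.encode('utf-8')
print(path.split(b'/'))
```

The loop visits every code point, so it takes a moment to run; pass a smaller `limit` to check only part of the range.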
Which may be enough for various system tools. So a fine argument for facilitating their use in things like globbing directories, opening files, etc.
But as soon as you have any interaction with humans, then filenames need to be human-meaningful, and as soon as you manipulate the names beyond splitting and merging on a slash, then you do need to know something about the encoding.
Well... yes and no.

In practice, Unix users don't care any more about encodings than Windows users do. Possibly less! Windows users have to deal with legacy systems that can use any of the hundreds of pre-Unicode "extended ASCII" 8-bit systems. If Stephen Turnbull is reading this, he can probably tell you scary stories about the chaos in pre-Unicode East Asian encodings like Big5 and Shift-JIS.

In the Linux/BSD world, pretty much everyone uses UTF-8 unless you're doing something unusual, like trying to extract data from some old Russian CSV file or ancient Macintosh text file.

In the shell, I couldn't care less about the encoding. I just name files whatever I want, and the shell deals with them:

    [steve@ando ~]$ touch Mετăl
    [steve@ando ~]$ ls -l M*l
    -rw-rw-r-- 1 steve steve 0 Mar 11 20:46 Mετăl

(Alas, typing those non-ASCII characters is a PITA, I ended up having to enter them using a GUI "Character Map" application and paste them into the shell.)

In Python, I never worry about encodings. I just use regular old Unicode strings:

    open('Mετăl')

and it Just Works. I would expect that the majority of Unix users will be in the same boat.

I think that the exception will be people writing applications that have to straddle the low-level "Unix file names are bytes" and high-level "you can use anything you can type as a file name" worlds.

But bytes are useful for more than just file names! Anyone writing a binary file format needs to deal with bytes, and I'm confident that there are binary formats that have optional prefixes and suffixes that might need to be stripped before doing further processing.

    if chunk.startswith(b'DEADBEEF'):
        chunk = chunk[8:]
    process(chunk)

One possible example: you have some binary data which may or may not have a NUL byte at the end. The NUL byte is redundant in Python (we're not C) so you want to delete it:

    if data.endswith(b'\0'):
        data = data[:-1]

-- 
Steven