On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
So it's not just invalid surrogate *pairs*, it's invalid surrogates of all kinds. This means that it's theoretically possible (though I gather that it's unlikely in the extreme) for a real Windows filename to be indistinguishable from one generated by Python's surrogateescape handler.
Absolutely, if the filesystem is one of Microsoft's, such as NTFS, FAT32, exFAT, ReFS, NPFS (named pipes), or MSFS (mailslots) -- and I'm pretty sure it's also possible with CDFS and UDFS. UDF allows any Unicode character except NUL.
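To make the ambiguity concrete, here is a minimal sketch. The byte value and the variable names are only illustrative, not taken from a real directory listing. It shows that the str produced by surrogateescape for an undecodable byte is the same str as a name containing a lone low surrogate, which these filesystems would accept:

```python
# A byte that is not valid UTF-8, e.g. from a POSIX filename (illustrative value).
undecodable = b"\xff"
escaped = undecodable.decode("utf-8", errors="surrogateescape")

# A lone low surrogate, as could appear in a real name on a Microsoft filesystem.
real_windows_name = "\udcff"

# The two strings compare equal, so their origin cannot be told apart.
print(escaped == real_windows_name)
```

Once the two are equal as str objects, nothing in the string itself records which path produced it.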
What happens when Python's directory manipulation functions on Windows encounter such a filename? Do they try to write it to the disk directory? Do they succeed? Does that depend on surrogateescape?
Python allows these 'Unicode' (but not strictly UTF-compatible) strings, so it doesn't have a problem with such filenames, as long as it's calling the Windows wide-character APIs.
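As a rough sketch of what "not strictly UTF-compatible" means here (the sample name is made up for illustration): a str holding a lone surrogate is a perfectly good Python object, a strict UTF-8 encode refuses it, and the 16-bit code units that a wide-character API would receive can still be produced with surrogatepass.

```python
name = "\udc00b"  # lone low surrogate followed by "b" (illustrative name)

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The raw 16-bit code units, little-endian, as wide characters would hold them.
wide = name.encode("utf-16-le", errors="surrogatepass")
print(strict_ok, wide)
```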
Is there a reason in practice to allow surrogateescape at all on names in Windows filesystems, at least when using the *W API? You mention non-Microsoft filesystems; are they common enough to matter?
Previously I gave an example with a VirtualBox shared folder, which rejects names with invalid surrogates. I don't know how common that is in general. I typically switch between 2 guests on a Linux host and share folders between systems. In Windows I mount shared folders as directory symlinks in C:\Mount.

I just tested another example that led to different results. Ext2Fsd is a free ext2/ext3 filesystem driver for Windows. I mounted an ext2 disk in Windows 10. Next, in Python I created a file named "\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to using UTF-8 as the drive codepage, so I expected it to reject this filename, just like VBoxSF does. But it worked:

    >>> os.listdir('.')[-1]
    '\udc00b\udc00a\udc00d'

As expected the ANSI API substitutes question marks for the surrogate codes:

    >>> os.listdir(b'.')[-1]
    b'?b?a?d'

So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I mounted the disk in Linux to check:

    >>> os.listdir(b'.')[-1]
    b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

It blindly encoded the surrogate codes, creating invalid UTF-8. I think it's called WTF-8 (Wobbly Transformation Format). The file manager in Linux displays this file as "���b���a���d (invalid encoding)", and ls prints "???b???a???d". Python uses its surrogateescape error handler:

    >>> os.listdir('.')[-1]
    '\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

The original name can be decoded using the surrogatepass error handler:

    >>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
    '\udc00b\udc00a\udc00d'
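The byte-level part of this can be reproduced without mounting anything. The following is a sketch that assumes the driver's behavior is equivalent to encoding with the surrogatepass handler, which is my reading of the bytes observed on Linux rather than something I checked in the Ext2Fsd source:

```python
name = "\udc00b\udc00a\udc00d"

# Encode each lone surrogate as a 3-byte sequence (invalid as strict UTF-8).
raw = name.encode("utf-8", errors="surrogatepass")

# What Python shows on Linux: every undecodable byte becomes its own
# escape surrogate in the range U+DC80..U+DCFF.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Decoding with surrogatepass instead recovers the original name.
restored = raw.decode("utf-8", errors="surrogatepass")

print(raw)
print(ascii(escaped))
print(ascii(restored))
```

Note that the surrogateescape form still round-trips to the same bytes when encoded with surrogateescape, so the file remains accessible on Linux even though the displayed name differs from the one created on Windows.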