On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull wrote:
> BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1FFFF)?
A program can pass the filesystem a name containing one or more surrogate codes that aren't part of a valid UTF-16 surrogate pair (i.e. a leading code in the range D800-DBFF followed by a trailing code in the range DC00-DFFF). In the user-mode runtime library and kernel executive, nothing up to the filesystem driver checks for a valid UTF-16 string. Microsoft's filesystems remain compatible with UCS-2 from the 90s and don't care that the name isn't legal UTF-16. The same goes for the in-memory filesystems used for named pipes (NPFS, \\.\pipe) and mailslots (MSFS, \\.\mailslot).

But non-Microsoft filesystems don't necessarily store names as wide-character strings. They may use UTF-8, in which case an invalid UTF-16 name will cause the system call to fail because it's an invalid parameter.

If the filesystem allows creating such a badly named file or directory, it can still be accessed using a regular Unicode path, which is how things stand currently. I see that Victor has suggested using "surrogatepass" in issue 27781. That would allow seamless operation when working with bytes paths. The downside is that bytes have a higher chance of leaking out of Python than strings created by 'surrogateescape' on Unix. But since the name isn't a proper Unicode string on disk, at least nothing has changed substantively by transcoding to "surrogatepass" UTF-8.
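
To make the trade-off concrete, here is a minimal sketch of what "surrogatepass" does with a lone surrogate. It only exercises Python's standard codecs, not the change proposed in issue 27781, and the name "bad\udc80name" and the helper functions are made up for illustration.

```python
# Hypothetical file name containing a lone trailing surrogate (U+DC80).
name = "bad\udc80name"


def contains_surrogate(s):
    """Return True if s holds any code point in the range D800-DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def strict_encode_fails(s):
    """Return True if the strict UTF-8 encoder rejects s."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def strict_decode_fails(b):
    """Return True if the strict UTF-8 decoder rejects b."""
    try:
        b.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# "surrogatepass" encodes the surrogate as if it were an ordinary
# code point, so the name survives a round trip through bytes.
raw = name.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")

print(contains_surrogate(name))    # the name is not valid Unicode text
print(strict_encode_fails(name))   # strict UTF-8 refuses to encode it
print(raw)                         # the bytes that "surrogatepass" produces
print(round_tripped == name)       # the round trip is lossless
print(strict_decode_fails(raw))    # but the bytes are not valid UTF-8
```

The last check is the "leaking" concern in miniature: the round trip inside Python is lossless, but anything outside Python that decodes those bytes as strict UTF-8 will reject them.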