
On Wed, Aug 17, 2016 at 9:35 AM, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
BTW, why "surrogate pairs"? Does Windows validate surrogates to ensure they come in pairs, but not necessarily in the right order (or perhaps sometimes they resolve to non-characters such as U+1FFFF)?
A program can pass the filesystem a name containing one or more surrogate codes that aren't part of a valid UTF-16 surrogate pair (i.e. a leading code in the range D800-DBFF followed by a trailing code in the range DC00-DFFF). In the user-mode runtime library and kernel executive, nothing up to the filesystem driver checks for a valid UTF-16 string. Microsoft's filesystems remain compatible with UCS-2 from the 90s and don't care that the name isn't legal UTF-16. The same goes for the in-memory filesystems used for named pipes (NPFS, \\.\pipe) and mailslots (MSFS, \\.\mailslot).

But non-Microsoft filesystems don't necessarily store names as wide-character strings. They may use UTF-8, in which case an invalid UTF-16 name will cause the system call to fail with an invalid-parameter error.

If the filesystem allows creating such a badly named file or directory, it can still be accessed using a regular Unicode path, which is how things stand currently. I see that Victor has suggested using "surrogatepass" in issue 27781. That would allow seamless operation with bytes paths as well. The downside is that the resulting bytes have a higher chance of leaking out of Python than strings created by 'surrogateescape' on Unix. But since the name isn't a proper Unicode string on disk to begin with, transcoding to "surrogatepass" UTF-8 at least changes nothing substantive.
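
To make the trade-off concrete, here is a minimal sketch in Python 3 of what the "surrogatepass" error handler does with a lone surrogate. The name used is hypothetical, and this only illustrates the codec behaviour, not how issue 27781 would wire it into the filesystem encoding.

```python
# Hypothetical name containing a lone leading surrogate (not valid UTF-16).
name = "bad\ud800name"

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    name.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# "surrogatepass" encodes the surrogate code as an ordinary 3-byte sequence,
# and decodes it back to the same str.
raw = name.encode("utf-8", "surrogatepass")
round_trip = raw.decode("utf-8", "surrogatepass")

# The bytes are not well-formed UTF-8, so a strict decoder rejects them.
# This is the "leak" risk once the bytes leave Python.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, raw, round_trip == name, strict_decode_ok)
```

The round trip is lossless inside Python, but any consumer that decodes the bytes as strict UTF-8 will reject them.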