[Python-ideas] Fix default encodings on Windows

Thu Aug 18 03:27:35 EDT 2016

On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull
<turnbull.stephen.fw at u.tsukuba.ac.jp> wrote:
>
> So it's not just invalid surrogate *pairs*, it's invalid surrogates of
> all kinds.  This means that it's theoretically possible (though I
> gather that it's unlikely in the extreme) for a real Windows filename
> to be indistinguishable from one generated by Python's surrogateescape
> handler.

Absolutely, if the filesystem is one of Microsoft's, such as NTFS,
FAT32, exFAT, ReFS, NPFS (named pipes), or MSFS (mailslots). I'm
pretty sure it's also possible with CDFS and UDFS. UDF allows any
Unicode character except NUL.
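
To make the ambiguity concrete, here's a minimal sketch. The name
'caf\udc80' is made up for illustration. Note that surrogateescape
only produces codes in the range U+DC80 to U+DCFF, one per undecodable
byte, so only names with lone surrogates in that range can collide:

```python
# Bytes that aren't valid UTF-8, as a POSIX system might hand to Python.
raw = b'caf\x80'

# surrogateescape maps the undecodable byte 0x80 to the lone surrogate U+DC80.
escaped = raw.decode('utf-8', 'surrogateescape')

# A hypothetical name containing that same lone surrogate, which a
# Microsoft filesystem would accept as a real filename.
real_name = 'caf\udc80'

# The two strings are identical, so Python can't tell them apart.
same = escaped == real_name

# Encoding the "real" name with surrogateescape yields the raw byte again.
back = real_name.encode('utf-8', 'surrogateescape')
```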

> What happens when Python's directory manipulation functions on Windows
> encounter such a filename?  Do they try to write it to the disk
> directory?  Do they succeed?  Does that depend on surrogateescape?

Python's str type allows these 'Unicode' strings that contain lone
surrogates (and so aren't strictly UTF compatible), so it doesn't have
a problem with such filenames, as long as it's calling the Windows
wide-character APIs.
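
As a rough illustration that doesn't touch a filesystem: a str can hold
lone surrogates, and only the strict UTF encoders object. The sketch
below uses the 'utf-16-le' codec with surrogatepass as a stand-in for
the 16-bit code units that the wide-character APIs work with. It isn't
what Python does internally when it calls those APIs.

```python
# A str with lone low surrogates is a perfectly valid Python object.
name = '\udc00b\udc00a\udc00d'

# Strict UTF-8 refuses to encode lone surrogates.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogatepass, each code point becomes one 16-bit code unit,
# including the lone surrogates, and the conversion round-trips.
wide = name.encode('utf-16-le', 'surrogatepass')
roundtrip = wide.decode('utf-16-le', 'surrogatepass')
```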

> Is there a reason in practice to allow surrogateescape at all on names
> in Windows filesystems, at least when using the *W API?  You mention
> non-Microsoft filesystems; are they common enough to matter?

Previously I gave an example with a VirtualBox shared folder, which
rejects names with invalid surrogates. I don't know how common that is
in general. I typically switch between 2 guests on a Linux host and
share folders between systems. In Windows I mount shared folders as
directory symlinks in C:\Mount.

I just tested another example that led to different results. Ext2Fsd
is a free ext2/ext3 filesystem driver for Windows. I mounted an ext2
disk in Windows 10. Next, in Python I created a file named
"\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to
using UTF-8 as the drive codepage, so I expected it to reject this
filename, just like VBoxSF does. But it worked:

    >>> os.listdir('.')[-1]
    '\udc00b\udc00a\udc00d'

As expected, the ANSI API substitutes question marks for the surrogate
codes:

    >>> os.listdir(b'.')[-1]
    b'?b?a?d'
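
The substitution can be approximated without a Windows system. This
sketch assumes the ANSI codepage is cp1252 and uses the codec's
'replace' error handler as a stand-in for the ANSI API. It isn't the
actual conversion that Windows performs.

```python
name = '\udc00b\udc00a\udc00d'

# cp1252 has no mapping for surrogate codes, so 'replace' substitutes
# a question mark for each one and passes the ASCII letters through.
ansi_name = name.encode('cp1252', 'replace')
```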

So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I
mounted the disk in Linux to check:

    >>> os.listdir(b'.')[-1]
    b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'

It blindly encoded the surrogate codes, creating invalid UTF-8. I
think it's called WTF-8 (Wobbly Transformation Format). The file
manager in Linux displays this file as "���b���a���d (invalid
encoding)", and ls prints "???b???a???d". Python uses its
surrogateescape error handler:

    >>> os.listdir('.')[-1]
    '\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'

The original name can be decoded using the surrogatepass error handler:

    >>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
    '\udc00b\udc00a\udc00d'
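
The whole chain can be reproduced with the codecs alone. This is a
sketch that assumes Ext2Fsd's output is equivalent to UTF-8 with
surrogatepass, which matches the bytes I saw on disk:

```python
name = '\udc00b\udc00a\udc00d'

# What ended up on disk: the surrogate codes encoded as if they were
# ordinary code points (3 bytes each).
on_disk = name.encode('utf-8', 'surrogatepass')

# What a default POSIX Python shows: each invalid byte is escaped
# individually by surrogateescape.
as_listed = on_disk.decode('utf-8', 'surrogateescape')

# Recover the original name directly from the bytes ...
recovered = on_disk.decode('utf-8', 'surrogatepass')

# ... or from the escaped str, by undoing surrogateescape first.
recovered_from_str = (as_listed.encode('utf-8', 'surrogateescape')
                               .decode('utf-8', 'surrogatepass'))
```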