[Python-ideas] Fix default encodings on Windows
eryk sun
eryksun at gmail.com
Thu Aug 18 03:27:35 EDT 2016
On Thu, Aug 18, 2016 at 2:32 AM, Stephen J. Turnbull
<turnbull.stephen.fw at u.tsukuba.ac.jp> wrote:
>
> So it's not just invalid surrogate *pairs*, it's invalid surrogates of
> all kinds. This means that it's theoretically possible (though I
> gather that it's unlikely in the extreme) for a real Windows filename
> to be indistinguishable from one generated by Python's surrogateescape
> handler.
Absolutely, if the filesystem is one of Microsoft's, such as NTFS,
FAT32, exFAT, ReFS, NPFS (named pipes), MSFS (mailslots) -- and I'm
pretty sure it's also possible with CDFS and UDFS. UDF allows any
Unicode character except NUL.
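The ambiguity can be reproduced without touching a filesystem. This is only a minimal sketch for Python 3; the names `escaped` and `native` are made up for the illustration.

```python
# A name that arrived as undecodable bytes (e.g. from a POSIX filesystem):
# surrogateescape maps the stray byte 0x80 to the lone surrogate U+DC80.
escaped = b'\x80abc'.decode('utf-8', 'surrogateescape')

# A name that a Windows filesystem could hold natively: a str that really
# contains the lone surrogate U+DC80.
native = '\udc80abc'

# The two str objects compare equal, so the origin can't be told apart.
print(escaped == native)
```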
> What happens when Python's directory manipulation functions on Windows
> encounter such a filename? Do they try to write it to the disk
> directory? Do they succeed? Does that depend on surrogateescape?
Python allows these 'Unicode' (but not strictly UTF-compatible)
strings, so it has no problem with such filenames as long as
it's calling the Windows wide-character APIs.
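As a rough illustration of why the wide-character path copes: a str with lone surrogates has no strict UTF-8 form, but each code point still fits in one 16-bit code unit. The sketch below uses the codec machinery only to show that. It is not a description of how CPython converts str to wchar_t internally, and the variable names are hypothetical.

```python
name = '\udc00b\udc00a\udc00d'

# A strict UTF-8 encoder refuses lone surrogates.
try:
    name.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogatepass, each surrogate goes out as a single little-endian
# 16-bit code unit, which is the shape a wide-character string has.
wide = name.encode('utf-16-le', 'surrogatepass')

print(strict_ok)
print(wide)
```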
> Is there a reason in practice to allow surrogateescape at all on names
> in Windows filesystems, at least when using the *W API? You mention
> non-Microsoft filesystems; are they common enough to matter?
Previously I gave an example with a VirtualBox shared folder, which
rejects names with invalid surrogates. I don't know how common that is
in general. I typically switch between 2 guests on a Linux host and
share folders between systems. In Windows I mount shared folders as
directory symlinks in C:\Mount.
I just tested another example that led to different results. Ext2Fsd
is a free ext2/ext3 filesystem driver for Windows. I mounted an ext2
disk in Windows 10. Next, in Python I created a file named
"\udc00b\udc00a\udc00d" in the root directory. Ext2Fsd defaults to
using UTF-8 as the drive codepage, so I expected it to reject this
filename, just like VBoxSF does. But it worked:
    >>> os.listdir('.')[-1]
    '\udc00b\udc00a\udc00d'
As expected, the ANSI API substitutes question marks for the surrogate codes:
    >>> os.listdir(b'.')[-1]
    b'?b?a?d'
So what did Ext2Fsd write in this supposedly UTF-8 filesystem? I
mounted the disk in Linux to check:
    >>> os.listdir(b'.')[-1]
    b'\xed\xb0\x80b\xed\xb0\x80a\xed\xb0\x80d'
It blindly encoded the surrogate codes, creating invalid UTF-8. I
think it's called WTF-8 (Wobbly Transformation Format). The file
manager in Linux displays this file as "���b���a���d (invalid
encoding)", and ls prints "???b???a???d". Python uses its
surrogateescape error handler:
    >>> os.listdir('.')[-1]
    '\udced\udcb0\udc80b\udced\udcb0\udc80a\udced\udcb0\udc80d'
The original name can be decoded using the surrogatepass error handler:
    >>> os.listdir(b'.')[-1].decode(errors='surrogatepass')
    '\udc00b\udc00a\udc00d'
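The whole round trip can also be reproduced with the codecs alone, with no mounted disk. This is a sketch under the assumption that the bytes Ext2Fsd stored match what the surrogatepass handler produces (which the listing above suggests); the variable names are illustrative.

```python
original = '\udc00b\udc00a\udc00d'

# Encode the lone surrogates directly, as Ext2Fsd appears to have done.
on_disk = original.encode('utf-8', 'surrogatepass')

# What a str listing looks like on Linux: every invalid byte is
# escaped to a lone surrogate in the range U+DC80..U+DCFF.
escaped = on_disk.decode('utf-8', 'surrogateescape')

# Undo the escaping to get the raw bytes back, then decode them with
# surrogatepass to recover the name as it was created on Windows.
recovered = escaped.encode('utf-8', 'surrogateescape').decode(
    'utf-8', 'surrogatepass')

print(on_disk)
print(ascii(escaped))
print(recovered == original)
```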