[Python-ideas] Fix default encodings on Windows
Steve Dower
steve.dower at python.org
Wed Aug 10 15:39:19 EDT 2016
On 10Aug2016 1226, Random832 wrote:
> On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
>> Testing with obscure filenames and strings is where help will be needed
>> most :)
>
> How about filenames with invalid surrogates? For added fun, consider
> that the file system encoding is normally used with surrogateescape.

This is where it gets extra fun: surrogateescape is not normally used
on Windows, because we receive paths as Unicode text and pass them back
as Unicode text without ever encoding or decoding them.

Currently a broken filename (such as '\udee1.txt') can be correctly seen
with os.listdir('.') but not os.listdir(b'.') (because Windows will
return it as '?.txt'). It can be passed to open(), but encoding the name
to utf-8 or utf-16 fails, and I doubt there's any encoding that is going
to succeed.
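
A minimal sketch of the encoding failure described above, runnable on any platform since it never touches the file system. The helper name try_encode is made up for illustration; the codecs and error handlers are standard ones. Note that even surrogateescape does not help here, because that handler only maps back code points in the range U+DC80..U+DCFF, and U+DEE1 is outside it.

```python
# The broken filename from the message: a lone low surrogate plus ".txt".
name = "\udee1.txt"


def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or None if the codec rejects the text.

    Hypothetical helper, used only to collect results without crashing.
    """
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None


# Strict UTF-8 and UTF-16 both refuse a lone surrogate.
strict_results = {
    enc: try_encode(name, enc) for enc in ("utf-8", "utf-16")
}

# surrogateescape only handles U+DC80..U+DCFF, so it fails here as well.
escaped = try_encode(name, "utf-8", "surrogateescape")

print(strict_results, escaped)
```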

As far as I can tell, if you get a weird name in bytes today you are
broken, and there is no way to be unbroken without doing the actual
right thing and converting paths on POSIX into Unicode with
surrogateescape. So our official advice has to stay the same - treating
paths as text with smuggled bytes is the *only* way to be truly correct.
But unless we also deprecate byte paths on POSIX, we'll never get there.
(Now there's a dangerous idea ;) )
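
For reference, a small sketch of what "text with smuggled bytes" looks like on the POSIX side. The filename is a made-up example, and UTF-8 is assumed as the file system encoding; on a real system os.fsdecode() and os.fsencode() apply the actual file system encoding for you.

```python
# Bytes as they might come back from a POSIX directory listing. The 0xE9
# byte is Latin-1 "é" and is not valid UTF-8. (Hypothetical filename.)
raw_name = b"caf\xe9.txt"

# Decode to text, smuggling the undecodable byte as a lone surrogate.
# UTF-8 is assumed here as the file system encoding.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Ordinary str operations work on the smuggled name.
backup_name = text_name + ".bak"

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", "surrogateescape")
backup_bytes = backup_name.encode("utf-8", "surrogateescape")

print(ascii(text_name), round_tripped, backup_bytes)
```

The point of the round trip is that the program never has to know what the odd byte meant; it only has to hand the same byte back to the OS.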

Cheers,
Steve