
On 10Aug2016 1226, Random832 wrote:
> On Wed, Aug 10, 2016, at 15:08, Steve Dower wrote:
>> Testing with obscure filenames and strings is where help will be needed most :)
>
> How about filenames with invalid surrogates? For added fun, consider that the file system encoding is normally used with surrogateescape.
This is where it gets extra fun: surrogateescape is not normally used on Windows, because we receive paths as Unicode text and pass them back as Unicode text without ever encoding or decoding them.
Currently a broken filename (such as '\udee1.txt') can be correctly seen with os.listdir('.') but not os.listdir(b'.') (because Windows will return it as '?.txt'). It can be passed to open(), but encoding the name to utf-8 or utf-16 fails, and I doubt there's any encoding that is going to succeed.
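A quick way to see the encoding failure for yourself is the sketch below. It only exercises str.encode() on the example name and does not touch the file system; the helper can_encode() is a made-up name for illustration. Note that surrogateescape does not help here either, since it only round-trips code points in the range U+DC80 to U+DCFF and '\udee1' falls outside that range.

```python
# The broken name from the example above: a lone low surrogate plus ".txt".
name = "\udee1.txt"


def can_encode(text, encoding, errors="strict"):
    """Hypothetical helper: True if text encodes without UnicodeEncodeError."""
    try:
        text.encode(encoding, errors)
    except UnicodeEncodeError:
        return False
    return True


attempts = [
    ("utf-8", "strict"),
    ("utf-16", "strict"),
    # surrogateescape only maps U+DC80..U+DCFF back to bytes 0x80..0xFF
    ("utf-8", "surrogateescape"),
]

for encoding, errors in attempts:
    print(encoding, errors, "ok" if can_encode(name, encoding, errors) else "fails")
```

All three attempts should report a failure on Python 3.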
As far as I can tell, if you get a weird name in bytes today you are broken, and there is no way to be unbroken without doing the actual right thing and converting paths on POSIX into Unicode with surrogateescape. So our official advice has to stay the same - treating paths as text with smuggled bytes is the *only* way to be truly correct. But unless we also deprecate byte paths on POSIX, we'll never get there. (Now there's a dangerous idea ;) )
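For anyone who has not seen the "text with smuggled bytes" approach in action, here is a minimal sketch. It assumes a POSIX-style setup where the file system encoding is UTF-8 and hard-codes that encoding so it runs anywhere; in real code os.fsdecode() and os.fsencode() apply the platform's own encoding and error handler. The byte string is an invented example of a name that is not valid UTF-8.

```python
# A file name as raw bytes that is NOT valid UTF-8 (0xE9 is Latin-1 for e-acute).
raw_name = b"caf\xe9.txt"

# Assumption: the file system encoding is UTF-8, as on most modern POSIX systems.
FS_ENCODING = "utf-8"

# Decoding with surrogateescape never fails: each undecodable byte 0xNN
# is smuggled into the string as the lone surrogate U+DCNN.
text_name = raw_name.decode(FS_ENCODING, "surrogateescape")

# Encoding with the same handler turns the smuggled surrogates back into
# the original bytes, so the round trip is lossless.
round_tripped = text_name.encode(FS_ENCODING, "surrogateescape")

print(ascii(text_name))
print(round_tripped == raw_name)
```

The text form can be manipulated like any other str (joined, split, compared) and still converts back to exactly the bytes the OS handed over, which is the property the byte-path APIs cannot offer on Windows.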
Cheers, Steve