Re: [Python-Dev] Windows: Remove support of bytes filenames in the os module?

10 Feb 2016

      Steve Dower writes:
...
On 09Feb2016 1801, Andrew Barnert wrote:
...
On Feb 9, 2016, at 17:37, Steve Dower <python@stevedower.id.au> wrote:
...
Could we perhaps redefine bytes paths on Windows as utf8 and use
Unicode everywhere internally?
When you receive bytes from argv, stdin, a text file, a GUI, a named
pipe, etc., and then use them as a path, Python treating them as UTF-8
would break everything.
Sure, but that's already broken today if you're communicating bytes via 
some protocol without manually managing the encoding, in which case you 
should be decoding it (and potentially re-encoding to 
sys.getfilesystemencoding()).
The problem is that treating them as UTF-8 in Python will raise errors
on any file name that isn't valid UTF-8, or corrupt the filename if
you use one of the handlers available in Python 2.
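
A minimal sketch of that failure mode, assuming Python 3 and a
hypothetical Shift JIS file name (the name and the two helper functions
are illustrative only, not anything from the os module):

```python
# Hypothetical file name: three kanji plus ".txt", encoded as Shift JIS,
# as it might sit on a legacy volume. The escapes spell the kanji.
raw_name = "\u65e5\u672c\u8a9e.txt".encode("shift_jis")


def decode_strict(name):
    """Return the UTF-8 decoding of name, or None if it is not valid UTF-8."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return None


def survives_roundtrip(name, errors):
    """True if decoding as UTF-8 with the given error handler and then
    re-encoding gives back the original bytes."""
    return name.decode("utf-8", errors).encode("utf-8") == name


print(decode_strict(raw_name))                  # strict decoding fails
print(survives_roundtrip(raw_name, "replace"))  # lossy: bytes are not recovered
print(survives_roundtrip(raw_name, "ignore"))   # lossy: bytes are not recovered
```

With the strict handler the name cannot be decoded at all; with
"replace" or "ignore" it decodes, but the result no longer names the
same file.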

If you use Latin-1, that (1) handles all 256 bytes, and (2) roundtrips
to Unicode.  Although semantically useless to users, if it's just read
from a directory, then used to manipulate file contents, no problem.
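
A small sketch of the Latin-1 property, again assuming Python 3; the
helper name and the sample file name are made up for illustration:

```python
def latin1_roundtrip(name):
    """Decode bytes as Latin-1 and encode them back.

    Returns (text, encoded_again). Latin-1 maps each of the 256 byte
    values to the code point with the same number, so this never raises.
    """
    text = name.decode("latin-1")
    return text, text.encode("latin-1")


# Every possible byte value survives the trip.
every_byte = bytes(range(256))
text, back = latin1_roundtrip(every_byte)
print(back == every_byte)

# A hypothetical Shift JIS file name also survives, although the
# intermediate str is mojibake rather than the original kanji.
original = "\u65e5\u672c\u8a9e.txt"
sjis_name = original.encode("shift_jis")
shown, restored = latin1_roundtrip(sjis_name)
print(restored == sjis_name, shown == original)
```

The str is only an opaque handle here: fine for passing back to the
OS, useless for showing to a user.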

Of course if you then edit a multibyte file name as Unicode it is
likely that all hell will break loose.  But you can take that sentence
and s/Unicode/bytes/, too. :-/
...
The problem here is the protocol that Python uses to return bytes paths, 
and that protocol is inconsistent between APIs and information is lost.
No, the problem is that the necessary information simply isn't always
available.  Not even today: think removable media, especially archival
content.  Also network file systems: I don't know if it still happens,
but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory,
and sometimes two of those in the *same path*.  (Don't ask me how
non-malicious users managed to do the latter!)
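
To illustrate why the bytes alone don't carry the needed information,
here is a rough sketch, assuming Python 3. The sample name and the
helper are hypothetical; the codec names are the ones Python ships.

```python
def decodings(name, codecs):
    """Try each codec on the same bytes.

    Returns {codec: decoded str, or None if that codec rejects the bytes}.
    """
    results = {}
    for codec in codecs:
        try:
            results[codec] = name.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


# A hypothetical Russian file name stored as KOI8-R bytes.
intended = "\u0444\u0430\u0439\u043b"  # the Russian word for "file"
name = intended.encode("koi8_r")

results = decodings(name, ["koi8_r", "shift_jis", "utf-8"])
for codec, text in results.items():
    print(codec, repr(text))
```

Only the KOI8-R decoding gives the intended name. Nothing in the byte
string says which codec that is, and a codec that rejects the bytes
only rules itself out; one that accepts them proves nothing.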
...
It really requires going through all the OS calls and either (a) making 
them consistently decode bytes to str using the declared FS encoding 
(currently 'mbcs', but I see no reason we can't make it 'utf_8'),
If it were that easy, it would have been done two decades ago.  I'm no
fan of Windows[1], but it's obvious that Microsoft has devoted
enormous amounts of brainpower to the problem of encoding
rationalization since the early 90s.  I don't think they would have
missed this idea.

Footnotes: 
[1]  Its complete inability to DTRT for mixed English and Japanese was
what drove me to Unix-like OSes in the early 90s.

Stephen J. Turnbull