Steve Dower writes:
On 09Feb2016 1801, Andrew Barnert wrote:
On Feb 9, 2016, at 17:37, Steve Dower <python@stevedower.id.au> wrote:
Could we perhaps redefine bytes paths on Windows as utf8 and use Unicode everywhere internally?
When you receive bytes from argv, stdin, a text file, a GUI, a named pipe, etc., and then use them as a path, Python treating them as UTF-8 would break everything.
Sure, but that's already broken today if you're communicating bytes via some protocol without manually managing the encoding, in which case you should be decoding it (and potentially re-encoding to sys.getfilesystemencoding()).
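For concreteness, the decode-then-re-encode step described above might look like the sketch below. The helper name `recode_path` and its parameters are hypothetical, and the sketch assumes the sender's encoding is actually known, which is the part nobody in this thread can take for granted.

```python
import sys

def recode_path(wire_bytes, wire_encoding, fs_encoding=None):
    """Decode bytes received over some protocol, then re-encode for the FS.

    Hypothetical helper: wire_encoding must be known out of band.
    """
    if fs_encoding is None:
        # Default to whatever Python reports for this platform.
        fs_encoding = sys.getfilesystemencoding()
    text = wire_bytes.decode(wire_encoding)   # bytes -> str, strict
    return text.encode(fs_encoding)           # str -> bytes for the OS call

# Example: a Shift JIS name arriving over the wire, stored on a UTF-8 file system.
wire = "日本語.txt".encode("shift_jis")
fs_bytes = recode_path(wire, "shift_jis", "utf-8")
```

Both steps are strict, so a wrong guess about `wire_encoding` either raises or silently produces a different name.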
The problem is that treating them as UTF-8 in Python will raise errors on any file name that isn't valid UTF-8, or corrupt the file name if you use one of the error handlers available in Python 2. If you use Latin-1, that (1) handles all 256 bytes, and (2) roundtrips to Unicode. Although the result is semantically useless to users, if the name is just read from a directory and then used to manipulate file contents, there's no problem. Of course if you then edit a multibyte file name as Unicode it is likely that all hell will break loose. But you can take that sentence and s/Unicode/bytes/, too. :-/
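A small illustration of the difference, using a made-up file name (`raw_name` and the helper `try_utf8` are just names for this sketch): strict UTF-8 decoding refuses the bytes, a lossy error handler changes them, and Latin-1 hands back exactly what went in.

```python
# "café.txt" as written by a Latin-1 system: 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

def try_utf8(name):
    """Return the decoded name, or None if the bytes are not valid UTF-8."""
    try:
        return name.decode("utf-8")
    except UnicodeDecodeError:
        return None

utf8_result = try_utf8(raw_name)                        # strict decode fails
replaced = raw_name.decode("utf-8", errors="replace")   # lossy: U+FFFD inserted
corrupted = replaced.encode("utf-8")                    # no longer the same name

latin1_name = raw_name.decode("latin-1")                # always succeeds
roundtrip = latin1_name.encode("latin-1")               # identical bytes back

# Latin-1 maps every one of the 256 byte values to a code point and back.
all_bytes = bytes(range(256))
all_roundtrip = all_bytes.decode("latin-1").encode("latin-1")
```

The Latin-1 string is only an opaque handle: it displays correctly here by coincidence, and for a genuinely multibyte name it would be mojibake.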
The problem here is the protocol that Python uses to return bytes paths: that protocol is inconsistent between APIs, and information is lost.
No, the problem is that the necessary information simply isn't always available. Not even today: think removable media, especially archival content. Also network file systems: I don't know if it still happens, but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory, and sometimes two of those in the *same path*. (Don't ask me how non-malicious users managed to do the latter!)
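To make the mixed-encoding case concrete, here is a sketch with an invented two-component path (the components, and the names `intended`, `codecs_used`, and `decode_or_none`, are all made up for illustration). No single codec applied to the whole path recovers it; you need per-component knowledge that the file system doesn't record.

```python
def decode_or_none(data, codec):
    """Return data decoded with codec, or None if decoding fails."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

# What the users meant, and the encoding each of them happened to use.
intended = ["日本", "привет"]
codecs_used = ["shift_jis", "koi8_r"]

# The path as it sits on disk: one byte string, two encodings.
raw_path = b"/".join(
    part.encode(codec) for part, codec in zip(intended, codecs_used)
)

# Decoding the whole path with either codec: an error or mojibake.
whole = {codec: decode_or_none(raw_path, codec) for codec in codecs_used}

# Decoding each component with its own codec works, but only because
# this sketch already knows which codec goes with which component.
per_component = [
    comp.decode(codec)
    for comp, codec in zip(raw_path.split(b"/"), codecs_used)
]
```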
It really requires going through all the OS calls and either (a) making them consistently decode bytes to str using the declared FS encoding (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
If it were that easy, it would have been done two decades ago. I'm no fan of Windows[1], but it's obvious that Microsoft has devoted enormous amounts of brainpower to the problem of encoding rationalization since the early 90s. I don't think they would have missed this idea.

Footnotes:
[1] Its complete inability to DTRT for mixed English and Japanese was what drove me to Unix-like OSes in the early 90s.