[Python-Dev] Windows: Remove support of bytes filenames in the os module?

Stephen J. Turnbull stephen at xemacs.org
Tue Feb 9 23:17:48 EST 2016


Steve Dower writes:
 > On 09Feb2016 1801, Andrew Barnert wrote:
 > > On Feb 9, 2016, at 17:37, Steve Dower <python at stevedower.id.au> wrote:
 > >
 > >> Could we perhaps redefine bytes paths on Windows as utf8 and use
 > >> Unicode everywhere internally?
 > >
 > > When you receive bytes from argv, stdin, a text file, a GUI, a named
 > > pipe, etc., and then use them as a path, Python treating them as UTF-8
 > > would break everything.
 > 
 > Sure, but that's already broken today if you're communicating bytes via 
 > some protocol without manually managing the encoding, in which case you 
 > should be decoding it (and potentially re-encoding to 
 > sys.getfilesystemencoding()).

The problem is that treating them as UTF-8 in Python will raise errors
on any file name that isn't valid UTF-8, or corrupt the file name if
you use one of the error handlers available in Python 2.

If you use Latin-1, that (1) handles all 256 bytes, and (2) roundtrips
to Unicode.  The result is semantically useless to users, but if the
name is just read from a directory and then used to manipulate file
contents, that's no problem.
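
To make the contrast concrete, here is a minimal sketch.  The file name
is hypothetical (a name stored as Latin-1 bytes), and "replace" stands
in for the kind of lossy error handler mentioned above.

```python
# Hypothetical file name whose bytes are Latin-1, not UTF-8.
raw_name = b"caf\xe9.txt"

# Strict UTF-8 decoding rejects it outright.
try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A lossy handler "works", but the original bytes cannot be recovered.
lossy = raw_name.decode("utf-8", "replace").encode("utf-8")

# Latin-1 maps every byte value to one code point and back again.
all_bytes = bytes(range(256))
round_trip_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

print(utf8_ok, lossy == raw_name, round_trip_ok)
```

The Latin-1 text is mojibake as far as a human reader is concerned, but
it is a faithful handle on the underlying bytes.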

Of course if you then edit a multibyte file name as Unicode it is
likely that all hell will break loose.  But you can take that sentence
and s/Unicode/bytes/, too. :-/
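
A small sketch of how that goes wrong, assuming a hypothetical name
that really is UTF-8 but was decoded as Latin-1 for safekeeping.  The
"edit" is just deleting the first character, and the helper
is_valid_utf8 is an illustrative name, not a library function.

```python
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical name: three CJK characters, three bytes each in UTF-8.
original = "日本語.txt".encode("utf-8")

# Decoded as Latin-1, every byte becomes its own "character".
as_latin1 = original.decode("latin-1")

# Deleting the first "character" removes one byte of a 3-byte sequence.
edited_text = as_latin1[1:].encode("latin-1")

# The same edit done directly on the bytes breaks it in the same way.
edited_bytes = original[1:]

print(is_valid_utf8(original), is_valid_utf8(edited_text),
      is_valid_utf8(edited_bytes))
```

Either way the editor is operating on bytes without knowing where the
character boundaries are.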

 > The problem here is the protocol that Python uses to return bytes paths, 
 > and that protocol is inconsistent between APIs and information is lost.

No, the problem is that the necessary information simply isn't always
available.  Not even today: think removable media, especially archival
content.  Also network file systems: I don't know if it still happens,
but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory,
and sometimes two of those in the *same path*.  (Don't ask me how
non-malicious users managed to do the latter!)
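
As an illustration of why no single declared encoding can be right for
such a path, here is a sketch with made-up components: a directory name
in Shift JIS and a file name in KOI8-R.  The strings and the helper
decodes_correctly are assumptions for the example only.

```python
# Hypothetical path whose two components use different encodings.
dir_part = "日本語".encode("shift_jis")
file_part = "привет".encode("koi8_r")
raw_path = dir_part + b"/" + file_part
intended = "日本語/привет"

def decodes_correctly(codec):
    """True only if decoding the whole path yields the intended text."""
    try:
        return raw_path.decode(codec) == intended
    except UnicodeDecodeError:
        return False

# No single codec recovers the intended text for the whole path.
results = {c: decodes_correctly(c) for c in ("shift_jis", "koi8_r", "utf-8")}

# Only per-component knowledge of the encoding gets it right, and that
# knowledge is exactly what the file system does not record.
per_component = (dir_part.decode("shift_jis") + "/" +
                 file_part.decode("koi8_r"))

print(results, per_component == intended)
```

Note that a wrong codec does not always raise; it may quietly produce
different text, which is arguably worse than an error.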

 > It really requires going through all the OS calls and either (a) making 
 > them consistently decode bytes to str using the declared FS encoding 
 > (currently 'mbcs', but I see no reason we can't make it 'utf_8'),

If it were that easy, it would have been done two decades ago.  I'm no
fan of Windows[1], but it's obvious that Microsoft has devoted
enormous amounts of brainpower to the problem of encoding
rationalization since the early 90s.  I don't think they would have
missed this idea.



Footnotes: 
[1]  Its complete inability to DTRT for mixed English and Japanese was
what drove me to Unix-like OSes in the early 90s.


