[Python-Dev] Windows: Remove support of bytes filenames in the os module?
Stephen J. Turnbull
stephen at xemacs.org
Tue Feb 9 23:17:48 EST 2016
Steve Dower writes:
> On 09Feb2016 1801, Andrew Barnert wrote:
> > On Feb 9, 2016, at 17:37, Steve Dower <python at stevedower.id.au> wrote:
> >
> >> Could we perhaps redefine bytes paths on Windows as utf8 and use
> >> Unicode everywhere internally?
> >
> > When you receive bytes from argv, stdin, a text file, a GUI, a named
> > pipe, etc., and then use them as a path, Python treating them as UTF-8
> > would break everything.
>
> Sure, but that's already broken today if you're communicating bytes via
> some protocol without manually managing the encoding, in which case you
> should be decoding it (and potentially re-encoding to
> sys.getfilesystemencoding()).
The problem is that treating them as UTF-8 in Python will raise errors
on any file name that isn't valid UTF-8, or corrupt the file name if
you use one of the error handlers available in Python 2.
If you use Latin-1, that (1) handles all 256 byte values, and (2)
round-trips through Unicode. Although the result is semantically useless
to users, if the name is just read from a directory and then used to
manipulate file contents, no problem.
Of course if you then edit a multibyte file name as Unicode it is
likely that all hell will break loose. But you can take that sentence
and s/Unicode/bytes/, too. :-/
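To make the difference concrete, here is a minimal sketch in Python 3.
The file name is hypothetical (a Japanese name encoded as Shift JIS),
and 'replace' merely stands in for "one of the handlers"; nothing here
is what the os module does internally.

```python
# Hypothetical file name bytes: a Japanese name encoded as Shift JIS.
# (Escapes are used so the source itself stays ASCII.)
raw = "\u65e5\u672c\u8a9e.txt".encode("shift_jis")

# Strict UTF-8 decoding raises, because these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A lossy handler such as 'replace' avoids the error but corrupts the
# name: re-encoding does not give back the original bytes.
lossy = raw.decode("utf-8", errors="replace").encode("utf-8")

# Latin-1 maps every byte value 0..255 to the code point with the same
# number, so decoding never fails and encoding restores the bytes.
as_text = raw.decode("latin-1")
roundtrip = as_text.encode("latin-1")

print(utf8_ok, lossy == raw, roundtrip == raw)
```

The Latin-1 string is mojibake as far as a human reader is concerned,
but it is a faithful carrier for the original bytes.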
> The problem here is the protocol that Python uses to return bytes paths,
> and that protocol is inconsistent between APIs and information is lost.
No, the problem is that the necessary information simply isn't always
available. Not even today: think removable media, especially archival
content. Also network file systems: I don't know if it still happens,
but I've seen Shift JIS, GB2312, and KOI8-R all in the same directory,
and sometimes two of those in the *same path*. (Don't ask me how
non-malicious users managed to do the latter!)
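A rough sketch of why the mixed case is hopeless without outside
information, in Python 3. The path, its component names, and the helper
try_decode are all made up for illustration; the per-component codecs
are supplied by hand, which is exactly the information a real program
usually doesn't have.

```python
# Hypothetical path: first component in Shift JIS, second in KOI8-R.
intended = ["\u65e5\u672c\u8a9e", "\u0444\u0430\u0439\u043b"]
path = b"/".join([
    intended[0].encode("shift_jis"),
    intended[1].encode("koi8_r"),
])

def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

# Decoding the whole path with a single codec either fails or silently
# yields the wrong text for at least one component.
whole_sjis = try_decode(path, "shift_jis")
whole_koi8 = try_decode(path, "koi8_r")

# Only with out-of-band knowledge of each component's encoding can the
# intended names be recovered.
known_codecs = ["shift_jis", "koi8_r"]  # assumption: supplied by a human
decoded = [
    part.decode(codec)
    for part, codec in zip(path.split(b"/"), known_codecs)
]

print(decoded == intended)
```

Nothing in the bytes themselves says which codec goes with which
component, so no choice of a single "declared FS encoding" can get this
right.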
> It really requires going through all the OS calls and either (a) making
> them consistently decode bytes to str using the declared FS encoding
> (currently 'mbcs', but I see no reason we can't make it 'utf_8'),
If it were that easy, it would have been done two decades ago. I'm no
fan of Windows[1], but it's obvious that Microsoft has devoted
enormous amounts of brainpower to the problem of encoding
rationalization since the early 90s. I don't think they would have
missed this idea.
Footnotes:
[1] Its complete inability to DTRT for mixed English and Japanese was
what drove me to Unix-like OSes in the early 90s.