[Python-Dev] Byte filenames in the posix module on Windows
victor.stinner at haypocalc.com
Wed Jun 8 00:23:20 CEST 2011
Last November, we "decided" (right?) to deprecate bytes filenames in the posix
module on Windows in Python 3.2 and drop the support in 3.3: see the "Removal of
Win32 ANSI API" thread on python-dev. Python 3.2 has been released, so we
should shift the version numbers.
I would like to take care of this. I propose three steps:
1) Remove the bytes implementation of each function (when the code is not
shared with other OSes), decode bytes from MBCS and reuse the Unicode code.
2) Deprecate the bytes code in Python 3.3
3) Drop bytes support in Python 3.4
(I'm only talking about the posix module on Windows)
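As a rough Python illustration of steps 1 and 2 (the real change would be in the C code of posixmodule.c), the bytes path can be reduced to a thin wrapper that decodes and then calls the Unicode implementation. The names `_stat_unicode`, `stat`, `warn` and `ANSI_ENCODING` are hypothetical, and cp1252 is only a stand-in for the ANSI code page (on Windows the codec would be "mbcs").

```python
import warnings

# Assumption: stand-in for the ANSI code page; on Windows this would be "mbcs".
ANSI_ENCODING = "cp1252"


def _stat_unicode(path):
    """Hypothetical Unicode-only implementation (placeholder for the real call)."""
    if not isinstance(path, str):
        raise TypeError("expected str")
    return ("stat", path)


def stat(path, warn=False):
    """Accept str or bytes, but keep only one real implementation."""
    if isinstance(path, bytes):
        if warn:
            # Step 2: announce the deprecation before dropping support.
            warnings.warn(
                "bytes filenames are deprecated on Windows",
                DeprecationWarning,
                stacklevel=2,
            )
        # Step 1: decode from the ANSI code page and reuse the Unicode code.
        path = path.decode(ANSI_ENCODING)
    return _stat_unicode(path)
```

With this shape, a bug fix only has to be applied once, in the Unicode function.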
The first step should not change anything for the user, but it will remove a
lot of duplicated code. I expect it to remove about half of the code
specific to Windows in the posix module.
If we decide to keep bytes filenames on Windows, we can stop before the second
step.
One important point is the choice of the error handler: I would like to mimic
the ANSI API and so I will use MultiByteToWideChar() with flags=0 (i.e. the MBCS
codec with the ignore error handler, but see also issue #12281!). The MBCS
codec uses the ANSI code page, which can be a multibyte encoding, like ShiftJIS
(cp932 with a Japanese locale).
os.fsdecode(), PyUnicode_DecodeFSDefault() and PyUnicode_FSDecoder() use the
strict error handler to decode filenames on Windows. We may also use strict in
the posix module. I'm +0 for this because it warns the developer (and user?)
that he/she is doing something really bad.
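The difference between the two error handlers can be sketched in pure Python. This is only an approximation: it uses Python's cp932 codec as a stand-in for the ANSI code page, and Python's "ignore" handler is not guaranteed to match MultiByteToWideChar() with flags=0 byte for byte. The helper names are hypothetical.

```python
# Assumption: cp932 stands in for the ANSI code page of a Japanese locale.
CODE_PAGE = "cp932"

# A valid multibyte filename and one that ends with a dangling lead byte.
valid = "テスト".encode(CODE_PAGE)
truncated = b"abc\x81"  # 0x81 is a cp932 lead byte with no trail byte


def decode_ignore(raw):
    """Lenient decoding: undecodable bytes are silently dropped."""
    return raw.decode(CODE_PAGE, "ignore")


def decode_strict(raw):
    """Strict decoding: undecodable bytes raise UnicodeDecodeError."""
    return raw.decode(CODE_PAGE, "strict")


if __name__ == "__main__":
    print(decode_ignore(valid))
    print(repr(decode_ignore(truncated)))
    try:
        decode_strict(truncated)
    except UnicodeDecodeError as exc:
        print("strict failed:", exc.reason)
```

With "ignore", the caller silently gets a different filename than the one passed in; with "strict", the problem is reported at the point of the call.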
I would like to simplify posixmodule.c because I saw that it is difficult to
patch it to fix bugs (you have to patch two functions, or more, for each fix). A
recent example is the os.stat() symlink issue on Windows:
Since Windows 2000, filenames are stored as Unicode internally (e.g. VFAT and
NTFS use UTF-16); the ANSI API was kept for backward compatibility (for lazy
developers!). If you use bytes for filenames on Windows, you may get encode
errors because the ANSI code page covers only a small subset of Unicode.
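A small sketch of the encoding problem, assuming cp1252 and cp932 as two possible ANSI code pages (the helper `encodable` is hypothetical and only checks whether a name survives the conversion to bytes):

```python
def encodable(name, code_page):
    """Return True if the filename can be represented in the given code page."""
    try:
        name.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # An ASCII name fits in any code page; a Japanese name fits in cp932
    # but cannot be expressed as bytes in a Western European code page.
    for name in ("report.txt", "テスト.txt"):
        for code_page in ("cp1252", "cp932"):
            print(name, code_page, encodable(name, code_page))
```

A file whose name is not encodable in the current code page cannot be addressed through a bytes filename at all, while the Unicode API handles it without trouble.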
For your information, I have a last -huge- pending patch to only use Unicode
in the import machinery, issue #11619 ;-) Using this patch, you can use
characters not encodable to the ANSI code page in your module name/path, yeah!
But I am not completely convinced that we need this patch...