[Python-Dev] Byte filenames in the posix module on Windows

Wed Jun 8 00:23:20 CEST 2011

Hi,

Last November, we "decided" (right?) to deprecate bytes filenames in the posix 
module on Windows in Python 3.2 and drop the support in 3.3: see the "Removal 
of Win32 ANSI API" thread on python-dev. Python 3.2 has been released, so we 
should shift the version numbers.

I would like to take care of this. I propose three steps:

1) Remove the bytes implementation of each function (when the code is not 
shared with other OSes), decode bytes from MBCS and reuse the Unicode code.

2) Deprecate the bytes code in Python 3.3

3) Drop bytes support in Python 3.4

(I'm only talking about the posix module on Windows)

The first step should not change anything for the user, but it will remove a 
lot of duplicated code. I expect it to remove about half of the code specific 
to Windows in the posix module.

If we decide to keep bytes filenames on Windows, we can stop before the second 
step.
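
For illustration only, steps 1 and 2 could look roughly like this at the 
Python level (the real change would be in C, in posixmodule.c). The names 
decode_filename(), file_exists() and ANSI_ENCODING are hypothetical, and 
"cp932" is just a stand-in so the sketch runs on any OS; on Windows the codec 
would be "mbcs".

```python
import os
import warnings

# Assumption: stand-in for the ANSI code page. On Windows this would be "mbcs".
ANSI_ENCODING = "cp932"


def decode_filename(path, encoding=ANSI_ENCODING, errors="strict"):
    """Return path as str.

    Step 1: bytes are decoded so that only the Unicode code path is needed.
    Step 2: bytes callers get a DeprecationWarning.
    """
    if isinstance(path, bytes):
        warnings.warn(
            "bytes filenames are deprecated on Windows; use str instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return path.decode(encoding, errors)
    return path


def file_exists(path):
    """Hypothetical wrapper: one str implementation serves str and bytes callers."""
    return os.path.exists(decode_filename(path))
```

With this shape, a bug fix only has to be made once, in the str 
implementation, and step 3 amounts to replacing the warning with a TypeError.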

--

One important point is the choice of the error handler: I would like to mimic 
the ANSI API and so I will use MultiByteToWideChar() with flags=0 (i.e. the 
MBCS codec with the ignore error handler, but see also issue #12281!). The 
MBCS codec uses the ANSI code page, which can be a multibyte encoding such as 
ShiftJIS (cp932 with a Japanese locale).

os.fsdecode(), PyUnicode_DecodeFSDefault() and PyUnicode_FSDecoder() use the 
strict error handler to decode filenames on Windows. We may also use strict in 
the posix module. I'm +0 for this because it warns the developer (and user?) 
that they are doing something really bad.
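
To make the difference between the two error handlers concrete, here is a 
small sketch. It uses Python's cp932 codec as a stand-in for the MBCS codec 
(which only exists on Windows), and the byte string is a made-up filename 
ending in a truncated multibyte sequence. Python's "ignore" handler only 
approximates what MultiByteToWideChar() does with flags=0.

```python
# Hypothetical bytes filename: "abc" followed by a lone cp932 lead byte.
raw = b"abc\x81"

# ignore: the undecodable byte is silently dropped, so the caller ends up
# working on a different filename without being told.
lenient = raw.decode("cp932", "ignore")

# strict: the same input is rejected with an exception.
try:
    raw.decode("cp932", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With ignore, two different bytes filenames can silently map to the same 
Unicode name; with strict, the caller finds out immediately.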

--

I would like to simplify posixmodule.c because I saw that it is difficult to 
patch it to fix bugs (you have to patch two functions, or more, for each fix). A 
recent example is the os.stat() symlink issue on Windows:
http://bugs.python.org/issue12084

Since Windows 2000, filenames are stored as Unicode internally (e.g. VFAT and 
NTFS use UTF-16); the ANSI API was kept only for backward compatibility (for 
lazy developers!). If you use bytes for filenames on Windows, you may get 
encode errors because the ANSI code page covers only a small subset of Unicode.
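
A quick way to see the subset problem, as a sketch: the helper encodable() is 
hypothetical, and cp1252 and cp932 are used here as two examples of ANSI code 
pages (Python ships both codecs on every platform).

```python
def encodable(name, encoding):
    """Return True if the filename can be represented in the given code page."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


name = "テスト.txt"

# Representable with a Japanese ANSI code page...
ok_japanese = encodable(name, "cp932")

# ...but not with a Western European one, so the bytes API cannot name it.
ok_western = encodable(name, "cp1252")
```

The same file is reachable through the Unicode API regardless of the 
configured code page.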

For your information, I have one last, huge pending patch to only use Unicode 
in the import machinery, issue #11619 ;-) Using this patch, you can use 
characters not encodable to the ANSI code page in your module name/path, yeah! 
But I am not completely convinced that we need this patch...

Victor