New proposition for Python3 bytes filename issue

Hi,

After reading the previous discussion, here is a new proposition.

Python 2.x and Windows are not affected by this issue. Only Python3 on POSIX (e.g. Linux or *BSD) is affected.

Some systems are broken, but Python has to be able to open/copy/move/remove files with an "invalid filename".

The issue can wait for Python 3.0.1 / 3.1.

Windows
-------

On Windows, we might reject bytes filenames for all file operations: open(), unlink(), os.path.join(), etc. (raise a TypeError or UnicodeError).

POSIX OS
--------

The default behaviour should be to use unicode and raise an error if conversion to unicode fails. It should also be possible to work with bytes, by passing bytes arguments or optional arguments (for getcwd).

- listdir(unicode) -> unicode and raise an error on invalid filename
- listdir(bytes) -> bytes
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
- open(): accept bytes or unicode

os.path.*() should accept operations on bytes filenames, but maybe not on bytes+unicode arguments. os.path.join('directory', b'filename'): raise an error (or use *implicit* conversion to bytes)?

When the user wants to display a filename on the screen, they can use:

   text = str(filename, fs_encoding, "replace")

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/
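As a rough illustration of the bytes-in/bytes-out listing and the "replace" display conversion described in the message above, here is a minimal sketch. It assumes a current Python 3 interpreter (where os.fsencode() is available); the helper name display_name is hypothetical and is not part of the proposal.

```python
import os
import sys
import tempfile

# Encoding Python uses for filenames ("fs_encoding" in the message above).
fs_encoding = sys.getfilesystemencoding()


def display_name(filename):
    """Return a printable str for a filename given as bytes or str."""
    if isinstance(filename, bytes):
        # Undecodable bytes are replaced instead of raising an error.
        return str(filename, fs_encoding, "replace")
    return filename


with tempfile.TemporaryDirectory() as tmp:
    # Create one ordinary file so the listings are not empty.
    with open(os.path.join(tmp, "example.txt"), "w"):
        pass

    text_names = os.listdir(tmp)               # str argument -> list of str
    byte_names = os.listdir(os.fsencode(tmp))  # bytes argument -> list of bytes

    for name in byte_names:
        print(display_name(name))
```

The converted text is meant for display only: once "replace" has substituted a character, the result can no longer be used to open the original file, so the bytes name should be kept for file operations.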

Patches are already available in issue #3187 (os.listdir):

On Monday 29 September 2008 14:07:55, Victor Stinner wrote:
- listdir(unicode) -> unicode and raise an error on invalid filename
Need raise_decoding_errors.patch (don't clear Unicode error)
- listdir(bytes) -> bytes
Always working.
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
Need merge_os_getcwd_getcwdu.patch

Note that the current implementation of getcwd() uses PyUnicode_FromString() to decode the directory, whereas getcwdu() uses the correct code (PyUnicode_Decode). So I merged both functions to keep only the correct version: getcwdu() => getcwd().
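To show why the choice of decoding function matters, here is a small Python-level sketch. PyUnicode_FromString() always decodes its input as UTF-8, whereas PyUnicode_Decode() takes an explicit encoding. The directory name and the ISO-8859-1 filesystem encoding below are assumptions chosen only for illustration.

```python
# A directory name stored on disk as ISO-8859-1 bytes (hypothetical example).
raw_cwd = b"/home/user/caf\xe9"

# Filesystem encoding assumed for this illustration only.
assumed_fs_encoding = "iso-8859-1"

# Decoding with the filesystem encoding (the PyUnicode_Decode approach) works.
decoded = raw_cwd.decode(assumed_fs_encoding)

# Decoding as UTF-8 regardless of the locale (what PyUnicode_FromString does)
# fails, because the byte 0xE9 does not form a complete UTF-8 sequence here.
try:
    raw_cwd.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(ascii(decoded), utf8_ok)
```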
- open(): accept bytes or unicode
Need io_byte_filename.patch (just remove a check)
os.path.*() should accept operations on bytes filenames, but maybe not on bytes+unicode arguments. os.path.join('directory', b'filename'): raise an error (or use *implicit* conversion to bytes)?
os.path.join() already rejects mixing bytes + str. But os.path.join(), glob.glob(), fnmatch.*(), etc. don't support bytes. I wrote some patches like:

- glob1_bytes.patch: Fix glob.glob() to accept invalid directory name
- fnmatch_bytes.patch: Patch fnmatch.filter() to accept bytes filenames

But I dislike both patches since they mix bytes and str. So this part still needs some work.

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/
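The behaviour under discussion (pure bytes accepted, bytes + str rejected) can be sketched as follows. This assumes a current Python 3 interpreter, where os.path.join() and fnmatch.filter() accept bytes; it does not reflect the state of the code or of the patches at the time of this message.

```python
import fnmatch
import os

# Pure bytes arguments are joined into a bytes path.
joined = os.path.join(b"directory", b"filename")

# Mixing str and bytes is rejected with TypeError.
try:
    os.path.join("directory", b"filename")
    mixed_rejected = False
except TypeError:
    mixed_rejected = True

# fnmatch.filter() with bytes names and a bytes pattern returns bytes names.
matches = fnmatch.filter([b"a.txt", b"b.py"], b"*.txt")

print(joined, mixed_rejected, matches)
```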

On Mon, Sep 29, 2008 at 6:07 AM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
The default behaviour should be to use unicode and raise an error if conversion to unicode fails. It should also be possible to use bytes using bytes arguments and optional arguments (for getcwd).
- listdir(unicode) -> unicode and raise an error on invalid filename
- listdir(bytes) -> bytes
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
Please let's not introduce boolean flags like this. How about ``getcwdb`` in parallel with the old ``getcwdu``?

Steve
--
I'm not *in*-sane. Indeed, I am so far *out* of sane that you appear a tiny blip on the distant coast of sanity.
        --- Bucky Katt, Get Fuzzy

On Monday 29 September 2008 17:16:47, Steven Bethard wrote:
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
Please let's not introduce boolean flags like this. How about ``getcwdb`` in parallel with the old ``getcwdu``?
Yeah, you're right. So I wrote a new patch: os_getcwdb.patch

With my patch we get (Python3):
* os.getcwd() -> unicode
* os.getcwdb() -> bytes

Previously in Python2 it was:
* os.getcwd() -> str (bytes)
* os.getcwdu() -> unicode

--
Victor Stinner aka haypo
http://www.haypocalc.com/blog/
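The resulting pair of functions can be tried as follows; a minimal sketch, assuming a current Python 3 interpreter where os.getcwdb() is available.

```python
import os

cwd_text = os.getcwd()    # current directory as str
cwd_bytes = os.getcwdb()  # current directory as bytes

print(type(cwd_text).__name__, type(cwd_bytes).__name__)
```

Two separate functions keep each return type fixed, which is the point of avoiding a boolean flag that would change the return type of a single function.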

On Mon, Sep 29, 2008 at 10:00 AM, Victor Stinner <victor.stinner@haypocalc.com> wrote:
On Monday 29 September 2008 17:16:47, Steven Bethard wrote:
- getcwd() -> unicode
- getcwd(bytes=True) -> bytes
Please let's not introduce boolean flags like this. How about ``getcwdb`` in parallel with the old ``getcwdu``?
Yeah, you're right. So I wrote a new patch: os_getcwdb.patch

With my patch we get (Python3):
* os.getcwd() -> unicode
* os.getcwdb() -> bytes

Previously in Python2 it was:
* os.getcwd() -> str (bytes)
* os.getcwdu() -> unicode
Why not do:
* os.getcwd() -> unicode
* posix.getcwdb() -> bytes

os gets the standard version and posix has an (unambiguously named) platform-specific version.

--
Adam Olsen, aka Rhamphoryncus

The default behaviour should be to use unicode and raise an error if conversion to unicode fails. It should also be possible to use bytes using bytes arguments and optional arguments (for getcwd).
I'm still opposed to allowing bytes as file names at all in 3k. Python should really strive for providing a uniform datatype, and that should be the character string type. For applications that cannot trust that the conversion works always correctly on POSIX systems, sys.setfilesystemencoding should be provided.

In the long run, need for explicit calls to this function should be reduced, by
a) systems getting more consistent in their file name encoding, and
b) Python providing better defaults for detecting the file name encoding, and better round-trip support for non-encodable bytes.

Part b) is probably out-of-scope for 3.0 now, but should be reconsidered for 3.1.

Regards,
Martin
participants (4)
- "Martin v. Löwis"
- Adam Olsen
- Steven Bethard
- Victor Stinner