Introduce some obvious way to encode and decode filenames from Python code

Currently, there is no obvious way to encode a filename in the default filesystem encoding. To pipe some filenames to the stdin of a subprocess, I effectively used

    encoded_name = file_name.encode(sys.getfilesystemencoding())

which mostly worked. There are cases where this fails, though: on Linux with LANG=C and filenames that contain non-ASCII characters, for example, or in any situation where the default filesystem encoding can't decode a filename. The correct way to do this seems to be something like

    if sys.platform == "win32":
        errors = "strict"
    else:
        errors = "surrogateescape"
    encoded_name = file_name.encode(sys.getfilesystemencoding(), errors=errors)

I think there should be (1) some documentation on the issue and (2) a more obvious way to encode filenames.

1. The most useful reference I could find in the docs is http://docs.python.org/dev/c-api/unicode.html#file-system-encoding and there is a short paragraph at http://docs.python.org/dev/library/os.html#file-names-command-line-arguments... The filename encoding applies to basically all Python library functions (including built-ins like `open()`) and should probably be documented at a more prominent spot. The "surrogateescape" error handler isn't mentioned at http://docs.python.org/dev/howto/unicode.html#unicode-filenames

2. There should be some way to access the C API functions for decoding and encoding filenames from Python. I don't have a good idea how to do this; maybe by adding a meta-encoding "filesystem", or by adding functions to the standard library.

Did I miss something? Any thoughts?

Cheers,
Sven
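To make the failure mode described above concrete, here is a minimal sketch. It hard-codes UTF-8 instead of calling sys.getfilesystemencoding() so that it behaves the same on every machine; the sample filename and the helper name encode_name are made up for illustration and are not part of any library.

```python
# Bytes as they might come back from a Linux filesystem: 0xE9 is Latin-1
# for "e with acute accent" and is not valid UTF-8 in this position.
raw_name = b"caf\xe9.txt"

# "surrogateescape" (PEP 383) maps the undecodable byte 0xE9 to the lone
# surrogate U+DCE9 instead of raising an error.
file_name = raw_name.decode("utf-8", errors="surrogateescape")


def encode_name(name, encoding="utf-8", errors="surrogateescape"):
    """Hypothetical helper: encode a name, turning escaped surrogates back into bytes."""
    return name.encode(encoding, errors=errors)


# A strict encode cannot represent the lone surrogate and fails.
try:
    file_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the same error handler used for decoding, the original bytes come back.
round_tripped = encode_name(file_name)
```

The point of the sketch is only that decoding and encoding have to use the same error handler for the bytes to survive the round trip.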

On Mon, 16 Jul 2012 15:49:52 +0100 Sven Marnach <sven@marnach.net> wrote:
Well, how about os.fsencode() and os.fsdecode()? http://docs.python.org/dev/library/os.html#os.fsencode Regards Antoine. -- Software development and contracting: http://pro.pitrou.net

Antoine Pitrou schrieb am Mon, 16. Jul 2012, um 17:49:56 +0200:
Oh, great, there they are! I think these functions should be mentioned in these sections to make them easier to find:

[1]: http://docs.python.org/dev/library/os.html#file-names-command-line-arguments...
[2]: http://docs.python.org/dev/library/sys.html#sys.getfilesystemencoding
[3]: http://docs.python.org/dev/howto/unicode.html#unicode-filenames

I'll post an issue on the issue tracker.

Cheers,
Sven

On 16/07/12 18:23, Victor Stinner wrote:
Sure. But should we be encouraging their use on Windows? I would have thought it the best thing to stick with the Unicode string for paths on NT, so that the native Win32 Unicode APIs are used instead of the ANSI-code-page-bound C stdio. Encoding down to the fsencoding for Windows just means that any path including a character that isn't in the ANSI CP will fail. In lieu of some kind of abstract filepath object that could represent either bytes or str (depending on platform), how about a function that takes a str and only encodes it to bytes if the platform requires it? cheers, -- And Clover mailto:and@doxdesk.com http://www.doxdesk.com/ gtalk:chat?jid=bobince@gmail.com

On Tue, 17 Jul 2012 00:00:32 +0100 And Clover <and-dev@doxdesk.com> wrote:
Well even under Unix, these functions are only useful for very specialized cases. For normal usage, PEP 383 guarantees that all filenames, including theoretically undecodable ones, pass through properly. When piping filenames between Python processes, you can use whatever encoding you want (or you can also use json or pickle). The only remaining use case is sending some filenames to an external (non-Python) program over a bytes stream, or reading some filenames emitted by such a program. Here, you need bytes under Windows as well. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
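The remaining use case named above, sending filenames to an external program over a bytes stream, might look roughly like the following sketch. The function name send_names, the one-name-per-line framing, and the echoing child process are assumptions made for illustration; a real external tool defines its own input format, and names containing newlines would need different framing.

```python
import os
import subprocess
import sys


def send_names(names, argv):
    """Hypothetical helper: write one encoded filename per line to the
    child's stdin and return whatever the child wrote to stdout, as bytes."""
    payload = b"\n".join(os.fsencode(name) for name in names) + b"\n"
    result = subprocess.run(argv, input=payload, stdout=subprocess.PIPE, check=True)
    return result.stdout


# Stand-in for the external program: a child that copies stdin to stdout
# byte for byte.
echo_child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

output = send_names(["a.jpg", "b.jpg"], echo_child)

# Names read back from such a program are decoded with os.fsdecode().
names_back = [os.fsdecode(line) for line in output.splitlines()]
```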

Victor Stinner, 17.07.2012 03:03:
That's not the main use case I see, though. When talking to C libraries, for example, they will usually require a byte encoded file path and also return one. Getting the encoding right in this case is really not trivial. I would expect that the above functions do "the right thing" also on Windows here, unless the library really has a win32 specific file API (and that's not likely). Stefan

MRAB schrieb am Tue, 17. Jul 2012, um 22:52:57 +0100:
I don't have control over the other process (it's ExifTool in batch mode), so I have to use whatever encoding is considered the standard to encode filenames on Windows. `os.fsencode()` works fine for this, and Victor answered off-list that it would be fine in this case. Cheers, Sven

os.fsencode()/fsdecode() are not specific to filesystems: you can use these functions to encode/decode command line arguments, environment variables, text from/to a console (sys.std*), etc. The "fs" letters in the name come from the encoding used by these functions: sys.get*filesystem*encoding(). For example, os.fsencode() is used by the subprocess module and posixpath.expanduser(), and os.fsdecode() is used by os.get_exec_path() and shutil.rmtree(). Victor
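As a small illustration of the two functions discussed in this thread, the sketch below uses an ASCII-only sample name so the result does not depend on the platform's filesystem encoding. It assumes Python 3.6 or later for sys.getfilesystemencodeerrors(); the sample name is made up.

```python
import os
import sys

name = "report.txt"

# str -> bytes and bytes -> str, using the filesystem encoding and its
# error handler.
encoded = os.fsencode(name)
decoded = os.fsdecode(encoded)

# Each function passes through values that already have the target type.
same_bytes = os.fsencode(b"report.txt")
same_str = os.fsdecode("report.txt")

# Spelling out what os.fsencode() does for a str argument.
encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()  # assumes Python 3.6+
manual = name.encode(encoding, errors)
```

Which encoding and error handler are reported depends on the platform and locale, which is exactly why the helper functions are easier to get right than a hand-written encode call.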

participants (8)
- And Clover
- Antoine Pitrou
- Mark Lawrence
- MRAB
- Ned Batchelder
- Stefan Behnel
- Sven Marnach
- Victor Stinner