Introduce some obvious way to encode and decode filenames from Python code
Currently, there is no obvious way to encode a filename in the default filesystem encoding. To pipe some filenames to the stdin of a subprocess, I effectively used

    encoded_name = file_name.encode(sys.getfilesystemencoding())

which mostly worked. There are cases where this fails, though: on Linux with LANG=C and filenames that contain non-ASCII characters, for example, or in any situation where the default filesystem encoding can't decode a filename. The correct way to do this seems to be something like

    if sys.platform == "win32":
        errors = "strict"
    else:
        errors = "surrogateescape"
    encoded_name = file_name.encode(sys.getfilesystemencoding(), errors=errors)

I think there should be (1) some documentation on the issue and (2) a more obvious way to encode filenames.

1. The most useful reference I could find in the docs is http://docs.python.org/dev/c-api/unicode.html#file-system-encoding and there is a short paragraph at http://docs.python.org/dev/library/os.html#file-names-command-line-arguments... The filename encoding applies to basically all Python library functions (including built-ins like `open()`) and should probably be documented at a more prominent spot. The "surrogateescape" error handler isn't mentioned here: http://docs.python.org/dev/howto/unicode.html#unicode-filenames

2. There should be some way to access the C API functions for decoding and encoding filenames from Python. I don't have a good idea how to do this; maybe by adding a meta-encoding "filesystem", or by adding functions to the standard library.

Did I miss something? Any thoughts?

Cheers,
Sven
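[Editorial illustration] A minimal sketch of what the "surrogateescape" error handler mentioned above does, assuming UTF-8 as the codec purely for the sake of the example. The byte string and variable names are invented for illustration and do not come from the thread.

```python
# A filename as raw bytes that is not valid UTF-8 (0xE9 is 'é' in Latin-1).
raw_name = b"caf\xe9.txt"

# With errors="strict" this decode would raise UnicodeDecodeError.
# "surrogateescape" instead maps each undecodable byte 0xNN to the
# lone surrogate code point U+DCNN, so no information is lost.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")
```

The resulting str cannot be encoded with errors="strict", which is why the error handler has to be chosen per platform as in the snippet above.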
On Mon, 16 Jul 2012 15:49:52 +0100 Sven Marnach <sven@marnach.net> wrote:
Currently, there is no obvious way to encode a filename in the default filesystem encoding. To pipe some filenames to the stdin of a subprocess, I effectively used
encoded_name = file_name.encode(sys.getfilesystemencoding())
Well, how about os.fsencode() and os.fsdecode()?
http://docs.python.org/dev/library/os.html#os.fsencode

Regards

Antoine.

--
Software development and contracting: http://pro.pitrou.net
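[Editorial illustration] A short sketch of the two functions suggested above. The filename is hypothetical, and the manual equivalent assumes Python 3.6 or later, where sys.getfilesystemencodeerrors() is available.

```python
import os
import sys

file_name = "report.txt"  # hypothetical filename, for illustration only

# str -> bytes, using the filesystem encoding and its error handler.
encoded_name = os.fsencode(file_name)

# bytes -> str, the inverse operation.
decoded_name = os.fsdecode(encoded_name)

# Roughly what os.fsencode() does for a str argument (Python 3.6+).
manual = file_name.encode(
    sys.getfilesystemencoding(), sys.getfilesystemencodeerrors()
)
```

Both functions also accept a value that is already of the target type and return it unchanged, so they can be applied without checking the input type first.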
Antoine Pitrou wrote on Mon, 16 Jul 2012, at 17:49:56 +0200:
Well, how about os.fsencode() and os.fsdecode()?
Oh, great, there they are! I think these functions should be mentioned in these sections to make them easier to find:

[1]: http://docs.python.org/dev/library/os.html#file-names-command-line-arguments...
[2]: http://docs.python.org/dev/library/sys.html#sys.getfilesystemencoding
[3]: http://docs.python.org/dev/howto/unicode.html#unicode-filenames

I'll post an issue on the issue tracker.

Cheers,
Sven
Hi,

I wrote these functions when I worked on this topic for Python 3. Yes, it would be great if you wrote a patch to mention these functions in the doc. Someone also complained that the surrogateescape error handler is not mentioned in any FS-related function.

Victor
On 16/07/12 18:23, Victor Stinner wrote:
I wrote these functions when I worked in this topic for Python 3. Yes, it would be great if you write a patch to mention these functions in the doc.
Sure.

But should we be encouraging their use on Windows? I would have thought it the best thing to stick with the Unicode string for paths on NT, so that the native Win32 Unicode APIs are used instead of the ANSI-code-page-bound C stdio. Encoding down to the fsencoding for Windows just means that any path including a character that isn't in the ANSI CP will fail.

In lieu of some kind of abstract filepath object that could represent either bytes or str (depending on platform), how about a function that takes a str and only encodes it to bytes if the platform requires it?

cheers,

--
And Clover
mailto:and@doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bobince@gmail.com
On Tue, 17 Jul 2012 00:00:32 +0100 And Clover <and-dev@doxdesk.com> wrote:
Sure.
But should we be encouraging their use on Windows? I would have thought it the best thing to stick with the Unicode string for paths on NT, so that the native Win32 Unicode APIs are used instead of the ANSI-code-page-bound C stdio. Encoding down to the fsencoding for Windows just means that any path including a character that isn't in the ANSI CP will fail.
Well, even under Unix, these functions are only useful for very specialized cases. For normal usage, PEP 383 guarantees that all filenames, including theoretically undecodable ones, pass through properly. When piping filenames between Python processes, you can use whatever encoding you want (or you can also use json or pickle).

The only remaining use case is sending some filenames to an external (non-Python) program over a bytes stream, or reading some filenames emitted by such a program. Here, you need bytes under Windows as well.

Regards

Antoine.

--
Software development and contracting: http://pro.pitrou.net
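[Editorial illustration] A sketch of the remaining use case described above: sending filenames to another program over a bytes stream. The helper name send_filenames, the one-name-per-line framing, and the echoing child process are assumptions made for this example; a real external tool defines its own input format.

```python
import os
import subprocess
import sys

def send_filenames(file_names, command):
    """Write one encoded filename per line to the stdin of `command`.

    Returns the raw bytes the command wrote to stdout. The newline
    framing is an assumption and breaks for names containing newlines.
    """
    payload = b"\n".join(os.fsencode(name) for name in file_names) + b"\n"
    result = subprocess.run(
        command, input=payload, stdout=subprocess.PIPE, check=True
    )
    return result.stdout

# Stand-in for an external tool: a child process that echoes stdin bytes.
echo_command = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

output = send_filenames(["a.txt", "b.txt"], echo_command)

# Reading names emitted by such a program is the reverse operation.
names_back = [os.fsdecode(line) for line in output.splitlines()]
```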
2012/7/17 And Clover <and-dev@doxdesk.com>:
But should we be encouraging their use on Windows? I would have thought it the best thing to stick with the Unicode string for paths on NT, so that the native Win32 Unicode APIs are used instead of the ANSI-code-page-bound C stdio. Encoding down to the fsencoding for Windows just means that any path including a character that isn't in the ANSI CP will fail.
os.fsencode() should not be used explicitly on Windows.
In lieu of some kind of abstract filepath object that could represent either bytes or str (depending on platform), how about a function that takes a str and only encodes it to bytes if the platform requires it?
You can use the str (Unicode) type on all platforms with Python 3, so use os.fsdecode(). os.listdir(str) does return str filenames on any platform, for example.

Victor
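[Editorial illustration] A small sketch of the os.listdir() behaviour mentioned above: the type of the returned names follows the type of the argument. The temporary directory and the file name are made up for the example, and the bytes variant is shown only for contrast, since the thread advises sticking to str on Windows.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create one empty file so the directory listing is not empty.
    open(os.path.join(tmp_dir, "example.txt"), "w").close()

    # str argument -> list of str names (undecodable bytes would be
    # represented through surrogateescape, per PEP 383).
    str_names = os.listdir(tmp_dir)

    # bytes argument -> list of bytes names.
    bytes_names = os.listdir(os.fsencode(tmp_dir))
```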
Victor Stinner, 17.07.2012 03:03:
You can use the str (Unicode) type on all platforms with Python 3, so use os.fsdecode(). os.listdir(str) does return str filenames on any platform for example.
That's not the main use case I see, though. When talking to C libraries, for example, they will usually require a byte encoded file path and also return one. Getting the encoding right in this case is really not trivial. I would expect that the above functions do "the right thing" also on Windows here, unless the library really has a win32 specific file API (and that's not likely).

Stefan
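[Editorial illustration] A sketch of the C library case described above, assuming a POSIX system where the C library's access() function can be reached through ctypes. The helper name c_path_exists is hypothetical, and the sketch deliberately does not cover Windows, which is the case under discussion.

```python
import ctypes
import os

def c_path_exists(path):
    """Ask the C library's access() whether `path` exists (POSIX only)."""
    if os.name != "posix":
        raise NotImplementedError("this sketch assumes a POSIX C library")
    # CDLL(None) exposes symbols already loaded into the process,
    # which includes the C library on typical POSIX systems.
    libc = ctypes.CDLL(None)
    libc.access.argtypes = [ctypes.c_char_p, ctypes.c_int]
    libc.access.restype = ctypes.c_int
    # The C function takes a byte string, so the str path is encoded
    # with the filesystem encoding before crossing the boundary.
    return libc.access(os.fsencode(path), os.F_OK) == 0
```

A path returned by such a library as bytes would go through os.fsdecode() to become a str again.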
On 17/07/2012 22:25, Sven Marnach wrote:
Victor Stinner wrote on Tue, 17 Jul 2012, at 03:03:24 +0200:
os.fsencode() should not be used explicitly on Windows.
What else should I do to pipe filenames to another process? At least, os.fsencode() seems to work, even with cyrillic filenames.
Encode to UTF-8?
MRAB wrote on Tue, 17 Jul 2012, at 22:52:57 +0100:
Encode to UTF-8?
I don't have control over the other process (it's ExifTool in batch mode), so I have to use whatever encoding is considered the standard to encode filenames on Windows. `os.fsencode()` works fine for this, and Victor answered off-list that it would be fine in this case.

Cheers,
Sven
On 7/16/2012 11:49 AM, Antoine Pitrou wrote:
On Mon, 16 Jul 2012 15:49:52 +0100 Sven Marnach <sven@marnach.net> wrote:
Currently, there is no obvious way to encode a filename in the default filesystem encoding. To pipe some filenames to the stdin of a subprocess, I effectively used
encoded_name = file_name.encode(sys.getfilesystemencoding())
Well, how about os.fsencode() and os.fsdecode()?
http://docs.python.org/dev/library/os.html#os.fsencode

It's too bad these are not called os.path.encode() and os.path.decode(), since they fit so nicely into os.path's charter of manipulating strings representing file paths.

--Ned.
Well, how about os.fsencode() and os.fsdecode()?
It's too bad these are not called os.path.encode() and os.path.decode(), since they fit so nicely into os.path's charter of manipulating strings representing file paths.
os.fsencode()/fsdecode() are not specific to filesystems: you can use these functions to encode/decode command line arguments, environment variables, text from/to a console (sys.std*), etc. The "fs" letters in the name come from the encoding used by these functions: sys.get*filesystem*encoding().

For example, os.fsencode() is used by the subprocess module and posixpath.expanduser(), and os.fsdecode() is used by os.get_exec_path() and shutil.rmtree().

Victor
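[Editorial illustration] A sketch of the non-filesystem use mentioned above, here with an environment variable. The variable name DEMO_PATH and its value are invented for the example; os.environb exists only where os.supports_bytes_environ is true (POSIX, not Windows).

```python
import os

# Hypothetical variable, set here only so the example is self-contained.
os.environ["DEMO_PATH"] = "/tmp/demo"

# The str value converted to the bytes the operating system would see.
value_as_bytes = os.fsencode(os.environ["DEMO_PATH"])

# Where a bytes view of the environment exists, it holds the same bytes.
if os.supports_bytes_environ:
    same_as_environb = os.environb[b"DEMO_PATH"] == value_as_bytes

# The same conversion applies to command line arguments, for example
# os.fsencode(arg) for each arg in sys.argv.
```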
participants (8)
- And Clover
- Antoine Pitrou
- Mark Lawrence
- MRAB
- Ned Batchelder
- Stefan Behnel
- Sven Marnach
- Victor Stinner