[Python-Dev] Filename as byte string in python 2.6 or 3.0?

Tue Sep 30 11:45:31 CEST 2008

Adam Olsen wrote:
> Lossy conversion just moves around what gets treated as garbage.  As
> all valid unicode scalars can be round tripped, there's no way to
> create a valid unicode file name without being lossy.  The alternative
> is to not be valid unicode, but since we can't use such objects with
> external libs, can't even print them, we might as well call them
> something else.  We already have a name for that: bytes.

To my mind, there are two kinds of app in the world when it comes to
file paths:
1) "Normal" apps (e.g. a word processor), that are only interested in
files with sane, well-formed file names that can be properly decoded to
Unicode with the filesystem encoding identified by Python. If there is
invalid data on the filesystem, they don't care and don't want to see it
or have to deal with it.
2) "Filesystem" apps (e.g. a filesystem explorer), that need to be able
to deal with malformed filenames that may not decode properly using the
identified filesystem encoding.

For the former category of apps, the presence of a malformed filename
should NOT disrupt the processing of well-formed files and directories.
Those applications should "just work", even if the underlying filesystem
has a few broken filenames.

The latter category of applications needs some way of defining its own
application-specific handling of malformed names.

That screams "callback" to me - and one mechanism to achieve that would
be to expose the Unicode "errors" argument for filesystem operations
that return file paths (e.g. os.getcwd(), os.listdir(), os.readlink(),
os.walk()).

Once that was exposed, the existing error handling machinery in the
codecs module could be used to allow applications to define their own
custom error handling for Unicode decode errors in these operations.
(e.g. call "codecs.register_error('bad_filepath',
handle_filepath_error)", then pass "errors='bad_filepath'" in the
relevant os API calls.)
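
As a rough sketch of how that could look: the handler below is
registered with codecs.register_error() and then exercised through
bytes.decode(), since the os functions do not take an "errors" argument
today (that part is the proposal). Only the names 'bad_filepath' and
handle_filepath_error come from the text above; the handler body and its
%XX escaping are assumptions chosen purely for illustration.

```python
import codecs


def handle_filepath_error(exc):
    """Illustrative policy: show each undecodable byte as a %XX escape."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch only deals with decoding; re-raise anything else.
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    replacement = "".join("%%%02X" % b for b in bad_bytes)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end


codecs.register_error("bad_filepath", handle_filepath_error)

# A Latin-1 encoded name that is not valid UTF-8.
raw_name = b"caf\xe9.txt"
decoded = raw_name.decode("utf-8", errors="bad_filepath")
print(decoded)
```

A filesystem explorer could register a different handler (for example
one that records the raw bytes somewhere) without the os module needing
to know anything about that policy.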

The default handling could be left at "strict", with os.listdir() and
os.walk() specifically ignoring path entries that trigger
UnicodeDecodeError.
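
A minimal sketch of that default, assuming the raw directory entries are
available as bytes. decode_listing() is a hypothetical helper, not an
existing os function; it just shows "strict" decoding combined with
skipping the entries that fail, while any other "errors" value is passed
straight through to the codec.

```python
def decode_listing(raw_names, encoding="utf-8", errors="strict"):
    """Decode raw directory entries, skipping those that fail to decode.

    Hypothetical helper: with errors="strict", malformed names are
    dropped instead of aborting the whole listing. Other error handlers
    (built-in or registered via codecs.register_error) decide for
    themselves what to return.
    """
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding, errors))
        except UnicodeDecodeError:
            # A "normal" app never sees the malformed entry.
            continue
    return names


entries = [b"ok.txt", b"bad\xff.txt", b"also-ok"]
visible = decode_listing(entries)
everything = decode_listing(entries, errors="replace")
```

On Python 3 the input could come from os.listdir(b"."), which returns
bytes when given a bytes path, with sys.getfilesystemencoding() as the
encoding.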

getcwd() and readlink() could just propagate the exception, since they
have no other information to return.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org