[Python-Dev] Filename as byte string in python 2.6 or 3.0?

Tue Sep 30 16:22:11 CEST 2008

On Tue, Sep 30, 2008 at 2:45 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Adam Olsen wrote:
>> Lossy conversion just moves around what gets treated as garbage.  As
>> all valid unicode scalars can be round tripped, there's no way to
>> create a valid unicode file name without being lossy.  The alternative
>> is to not be valid unicode, but since we can't use such objects with
>> external libs, can't even print them, we might as well call them
>> something else.  We already have a name for that: bytes.
>
> To my mind, there are two kinds of app in the world when it comes to
> file paths:
> 1) "Normal" apps (e.g. a word processor), that are only interested in
> files with sane, well-formed file names that can be properly decoded to
> Unicode with the filesystem encoding identified by Python. If there is
> invalid data on the filesystem, they don't care and don't want to see it
> or have to deal with it.
> 2) "Filesystem" apps (e.g. a filesystem explorer), that need to be able
> to deal with malformed filenames that may not decode properly using the
> identified filesystem encoding.
>
> For the former category of apps, the presence of a malformed filename
> should NOT disrupt the processing of well-formed files and directories.
> Those applications should "just work", even if the underlying filesystem
> has a few broken filenames.

Right. Totally agreed.

> The latter category of applications need some way of defining their own
> application-specific handling of malformed names.

Agreed again.

> That screams "callback" to me - and one mechanism to achieve that would
> be to expose the unicode "errors" argument for filesystem operations
> that return file paths (e.g. os.getcwd(), os.listdir(), os.readlink(),
> os.walk()).

Hm. This doesn't scream callback to me at all. I would never have
thought of callbacks for this use case -- and I don't think it's a
good idea. The callback would either be an extra argument to all
system calls (bad, ugly etc., and why not go with the existing unicode
encoding and error flags if we're adding extra args?) or would be
global, where I'd be worried that it might interfere with the proper
operation of library code that is several abstractions away from
whoever installed the callback, not under their control, and not
expecting the callback.

I suppose I may have totally misunderstood your proposal, but in
general I find callbacks unwieldy.

> Once that was exposed, the existing error handling machinery in the
> codecs module could be used to allow applications to define their own
> custom error handling for Unicode decode errors in these operations.
> (e.g. set "codecs.register_error('bad_filepath',
> handle_filepath_error)", then use "errors='bad_filepath'" in the
> relevant os API calls)
>
> The default handling could be left at "strict", with os.listdir() and
> os.walk() specifically ignoring path entries that trigger
> UnicodeDecodeError.
>
> getcwd() and readlink() could just propagate the exception, since they
> have no other information to return.
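
For concreteness, here is a minimal sketch of the mechanism described
above. It uses only codecs.register_error(), which exists today. The
handler body, the decode_names() helper, and the sample byte strings are
hypothetical, chosen for illustration; the os functions do not take an
"errors" argument, so the sketch decodes raw byte names itself rather
than calling os.listdir().

```python
import codecs


def handle_filepath_error(exc):
    """Hypothetical handler: replace each undecodable byte with U+FFFD."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume decoding.
    return ("\ufffd" * len(bad), exc.end)


# Register the handler under the name used in the proposal.
codecs.register_error("bad_filepath", handle_filepath_error)


def decode_names(raw_names, encoding="utf-8", errors="strict"):
    """Decode byte file names, skipping any that raise UnicodeDecodeError.

    Stands in for the proposed os.listdir() behaviour; not a real os API.
    """
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding, errors))
        except UnicodeDecodeError:
            # With "strict", malformed names are silently dropped.
            continue
    return decoded


# Sample directory contents (assumed): the last name is not valid UTF-8.
raw = [b"report.txt", b"caf\xc3\xa9.txt", b"bad\xff.txt"]

normal = decode_names(raw)                           # "normal" app view
explorer = decode_names(raw, errors="bad_filepath")  # "filesystem" app view
```

With the default "strict" handling the malformed entry simply disappears
from the listing, while the custom handler keeps it with a replacement
character, so the caller can at least show that something is there.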

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)