[Python-Dev] Filename as byte string in python 2.6 or 3.0?

27 Sep 2008

Hi,

I read that Python 2.6 is planned for Wednesday. One bug is still open and 
important for me: Python 2.6/3.0 are unable to use filenames as byte strings.
    http://bugs.python.org/issue3187

The problem
===========

On Windows, all filenames are unicode strings (I guess UTF-16-LE), but on UNIX, 
for historical reasons, filenames are byte strings. On Linux, you can expect 
valid UTF-8 filenames, but sometimes (e.g. after copying from a FAT32 USB key 
to an ext3 filesystem) you get an invalid filename: a byte string in a charset 
different from your default filesystem encoding (UTF-8).

Python functions using filenames
================================

In Python, you have (incomplete list):
 - filename producer: os.listdir(), os.walk(), glob.glob()
 - filename manipulation: os.path.*()
 - access file: open(), os.unlink(), shutil.rmtree()

If you give unicode to a producer, it returns unicode _or_ byte strings (the 
type may change for each filename :-/). Guido proposed to break this behaviour: 
raise an exception if the unicode conversion fails. We may consider an option 
like "skip invalid".

If you give bytes to a producer, it only returns byte strings. Great.
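
To illustrate the "bytes in, bytes out" rule for producers, here is a small 
sketch. It assumes a current Python 3 interpreter (not 2.6/3.0 as discussed in 
this mail); the helper name listdir_types() and the scratch directory are only 
for illustration.

```python
import os
import tempfile


def listdir_types(directory):
    """Return the set of type names that os.listdir() yields for directory."""
    return {type(name).__name__ for name in os.listdir(directory)}


# Hypothetical scratch directory holding a single plain ASCII file.
scratch = tempfile.mkdtemp()
with open(os.path.join(scratch, "example.txt"), "w"):
    pass

# A bytes argument should only ever produce bytes names,
# and a str argument should only produce str names.
bytes_result = listdir_types(os.fsencode(scratch))
text_result = listdir_types(scratch)
```

The point is that the caller chooses the type once, and never has to check it 
for each returned filename.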

Filename manipulation: in Python 2.6/3.0, os.path.*() is not compatible with 
the type "bytes". So you cannot use os.path.join(<your unicode path>, <bytes 
filename>) *nor* os.path.join(<your bytes path>, <bytes filename>), because 
os.path.join() (e.g. the posix version) uses path.endswith('/').
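
A minimal sketch of how a join could work for both types: pick the separator 
from the type of the arguments instead of hard-coding '/'. This is not the 
code of posix_path_bytes.patch; join_same_type() is a hypothetical helper, 
written for Python 3, that only handles two components.

```python
def join_same_type(base, name):
    """Join two POSIX path components that are both str or both bytes."""
    if type(base) is not type(name):
        # Refuse to mix bytes and characters instead of guessing a charset.
        raise TypeError("cannot mix bytes and unicode path components")
    # Choose a separator of the same type as the components.
    sep = b"/" if isinstance(base, bytes) else "/"
    if name.startswith(sep):
        # An absolute second component replaces the first one.
        return name
    if not base or base.endswith(sep):
        return base + name
    return base + sep + name
```

For example, join_same_type(b"/tmp/test", b"a\xffb") keeps the invalid byte 
untouched, and mixing "/tmp" with b"x" raises TypeError.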

Access file: open() rejects the type bytes (it's just a test; open() supports 
bytes if you remove the test). As I remember, unlink() is compatible with 
bytes. But rmtree() fails because it uses os.path.join() (even if you give a 
bytes directory, join() fails).

Solutions
=========

 - producer: unicode => *only* unicode // bytes => bytes
 - manipulation: support both unicode and bytes, but avoid mixing bytes and
   characters whenever possible
 - open(): allow bytes

I implemented these solutions as a patch set attached to the issue #3187:
 * posix_path_bytes.patch: fix posixpath.join() to support bytes
 * io_byte_filename.patch: open() allows bytes filename
 * fnmatch_bytes.patch: patch fnmatch.filter() to accept bytes filenames
 * glob1_bytes.patch: fix glob.glob() to accept invalid directory name

Mmmh, there is no patch yet to make os.listdir() stop on an invalid filename.

Priority
========

I think that the problem is important because it's a regression from 2.5 to 
2.6/3.0. Python 2.5 uses bytes filenames, so it was possible to 
open/unlink "invalid" unicode strings (since they are not unicode but bytes).

Well, if it's too late for the final versions, this problem should at least be 
fixed quickly.

Test the problem
================

Example to create invalid filenames on Linux:

$ mkdir /tmp/test
$ cd /tmp/test
$ touch $(echo -e "a\xffb")
$ mkdir $(echo -e "dir\xffname")
$ touch $(echo -e "dir\xffname/file")
$ find
.
./a?b
./dir?name
./dir?name/file

Python 2.5:
>>> import os, shutil
>>> os.listdir('.')
['a\xffb', 'dir\xffname']
>>> open(os.listdir('.')[0]).close()  # open file: ok
>>> os.unlink(os.listdir('.')[0])     # remove file: ok
>>> os.listdir('.')
['dir\xffname']
>>> shutil.rmtree(os.listdir('.')[0])  # remove dir: ok

Wrong solutions
===============

New type
--------

I proposed an ugly type "InvalidFilename" mixing bytes and characters. As 
everybody using unicode knows, it's a bad idea :-) (and it introduces a new 
type).

Convert bytes to unicode (replace)
----------------------------------

unicode_filename = unicode(bytes_filename, charset, "replace")

Ok, you will get valid unicode strings which can be used in os.path.join() & 
friends, but open() or unlink() will fail because this filename doesn't 
exist!
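
A short sketch of why "replace" loses the real name. It is written in Python 3 
spelling, where bytes.decode(charset, "replace") plays the role of 
unicode(bytes_filename, charset, "replace"); the charset is assumed to be 
UTF-8 and the sample name is the one from the test above.

```python
# Filename as stored on disk: 0xff is not valid UTF-8.
raw_name = b"a\xffb"

# Lossy conversion: the invalid byte becomes U+FFFD (replacement character).
text_name = raw_name.decode("utf-8", "replace")

# Encoding the text again does not give back the bytes stored on disk,
# so the operating system would look for a different file.
round_tripped = text_name.encode("utf-8")
name_survives = (round_tripped == raw_name)
```

Here name_survives is False: the replacement character is encoded as three 
other bytes, so the original name cannot be recovered from the unicode string.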

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/