Hi,

I read that Python 2.6 is planned for Wednesday. One bug is still open
and is important to me: Python 2.6/3.0 are unable to use byte strings as
filenames.
http://bugs.python.org/issue3187

The problem
===========

On Windows, all filenames are unicode strings (I guess UTF-16-LE), but on
UNIX, for historical reasons, filenames are byte strings. On Linux, you
can expect valid UTF-8 filenames, but sometimes (e.g. after copying from a
FAT32 USB key to an ext3 filesystem) you get an invalid filename: a byte
string in a charset other than your default filesystem encoding (UTF-8).

Python functions using filenames
================================

In Python, you have (incomplete list):
 - filename producers: os.listdir(), os.walk(), glob.glob()
 - filename manipulation: os.path.*()
 - file access: open(), os.unlink(), shutil.rmtree()

If you give unicode to a producer, it returns unicode _or_ byte strings
(the type may change for each filename :-/). Guido proposed to break this
behaviour: raise an exception if the unicode conversion fails. We may
consider an option like "skip invalid". If you give bytes to a producer,
it only returns byte strings. Great.

Filename manipulation: in Python 2.6/3.0, os.path.*() is not compatible
with the type "bytes". So you cannot use
os.path.join(<your unicode path>, <bytes filename>) *nor*
os.path.join(<your bytes path>, <bytes filename>), because os.path.join()
(e.g. the posix version) uses path.endswith('/').

File access: open() rejects the type bytes (it's only a type check: open()
supports bytes if you remove the check). As I remember, unlink() is
compatible with bytes. But rmtree() fails because it uses os.path.join()
(even if you give a bytes directory, join() fails).

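To make the join() failure concrete, here is a minimal sketch under
Python 3 semantics, where bytes and str are distinct types. The helper
names (ends_with_sep_naive, ends_with_sep, join_two) are hypothetical and
exist only for illustration; this is not the real posixpath code, nor the
code of any patch attached to the issue.

```python
def ends_with_sep_naive(path):
    # Same kind of check as path.endswith('/'): the '/' literal is a str,
    # so calling this with a bytes path raises TypeError in Python 3.
    return path.endswith('/')


def ends_with_sep(path):
    # Hypothetical type-aware variant: use a separator of the same type
    # as the path itself.
    sep = b'/' if isinstance(path, bytes) else '/'
    return path.endswith(sep)


def join_two(a, b):
    """Toy two-component join (NOT the real posixpath.join)."""
    # Refuse to mix bytes and characters instead of guessing an encoding.
    if isinstance(a, bytes) != isinstance(b, bytes):
        raise TypeError("cannot mix bytes and str path components")
    sep = b'/' if isinstance(a, bytes) else '/'
    if b.startswith(sep):
        # An absolute second component replaces the first one.
        return b
    if not a or a.endswith(sep):
        return a + b
    return a + sep + b
```

With these helpers, ends_with_sep_naive(b"dir/") raises TypeError, while
join_two(b"dir\xffname", b"file") keeps the invalid byte untouched and
never has to decode it.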
Solutions
=========

 - producer: unicode => *only* unicode // bytes => bytes
 - manipulation: support both unicode and bytes, but avoid (when
   possible) mixing bytes and characters
 - open(): allow bytes

I implemented these solutions as a patch set attached to issue #3187:
 * posix_path_bytes.patch: fix posixpath.join() to support bytes
 * io_byte_filename.patch: open() allows bytes filenames
 * fnmatch_bytes.patch: patch fnmatch.filter() to accept bytes filenames
 * glob1_bytes.patch: fix glob.glob() to accept invalid directory names

Mmmh, there is no patch yet to make os.listdir() stop on an invalid
filename.

Priority
========

I think that the problem is important because it's a regression from 2.5
to 2.6/3.0. Python 2.5 uses bytes filenames, so it was possible to
open/unlink "invalid" unicode strings (since they are not unicode but
bytes).

Well, if it's too late for the final versions, this problem should at
least be fixed quickly.

Test the problem
================

Example to create invalid filenames on Linux:

    $ mkdir /tmp/test
    $ cd /tmp/test
    $ touch $(echo -e "a\xffb")
    $ mkdir $(echo -e "dir\xffname")
    $ touch $(echo -e "dir\xffname/file")
    $ find .
    ./a?b
    ./dir?name
    ./dir?name/file

Python 2.5:

    >>> import os
    >>> import shutil
    >>> os.listdir('.')
    ['a\xffb', 'dir\xffname']
    >>> open(os.listdir('.')[0]).close()   # open file: ok
    >>> os.unlink(os.listdir('.')[0])      # remove file: ok
    >>> os.listdir('.')
    ['dir\xffname']
    >>> shutil.rmtree(os.listdir('.')[0])  # remove dir: ok

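For comparison, a minimal sketch of the same round trip with explicit
bytes paths, which is the "bytes => bytes" behaviour the proposed
solutions aim at. It assumes an interpreter where os.listdir(),
os.path.join(), open() and shutil.rmtree() all accept bytes paths, and a
POSIX filesystem that stores arbitrary bytes in filenames (such as ext3).
The function name bytes_round_trip is hypothetical.

```python
import os
import shutil
import tempfile


def bytes_round_trip():
    # Work in a throwaway directory; its path is converted to bytes so
    # that every later call sees bytes only.
    root = os.fsencode(tempfile.mkdtemp())
    try:
        # Same names as in the shell example: 0xFF is never valid UTF-8.
        open(os.path.join(root, b"a\xffb"), "wb").close()
        os.mkdir(os.path.join(root, b"dir\xffname"))
        open(os.path.join(root, b"dir\xffname", b"file"), "wb").close()

        # bytes in => bytes out: no decoding happens at any point.
        names = sorted(os.listdir(root))
        open(os.path.join(root, names[0]), "rb").close()  # open file
        os.unlink(os.path.join(root, names[0]))           # remove file
        shutil.rmtree(os.path.join(root, names[1]))       # remove dir
        return names, os.listdir(root)
    finally:
        # Clean up the temporary directory whatever happened above.
        shutil.rmtree(root, ignore_errors=True)
```

Because the names are never converted to unicode, the invalid byte is
passed back to the filesystem unchanged and every call refers to a file
that really exists.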
Wrong solutions
===============

New type
--------

I proposed an ugly type "InvalidFilename" mixing bytes and characters. As
everybody using unicode knows, it's a bad idea :-) (and it introduces a
new type).

Convert bytes to unicode (replace)
----------------------------------

    unicode_filename = unicode(bytes_filename, charset, "replace")

Ok, you will get valid unicode strings which can be used in
os.path.join() & friends, but open() or unlink() will fail because this
filename doesn't exist!

-- 
Victor Stinner aka haypo
http://www.haypocalc.com/blog/