[Python-Dev] Filename as byte string in python 2.6 or 3.0?

Victor Stinner victor.stinner at haypocalc.com
Sat Sep 27 14:04:25 CEST 2008


I read that Python 2.6 is planned for Wednesday. One bug is still open and 
important to me: Python 2.6/3.0 are unable to use byte strings as filenames.

The problem

On Windows, all filenames are unicode strings (I guess UTF-16-LE), but on UNIX, 
for historical reasons, filenames are byte strings. On Linux, you can expect 
valid UTF-8 filenames, but sometimes (e.g. after copying from a FAT32 USB key 
to an ext3 filesystem) you get an invalid filename: a byte string in a 
different charset than your default filesystem encoding (UTF-8).
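
To make "invalid filename" concrete, here is a minimal sketch for a current 
Python 3 interpreter. The helper name is_valid_utf8() is made up for this 
illustration, and Latin-1 is only an assumed example of "a different charset":

```python
def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if raw_name decodes cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name, stored in two different charsets.
utf8_name = "caf\u00e9.txt".encode("utf-8")      # e-acute is two bytes
latin1_name = "caf\u00e9.txt".encode("latin-1")  # e-acute is the byte 0xE9

print(is_valid_utf8(utf8_name))    # True
print(is_valid_utf8(latin1_name))  # False: 0xE9 followed by "." is not UTF-8
```

Both byte strings are perfectly good filenames for the kernel; only the second 
one cannot be converted to unicode with the UTF-8 filesystem encoding.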

Python functions using filenames

In Python, you have (incomplete list):
 - filename producer: os.listdir(), os.walk(), glob.glob()
 - filename manipulation: os.path.*()
 - access file: open(), os.unlink(), shutil.rmtree()

If you give unicode to a producer, it returns unicode _or_ byte strings (the 
type may change for each filename :-/). Guido proposed to break this behaviour: 
raise an exception if the unicode conversion fails. We may consider an option 
like "skip invalid".
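
The two policies (raise on failure, or "skip invalid") can be sketched in pure 
Python. This is only an illustration of the idea, not the os.listdir() 
implementation; decode_names() and its skip_invalid parameter are hypothetical 
names:

```python
def decode_names(raw_names, encoding="utf-8", skip_invalid=False):
    """Decode raw directory entries; raise or skip on undecodable names."""
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            if not skip_invalid:
                raise  # strict policy: the caller sees the failure
            # "skip invalid" policy: drop the entry silently
    return decoded


entries = [b"ok.txt", b"a\xffb"]

print(decode_names(entries, skip_invalid=True))  # ['ok.txt']

try:
    decode_names(entries)
except UnicodeDecodeError as exc:
    print("rejected:", exc.object)
```

Either way the result has a single type, which is the point of the proposal. 
The price of "skip invalid" is that some files become invisible to the caller.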

If you give bytes to a producer, it only returns byte strings. Great.

Filename manipulation: in Python 2.6/3.0, os.path.*() is not compatible with 
the type "bytes". So you cannot use os.path.join(<your unicode path>, <bytes 
filename>) *nor* os.path.join(<your bytes path>, <bytes filename>), because 
os.path.join() (e.g. the posix version) uses path.endswith('/').
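
A minimal sketch of a type-aware join, written for Python 3. It is not the 
code from any patch, and join_same_type() is a hypothetical name. The idea is 
to pick the separator from the type of the first argument, so that the pure 
unicode and pure bytes cases both work while mixing the two still fails:

```python
def join_same_type(base, *parts):
    """Join POSIX path components that are all str or all bytes."""
    sep = b"/" if isinstance(base, bytes) else "/"
    path = base
    for part in parts:
        if part.startswith(sep):
            path = part          # an absolute component restarts the path
        elif not path or path.endswith(sep):
            path += part
        else:
            path += sep + part
    return path


# The hard-coded str separator is what breaks on bytes in Python 3:
try:
    b"/tmp/test".endswith("/")
except TypeError as exc:
    print("bytes vs str:", exc)

print(join_same_type("/tmp/test", "a.txt"))
print(join_same_type(b"/tmp/test", b"a\xffb"))
```

Passing one str and one bytes argument raises TypeError in startswith(), which 
matches the goal of not mixing bytes and characters.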

Access file: open() rejects the type bytes (it's just a type check: open() 
supports bytes if you remove the check). As far as I remember, unlink() is 
compatible with bytes. But rmtree() fails because it uses os.path.join() (even 
if you give it a bytes directory, join() fails).


Solutions

 - producer: unicode => *only* unicode // bytes => bytes
 - manipulation: support both unicode and bytes, but avoid mixing bytes and
   characters when possible
 - open(): allow bytes
 - open(): allow bytes

I implemented these solutions as a patch set attached to issue #3187:
 * posix_path_bytes.patch: fix posixpath.join() to support bytes
 * io_byte_filename.patch: open() allows bytes filename
 * fnmatch_bytes.patch: patch fnmatch.filter() to accept bytes filenames
 * glob1_bytes.patch: fix glob.glob() to accept invalid directory name

Mmmh, there is no patch yet to make os.listdir() stop on an invalid filename.


I think that the problem is important because it's a regression from 2.5 to 
2.6/3.0. Python 2.5 uses bytes filenames, so it was possible to 
open/unlink "invalid" unicode strings (since they are not unicode but bytes).

Well, if it's too late for the final versions, this problem should at least be 
fixed quickly.

Test the problem

Example to create invalid filenames on Linux:

$ mkdir /tmp/test
$ cd /tmp/test
$ touch $(echo -e "a\xffb")
$ mkdir $(echo -e "dir\xffname")
$ touch $(echo -e "dir\xffname/file")
$ find
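
The same tree can be built from Python with bytes paths. This is a sketch that 
assumes a current Python 3 interpreter (where open(), os.mkdir(), 
os.path.join() and os.listdir() accept bytes) and a Linux filesystem that 
stores arbitrary bytes in names. create_test_tree() is a hypothetical helper, 
and a temporary directory is used instead of /tmp/test:

```python
import os
import tempfile


def create_test_tree(base: bytes) -> list:
    """Create the same entries as the shell commands, under `base`."""
    file_path = os.path.join(base, b"a\xffb")
    dir_path = os.path.join(base, b"dir\xffname")
    with open(file_path, "wb"):
        pass                         # like `touch`
    os.mkdir(dir_path)
    with open(os.path.join(dir_path, b"file"), "wb"):
        pass
    return sorted(os.listdir(base))  # bytes in, bytes out


with tempfile.TemporaryDirectory() as tmp:
    try:
        print(create_test_tree(os.fsencode(tmp)))
    except (OSError, UnicodeError) as exc:
        # e.g. a platform or filesystem that insists on valid unicode names
        print("could not create the names:", exc)
```

Because the directory is listed with a bytes argument, every returned name is 
a bytes object and can be passed straight back to open() or os.unlink().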

Python 2.5:
>>> import os
>>> import shutil
>>> os.listdir('.')
['a\xffb', 'dir\xffname']
>>> open(os.listdir('.')[0]).close()  # open file: ok
>>> os.unlink(os.listdir('.')[0])     # remove file: ok
>>> os.listdir('.')
['dir\xffname']
>>> shutil.rmtree(os.listdir('.')[0])  # remove dir: ok

Wrong solutions

New type

I proposed an ugly type "InvalidFilename" mixing bytes and characters. As 
everybody using unicode knows, it's a bad idea :-) (and it introduces a new 

Convert bytes to unicode (replace)

unicode_filename = unicode(bytes_filename, charset, "replace")

OK, you will get valid unicode strings which can be used in os.path.join() & 
friends, but open() or unlink() will fail because this filename doesn't
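
A short sketch of why the "replace" conversion is lossy, using the Python 3 
spelling bytes.decode(charset, "replace") of the call above. The helper name 
lossy_roundtrip() is made up for this illustration:

```python
def lossy_roundtrip(raw_name: bytes, charset: str = "utf-8") -> bytes:
    """Decode with "replace", then encode again with the same charset."""
    text = raw_name.decode(charset, "replace")
    return text.encode(charset)


original = b"a\xffb"                   # the name as stored on disk
back = lossy_roundtrip(original)       # what the OS would be asked for

print(repr(original.decode("utf-8", "replace")))  # 'a\ufffdb'
print(back)                                       # b'a\xef\xbf\xbdb'
print(back == original)                           # False
```

The invalid byte 0xFF becomes U+FFFD, and nothing in the resulting string 
records which byte was replaced, so the original name cannot be recovered.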

Victor Stinner aka haypo
