os.listdir, gets unicode, returns unicode... USUALLY?!?!?

gabor gabor at nekomancer.net
Thu Nov 16 22:05:19 CET 2006


from the documentation (http://docs.python.org/lib/os-file-dir.html) for 
os.listdir:

"On Windows NT/2k/XP and Unix, if path is a Unicode object, the result 
will be a list of Unicode objects."

i'm on Unix. (linux, ubuntu edgy)

so it seems that it does not always return unicode filenames.

it seems that it tries to interpret the filenames using the filesystem's 
encoding, and if that fails, it simply returns the filename as byte-string.

so you get back let's say an array of 21 filenames, from which 3 are 
byte-strings, and the rest unicode strings.
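a minimal sketch of how such a mixed result can be detected. the helper name split_by_type and the sample list are made up for illustration, they are not taken from a real directory:

```python
def split_by_type(names):
    """Split a listdir-style result into (unicode_names, byte_names)."""
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    decoded = [n for n in names if isinstance(n, text_type)]
    undecoded = [n for n in names if not isinstance(n, text_type)]
    return decoded, undecoded

# hypothetical mixed result, shaped like what os.listdir(u".") may hand back
sample = [u"notes.txt", b"caf\xe9.txt", u"readme"]
decoded, undecoded = split_by_type(sample)
```

anything that ends up in the second list is a filename that could not be decoded.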

after digging around, i found this in the source code:

>                 if (arg_is_unicode) {
>                         PyObject *w;
>                         w = PyUnicode_FromEncodedObject(v,
>                                         Py_FileSystemDefaultEncoding,
>                                         "strict");
>                         if (w != NULL) {
>                                 Py_DECREF(v);
>                                 v = w;
>                         }
>                         else {
>                                 /* fall back to the original byte string, as
>                                    discussed in patch #683592 */
>                                 PyErr_Clear();
>                         }
>                 }
> #endif

so if the to-unicode-conversion fails, it falls back to the original 
byte-string. i went and read the patch discussion.
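the same logic, restated in python as i understand it from the C excerpt above. this is only a sketch: decode_or_fallback is a hypothetical name, and the encoding is passed in explicitly instead of being read from Py_FileSystemDefaultEncoding:

```python
def decode_or_fallback(raw, encoding):
    """Strictly decode a byte-string filename; on failure return it unchanged."""
    try:
        return raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        # mirrors the PyErr_Clear() branch: the error is swallowed
        # and the caller silently gets the original bytes back
        return raw

ok = decode_or_fallback(b"abc.txt", "utf-8")       # decodes fine
bad = decode_or_fallback(b"caf\xe9.txt", "utf-8")  # not valid utf-8
```

so the caller can not tell from the call itself whether decoding worked, only from the type of each returned item.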

and now i'm not sure what to do.
i know that:

1. the documentation is completely wrong. it does not always return 
unicode filenames
2. it's true that the documentation does not specify what happens if the 
filename is not in the filesystem-encoding, but i simply expected that i 
get a Unicode-exception, as everywhere else. you see, exceptions are 
ok, i can deal with them. but this is just plain wrong. from now on, 
EVERYWHERE where i use os.listdir, i will have to go through all the 
filenames in it, and check if they are unicode-strings or not.
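one possible shape for that per-filename check, as a sketch only. the names to_unicode and normalize_names are hypothetical, the "replace" error handler is my assumption, and the builtin name bytes needs python 2.6 or newer:

```python
import sys

def to_unicode(name, encoding=None, errors="replace"):
    """Return name as unicode, decoding it only if it is still a byte string."""
    if isinstance(name, bytes):
        enc = encoding or sys.getfilesystemencoding() or "utf-8"
        return name.decode(enc, errors)
    return name

def normalize_names(names, encoding=None):
    """Apply to_unicode to every entry of a listdir-style result."""
    return [to_unicode(n, encoding) for n in names]

# hypothetical mixed input; the second name is not valid utf-8
cleaned = normalize_names([u"a.txt", b"caf\xe9"], "utf-8")
```

note that with "replace" the undecodable bytes are lost, so the cleaned name is good for display but can not be used to open the file. for that you have to keep the original byte string around.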

so basically i'd like to ask here: am i reading something incorrectly? 
or am i using os.listdir the "wrong way"? how do other people deal with 
this?

p.s: one additional note. if your code expects os.listdir to return 
unicode, that usually means that all your code uses unicode strings. 
which in turn means, that those filenames will somehow later interact 
with unicode strings. which means that that byte-string-filename will 
probably get auto-converted to unicode at a later point, and that 
auto-conversion will VERY probably fail, because the auto-convert only 
happens using 'ascii' as the encoding, and if it was not possible to 
decode the filename inside listdir, it's quite probable that it also 
will not work using 'ascii' as the charset.
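to illustrate the point: the implicit conversion behaves like an explicit decode with 'ascii', so it can be simulated that way. this is a sketch with a made-up helper name and made-up filenames (on python 3 mixing the two types raises TypeError instead, so the explicit decode is used here):

```python
def ascii_decode_fails(raw):
    """Return True if the byte string can not be decoded as ascii."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False

# a name that listdir could not decode will usually contain non-ascii bytes
fails_for_odd_name = ascii_decode_fails(b"caf\xe9.txt")
fails_for_plain_name = ascii_decode_fails(b"plain.txt")
```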

