os.listdir, gets unicode, returns unicode... USUALLY?!?!?

"Martin v. Löwis" martin at v.loewis.de
Mon Nov 20 18:13:26 EST 2006


Ross Ridge wrote:
> Ross Ridge wrote:
>> That would conflict with private use characters appearing in file
>> names.
> 
> Martin v. Löwis wrote:
>> Not necessarily: they could get escaped.
> 
> How?

Suppose I use U+E001..U+E0FF as the PUA characters for unencodable
bytes; U+E000 wouldn't be needed since \0 cannot be part of
a file name in POSIX.

Then I would use U+E000 for escaping. Each PUA character in the
listed file name would get escaped with U+E000 in the Python
string; when the file name is converted back to the system, it
gets unescaped.
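
The round trip described above could be sketched in Python 3 roughly as
follows. This is only an illustration under assumptions: the file system
encoding is taken to be UTF-8, the helper names decode_name and
encode_name are made up for this example, and the standard
"surrogateescape" error handler is used merely as a convenient way to
locate the undecodable bytes before remapping them into the PUA range.

```python
ESC = "\ue000"  # escape marker; free because byte 0 cannot occur in a POSIX file name


def decode_name(raw: bytes) -> str:
    """Turn file-name bytes into a string, keeping undecodable bytes recoverable."""
    # surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undecodable byte: remap it to U+E000 + byte value.
            out.append(chr(0xE000 + (cp - 0xDC00)))
        elif 0xE000 <= cp <= 0xE0FF:
            # Genuine PUA character in the name: escape it with U+E000.
            out.append(ESC + ch)
        else:
            out.append(ch)
    return "".join(out)


def encode_name(name: str) -> bytes:
    """Inverse of decode_name: unescape and restore the original bytes."""
    out = bytearray()
    chars = iter(name)
    for ch in chars:
        cp = ord(ch)
        if ch == ESC:
            literal = next(chars, None)
            if literal is None:
                raise ValueError("dangling escape character")
            out += literal.encode("utf-8")
        elif 0xE001 <= cp <= 0xE0FF:
            out.append(cp - 0xE000)  # bare PUA character stands for a raw byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With these helpers, decode_name(b"ok\xff") yields "ok\ue0ff", a name
that really contains U+E001 comes back with U+E000 in front of it, and
encode_name() restores the original bytes in both cases.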

Notice that I think this is a really unrealistic case - I expect
that all file names containing PUA characters were deliberately
crafted to investigate using PUA characters in file names.

>> AFAICT, you can have that conflict only if the file system encoding
>> is UTF-8: otherwise, there is no way to represent them.
> 
> They can also appear in UTF-16 filenames (obviously) and various Far-East
> multi-byte encodings.

No: UTF-16 file names cannot occur in POSIX, as this is not a null-byte
free encoding. What Far-East multi-byte encoding uses PUA characters,
and for what characters?
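
To illustrate the null-byte point, a quick check in Python 3 (the sample
name "abc" is arbitrary): the UTF-16 form of even a plain ASCII name
contains zero bytes, while the UTF-8 form does not.

```python
name = "abc"
utf16 = name.encode("utf-16-le")  # two bytes per ASCII character
utf8 = name.encode("utf-8")       # one byte per ASCII character

print(utf16)             # b'a\x00b\x00c\x00'
print(b"\x00" in utf16)  # True
print(b"\x00" in utf8)   # False
```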

> No, I just expect that if the underlying file system API does accept a
> given byte or Unicode string that I could pass the same string to
> open() and stat(), etc., and have it work.

On no operating system I'm aware of can you pass "Unicode strings" to
open() or stat(). You always have to encode the file name as bytes
before passing it to open() or stat(), because that's what POSIX specifies.

> Should I assume that since you think that having "os.listdir()" return
> Unicode strings when passed a Unicode directory name is a good idea,
> that you also think that file object methods (eg. readline) should
> return Unicode strings when opened with a Unicode filename?

No, not at all. How file names are interpreted is entirely independent
of how file content is interpreted.
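
A small sketch of that independence, as it looks in current Python 3
(the file name "data.bin" and its contents are made up for the example):
the type of the names returned by os.listdir() follows the type of the
directory argument, while the file content read in binary mode is bytes
either way.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "data.bin")
    with open(path, "wb") as f:
        f.write(b"\x00\xff")  # not valid text in any common encoding

    # Text directory name in, text file names out.
    names_as_text = os.listdir(directory)
    # Byte directory name in, byte file names out.
    names_as_bytes = os.listdir(os.fsencode(directory))

    # The content is independent of how the name was spelled.
    with open(path, "rb") as f:
        content = f.read()

print(names_as_text, names_as_bytes, content)
```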

Many people believe file names are character strings, and use them
as such in everyday life. OTOH, many people are aware that the
contents of a file aren't necessarily plain text - most people
are familiar with PDF and executable files.

> On Windows you can use GetVolumeInformation(), though it may be more
> practical to assume Unicode or byte strings based on the OS.  On Unix
> you'd assume byte strings.

On Windows, the entire issue doesn't exist: We don't use open() or
stat() on Windows. If we have a Unicode file name on Windows, we
use the system's Unicode API.

>> Does OSX use Unicode (it requires path names to be UTF-8)?
> 
> HFS+ uses Unicode.  I have no idea how you'd figure out the properties
> of a filesystem under OS/X, but then the Python docs suggest this
> os.listdir() Unicode feature doesn't work on Macintosh systems anyways.

Either the docs are wrong, or you are misinterpreting them. It works
just fine in practice.

> That's the problem here, there's no
> encoding associated with Unix filenames, they're just byte strings.

Can you please quote chapter and verse of the POSIX spec that says
so? I believe POSIX specifies the exact opposite:

http://www.opengroup.org/onlinepubs/007908799/xbd/glossary.html#tag_004_000_114

says

# A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
# characters composing the name may be selected from the set of all
# character values excluding the slash character and the null byte. The
# filenames dot and dot-dot have special meaning; see pathname
# resolution. A filename is sometimes referred to as a pathname
# component.
#
# Filenames should be constructed from the portable filename character
# set because the use of other characters can be confusing or ambiguous
# in certain contexts. (For instance, the use of a colon (:) in a
# pathname could cause ambiguity if that pathname were included in a
# PATH definition.)

So they are not "just byte strings"; they must come from the set of all
characters. "character" is defined as "A sequence of one or more bytes
representing a single graphic symbol or control code."
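
The rules in the quoted definition can be written down as a small check.
This is a sketch, not an authoritative implementation: the function
names are invented here, and NAME_MAX = 255 is an assumed value (the
real limit depends on the file system). The portable filename character
set is the letters A-Z and a-z, the digits 0-9, period, underscore and
hyphen.

```python
import string

NAME_MAX = 255  # assumption: a common value; the real limit is per file system

# Byte values of the portable filename character set.
PORTABLE = set((string.ascii_letters + string.digits + "._-").encode("ascii"))


def is_valid_filename(raw: bytes, name_max: int = NAME_MAX) -> bool:
    """1 to NAME_MAX bytes, with no slash and no null byte."""
    return 1 <= len(raw) <= name_max and b"/" not in raw and b"\x00" not in raw


def is_portable_filename(raw: bytes) -> bool:
    """Valid, and built only from the portable filename character set."""
    return is_valid_filename(raw) and all(byte in PORTABLE for byte in raw)
```

Note that the length limit counts bytes, not characters: a name of 128
"ö" characters is 256 bytes in UTF-8 and would fail the check above.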

> Since
> Python byte strings also have no encoding associated with them they're
> the natural way of representing all valid file names on Unix systems.

And still, people want to render file names in a user interface to
the user.

Regards,
Martin


