[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?

Fri Oct 3 23:23:48 CEST 2008

2008/10/3 Glenn Linderman <v+python at g.nevcal.com>:

> My understanding of the Posix file names is that any byte values are valid
> except "/" and null.  Is this a correct understanding?

Yes (well, names "." and ".." are reserved, and there might be length
restrictions).

> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream.  Call the original byte stream FOO.  The
> transformation then produces FOOTR, a set of Unicode code points.  Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8.  How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ?

In the unpaired surrogate scheme: there is no FOOTRUTF8 because UTF-8
can encode only Unicode scalar values (which exclude surrogates).
Python strings can contain surrogates (in 4-byte builds) or unpaired
surrogates, which are malformed UTF-16 (in 2-byte builds). In the
filename context they can't be represented in UTF-8, so they must mean
escaped bytes.
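
A minimal sketch of that argument. It assumes the "surrogateescape"
error handler available in later Python 3 releases, which behaves like
the unpaired surrogate scheme; the handler and the sample filename are
not part of this thread.

```python
# FOO: a Latin-1 encoded name, which is not valid UTF-8.
foo = b"caf\xe9.txt"

# FOOTR: the undecodable byte 0xE9 is escaped as the lone surrogate U+DCE9.
footr = foo.decode("utf-8", "surrogateescape")

# Strict UTF-8 refuses to encode a surrogate, so FOOTRUTF8 does not exist.
try:
    footr.encode("utf-8")
    has_footrutf8 = True
except UnicodeEncodeError:
    has_footrutf8 = False

# Reversing the escape recovers FOO exactly.
roundtrip = footr.encode("utf-8", "surrogateescape")
```

Since the only byte string that maps to FOOTR is FOO, the ambiguity
asked about above cannot arise.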

In the U+0000 scheme: FOOTRUTF8 contains a 0 byte, which no POSIX
filename can contain, so the filename must mean FOO.
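
A rough sketch of one possible U+0000 scheme. The exact escape format is
an assumption made for illustration: each undecodable byte b becomes
U+0000 followed by chr(b). The names "nulescape", _nul_escape and
unescape are hypothetical.

```python
import codecs

def _nul_escape(err):
    # Hypothetical handler: escape each undecodable byte b as U+0000 + chr(b).
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("\x00" + chr(b) for b in bad), err.end

codecs.register_error("nulescape", _nul_escape)

def unescape(name):
    # Reverse the escape: U+0000 + chr(b) goes back to the raw byte b.
    out = bytearray()
    chars = iter(name)
    for ch in chars:
        if ch == "\x00":
            out.append(ord(next(chars)))
        else:
            out += ch.encode("utf-8")
    return bytes(out)

foo = b"caf\xe9.txt"                      # FOO, not valid UTF-8
footr = foo.decode("utf-8", "nulescape")  # FOOTR, contains U+0000
footrutf8 = footr.encode("utf-8")         # FOOTRUTF8, contains a 0 byte
```

Here footrutf8 contains a 0 byte and so cannot be a real filename, which
leaves FOO as the only name FOOTR can stand for.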

> but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place...

What do you mean by "not translating"? If a piece of software
validates filenames while they are represented by Unicode strings,
then they must have been somehow translated from byte strings (on
POSIX) or UTF-16-assumed-but-not-guaranteed strings (on Windows).

-- 
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/