[Python-3000] [Python-Dev] Filename as byte string in python 2.6 or 3.0?
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Fri Oct 3 23:23:48 CEST 2008
2008/10/3 Glenn Linderman <v+python at g.nevcal.com>:
> My understanding of the Posix file names is that any byte values are valid
> except "/" and null. Is this a correct understanding?
Yes (well, names "." and ".." are reserved, and there might be length
restrictions).
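
As a rough illustration of that rule, here is a minimal sketch of a
check on a single path component. The function name, the 255-byte
default for name_max (a common NAME_MAX value; the real limit is
filesystem-specific) and the rejection of the empty name are
assumptions for the example, not something POSIX fixes in this form.

```python
def is_valid_posix_component(name: bytes, name_max: int = 255) -> bool:
    """Return True if `name` could be one POSIX path component.

    Hypothetical helper: `name_max` is an assumed limit, and the empty
    name is rejected as an extra assumption of this sketch.
    """
    # "." and ".." are reserved; the empty name is not a usable component.
    if name in (b"", b".", b".."):
        return False
    # The only byte values that may never appear: "/" and NUL.
    if b"/" in name or b"\x00" in name:
        return False
    # Length restrictions depend on the filesystem.
    return len(name) <= name_max


# Bytes that are not valid UTF-8 are still a perfectly legal filename.
latin1_name_ok = is_valid_posix_component(b"caf\xe9.txt")
```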
> The UTF-8b proposal seems to translate from a non-UTF-8 byte stream to a
> Unicode character stream. Call the original byte stream FOO. The
> transformation then produces FOOTR, a set of Unicode code points. Now FOOTR
> has a representation in UTF-8, which is a byte stream, call that byte stream
> FOOTRUTF8. How, by looking at FOOTR, do you know whether it represents the
> file name FOO or FOOTRUTF8 ?
In the unpaired surrogate scheme: there is no FOOTRUTF8 because UTF-8
can encode only Unicode scalar values (which exclude surrogates).
Python strings can nevertheless contain surrogates (in 4-byte builds)
or unpaired surrogates, which are malformed UTF-16 (in 2-byte builds).
In the filename context these can't be represented in UTF-8, so they
must mean escaped bytes.
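
The unpaired surrogate scheme can be tried out in later Python
versions. The "surrogateescape" error handler (PEP 383, Python 3.1 and
up) postdates this thread, but it works along these lines: each
undecodable byte 0x80..0xFF becomes a lone surrogate U+DC80..U+DCFF.
The sample filename below is made up for the example.

```python
# FOO: a Latin-1 encoded name, which is not valid UTF-8.
raw = b"caf\xe9.txt"

# FOOTR: the undecodable byte 0xE9 is escaped as the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# The escape is reversible, so FOOTR maps back to exactly FOO.
roundtrip = name.encode("utf-8", "surrogateescape")

# There is no FOOTRUTF8: strict UTF-8 refuses to encode a surrogate.
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    strict_utf8_possible = False
else:
    strict_utf8_possible = True
```

Since the strict encoding fails, a string containing such a surrogate
can only stand for the escaped byte string.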
In the U+0000 scheme: FOOTRUTF8 contains a 0 byte, which no POSIX
filename can contain, so the filename must mean FOO.
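
A minimal sketch of how a U+0000 scheme could look, under loud
assumptions: the exact escape format is not spelled out in this
message, so the layout used here (U+0000 followed by a character whose
code point equals the offending byte), the handler name "nul_escape"
and the helper encode_nul_escape are all hypothetical.

```python
import codecs


def _nul_escape(exc):
    """Hypothetical decode error handler: escape each bad byte as U+0000 + chr(byte)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("\x00" + chr(b) for b in bad), exc.end


codecs.register_error("nul_escape", _nul_escape)


def encode_nul_escape(name: str) -> bytes:
    """Inverse of the handler above: U+0000 + chr(b) becomes the raw byte b."""
    out = bytearray()
    chars = iter(name)
    for ch in chars:
        if ch == "\x00":
            out.append(ord(next(chars)))  # the escaped byte itself
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9.txt"                       # FOO
name = raw.decode("utf-8", "nul_escape")   # FOOTR
as_utf8 = name.encode("utf-8")             # FOOTRUTF8
```

Here as_utf8 contains a 0 byte and so cannot be a real filename, while
encode_nul_escape(name) gives back the original bytes.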
> but if it
> introduces null characters into the translated "file name", then there is
> file name parsing software that it will be incompatible with, which may be
> as problematic as not translating the file names in the first place...
What do you mean by "not translating"? If a piece of software
validates filenames while they are represented by Unicode strings,
then they must have been somehow translated from byte strings (on
POSIX) or UTF-16-assumed-but-not-guaranteed strings (on Windows).
--
Marcin Kowalczyk
qrczak at knm.org.pl
http://qrnik.knm.org.pl/~qrczak/