unicode filenames

Alex Martelli aleax at aleax.it
Mon Feb 3 03:48:24 EST 2003


Andrew Bennetts wrote:

> On Mon, Feb 03, 2003 at 07:24:52AM +0000, Alex Martelli wrote:
>> Erik Max Francis wrote:
>>    ...
>> > It means that filenames are strings of bytes.  What the meaning of those
>> > bytes are is entirely application dependent.  They could be raw ASCII
>> 
>> ALMOST entirely -- for example, none of the bytes is allowed to have
>> the value 47 (since that is the code for "slash" in ASCII).
> 
> I believe slash and the null byte are the only disallowed characters in
> unix path names.

Bytes, not characters, IF you accept Erik's claim that the
application can freely decide on the encoding -- and there's
the rub, at least as far as I can see -- in the multibyte
encodings I know of, forbidding those two byte values (in
particular 47, i.e. 0x2F) ends up forbidding a LOT of
_characters_ -- because many characters may happen to need
a BYTE of value 0x2F as a part of their representation in
such an encoding.  Please see my other response to Erik in
this thread for a detailed explanation of where I see the
problem coming up, with reference to this.
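
To make the byte-versus-character point concrete, here is a minimal
sketch in present-day Python 3 (not something from the original
thread).  The helper name forbidden_bytes and the sample character
are purely illustrative assumptions; the code just encodes a string
and reports whether a byte of value 0x00 or 0x2F turns up in the
result.  UTF-8 is included only for comparison.

```python
# Byte values that a Unix path component may not contain.
NUL = 0x00
SLASH = 0x2F

def forbidden_bytes(text, encoding):
    """Return the sorted forbidden byte values found in text once encoded."""
    data = text.encode(encoding)
    return sorted({b for b in data if b in (NUL, SLASH)})

# U+012F (LATIN SMALL LETTER I WITH OGONEK) is one character, not a slash.
ch = "\u012f"

print(ch.encode("utf-16-le"))            # b'/\x01' : one byte is 0x2F
print(forbidden_bytes(ch, "utf-16-le"))  # [47]
print(forbidden_bytes(ch, "utf-8"))      # []
```

Whether a given character trips over byte 47 therefore depends on
the encoding chosen, which is exactly why "bytes" and "characters"
cannot be used interchangeably here.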

If the system is not aware of any distinction between
bytes and characters in a filename (or other string that
is somehow relevant to the system), and in particular is
unaware of Unicode, then it appears to me that similar
limitations would always emerge regarding arbitrary use
of multi-byte encodings.  UTF-16 will in particular be
unusable if bytes with a value of 0 are prohibited.  Most
others, I believe, WOULD be usable if that was the only
prohibition (as they're carefully designed around the
"null byte problem", so to speak) -- but the further
prohibition of value 47 (0x2F) seems to be a killer from
my point of view.
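
The null-byte half of this can be checked the same way.  A rough
sketch, again in present-day Python 3, with a made-up helper name
(contains_nul) and made-up sample filenames; the codec names are
ones that ship with Python:

```python
def contains_nul(name, encoding):
    """True if encoding name yields at least one byte of value 0."""
    return 0 in name.encode(encoding)

# One ASCII-only name and one with Japanese text (hypothetical samples).
samples = ["readme.txt", "\u65e5\u672c\u8a9e.txt"]

for enc in ("utf-16-le", "utf-8", "euc_jp", "shift_jis"):
    print(enc, [contains_nul(s, enc) for s in samples])
```

For these samples only the UTF-16 line reports True, since every
ASCII character in the name contributes a zero byte there.  The
sketch deliberately checks only for byte 0; it says nothing about
byte 47.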

If I _am_ missing something "obvious and apparent", as
it would seem from Erik's response, I would definitely
appreciate being helped to understand it.  Otherwise, I
will operate on the working hypothesis that some people
do not understand the difference between "character"
and "byte" in the context of multi-byte encoding, and
that their claims that something is "apparent" and/or
"obvious" are therefore of somewhat dubious validity.


Alex




