unicode filenames

Mon Feb 3 03:39:39 EST 2003

Erik Max Francis wrote:

> Alex Martelli wrote:
> 
>> ALMOST entirely -- for example, none of the bytes is allowed to have
>> the value 47 (since that is the code for "slash" in ASCII).
> 
> I thought we would all be reasonable enough to implicitly understand
> that was a condition.  I thought of explicitly mentioning it, but
> thought it too obvious.  Just goes to show.

It appears to me that you may not fully understand the
implications of this -- or, am _I_ missing something...?

>> As long as the encoding never needs to use a byte whose value is
>> 47.  I think that rules out UTF-8 and most other popular
>> multi-byte encodings, doesn't it?
> 
> UTF-8 not including a slash.  Or UTF-16 not including a slash.  Or
> Latin-1 not including a slash.  And so on.

Latin-1 is not a multi-byte encoding, so, "no byte with a value
of 47" does indeed equate to "not including a slash".  But it
appears to me that you may be generalizing unduly: for a multi-byte
encoding, "no byte with a value of 47" is a MUCH stricter
condition than "not including a slash".

For example, the Lithuanian character "Latin small letter i with
ogonek", in Unicode, is represented by code 012F.  The Livonian
"Latin small letter o with dot above", by code 022F.  And so on.

So, for example, in UTF-16, a filename containing the former would 
have to include a byte of value 47 (0x2F), either right before
or right after a byte of value 0x01, depending on endianness.
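
A minimal sketch of that byte layout, using Python's standard "utf-16-be" and "utf-16-le" codecs (chosen here only so that no byte-order mark is prepended; the variable names are just for illustration):

```python
# U+012F: LATIN SMALL LETTER I WITH OGONEK
ch = "\u012f"

# Encode in both byte orders and look for the byte value 47 (0x2F).
encoded = {codec: ch.encode(codec) for codec in ("utf-16-be", "utf-16-le")}

for codec, raw in encoded.items():
    print(codec, raw.hex(), 0x2F in raw)
# utf-16-be 012f True
# utf-16-le 2f01 True
```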

Therefore, on a Unix system that is not specifically and
explicitly Unicode-aware, you could use filenames with UTF-16
encoding only if they didn't include ANY of: slash (obvious);
the said Lithuanian and Livonian characters (you _might_
perhaps use combinations instead -- 0069 0328 as equivalent
to 012F, for example -- but, isn't 00 the OTHER byte value you
are NOT allowed to use, besides 0x2F...?!-); the Cyrillic
capital letter Ya (042F, no combining equivalents that I
know of); the Arabic letter Dal (062F, no combining
equivalents either); and so on, and so forth.
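
As a rough illustration of the restriction just described, the sketch below flags the characters of a candidate filename whose encoded form contains a byte that a byte-oriented Unix kernel treats specially. The helper name offending_chars and the FORBIDDEN set are hypothetical, introduced only for this example:

```python
# Bytes a byte-oriented Unix kernel reserves in a path component:
# NUL (string terminator) and 0x2F (the "/" separator).
FORBIDDEN = {0x00, 0x2F}

def offending_chars(name, encoding):
    """Return the characters of name whose encoding contains a forbidden byte."""
    return [ch for ch in name if FORBIDDEN & set(ch.encode(encoding))]

# Cyrillic capital Ya + Arabic Dal: both carry a 0x2F byte in UTF-16.
print(offending_chars("\u042f\u062f", "utf-16-be"))

# The combining alternative to U+012F: the base letter "i" is 00 69
# in UTF-16, so it trips over the NUL restriction instead.
print(offending_chars("\u0069\u0328", "utf-16-be"))
```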

Similar considerations apply for any other multibyte encoding
that is NOT specifically and carefully designed to avoid ever
needing a byte of value 47 (0x2F) in order to represent ANY
character except a slash.  UTF-8, contrary to what I suggested
above, does happen to be designed that way (every byte of a
multi-byte UTF-8 sequence has its high bit set, so 0x2F can
only ever mean slash), but others (UTF-16, as shown above, or
ISO-2022-JP) are not -- and relying on the few that are still
falls WELL short of "any other encoding whatsoever" as you
claimed.
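
Whether a given encoding ever needs byte 0x2F for a non-slash character can be checked mechanically. The sketch below scans the Basic Multilingual Plane only, and the function name chars_needing_byte is an assumption made for this example, not part of any library:

```python
def chars_needing_byte(encoding, byte=0x2F, exempt="/"):
    """List BMP characters (other than exempt) whose encoding contains byte."""
    hits = []
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not encodable characters
        ch = chr(cp)
        if ch == exempt:
            continue
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        if byte in raw:
            hits.append(ch)
    return hits

for codec in ("latin-1", "utf-8", "utf-16-be"):
    print(codec, len(chars_needing_byte(codec)))
```

Under these assumptions the scan finds no such character for Latin-1 or UTF-8, and hundreds for UTF-16, the Cyrillic capital Ya among them.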

(Note in passing that careful avoidance of bytes with value
of ZERO for any character except NUL _is_ typical of the design
of several popular multi-byte encodings -- not UTF-16, which
needs a most significant byte of value 00 to represent any
character in the code range 1-255, i.e. characters that are
also in ASCII and ISO-8859-1 -- but I recall, from back when
I worked with Shift-JIS and EUC-JP in C, that one COULD
at least rely on a byte of value 0 always meaning
end-of-string, without accidentally hitting any such bytes
within the representation of any other character).
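
To make the end-of-string point concrete, here is a small sketch that models what a NUL-terminated C string would see of the same text in three encodings. The helper as_c_string is hypothetical and only imitates strlen-style scanning; the sample text is an arbitrary choice:

```python
def as_c_string(raw):
    """Return the prefix of raw up to (not including) the first NUL byte."""
    return raw.split(b"\x00", 1)[0]

text = "abc\u65e5\u672c"  # "abc" followed by two kanji

for codec in ("shift_jis", "euc_jp", "utf-16-le"):
    raw = text.encode(codec)
    # Compare the full encoded length with what C string code would see.
    print(codec, len(raw), len(as_c_string(raw)))
```

With Shift-JIS and EUC-JP the two lengths agree; with UTF-16 the string appears to end after the first byte of "a".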

> The context is Unicode filenames; Unicode filenames on Windows certainly
> have similar restrictions; you can't put _any_ character in there and
> expect it to work (for precisely the reasons; I suspect Windows would
> restrict them more, in fact).  Same goes for a UNIX filesystem, so it's
> not like in context that limitation wasn't already apparent.

Unicode-supporting Windows filesystems (NTFS, in particular,
under Windows/NT, /2000, /XP) certainly do restrict some of
the punctuation you're allowed to use in filenames -- but,
being specifically Unicode-aware and using whatever encoding
THEY choose, they do NOT arbitrarily forbid you to use letters
such as Arabic Dal, Cyrillic capital Ya, and so on, just because
of the value that a byte happens to have for such a character
in some multi-byte encoding or other.

So, what _am_ I missing?  Can you please explain in more detail
your original claim that:
"""
It means that filenames are strings of bytes.  What the meaning of those
bytes are is entirely application dependent.  They could be raw ASCII
(the most common), Latin-1 (probably the most common with filenames that
contain bytes with the MSB set), or any other encoding whatsoever.  It's
applications that make the files, it's applications that decide what
encoding to use.
"""

ASCII, Latin-1, ISO-8859-whatever -- sure.  But -- "any other
encoding whatsoever", "it's applications that decide"?  And
specifically UTF-8 and UTF-16, just as long as no _slash_ is
there, as you very specifically claim in this post?  I guess
I'm thick, since you keep claiming it's all so obvious and
apparent -- but, can you PLEASE patiently explain in words
of one syllable how my application could decide to use e.g.
UTF-16 and then name a file "Cyrillic upper Ya"+"Arabic Dal",
on a non-Unicode-aware Unix system?  Thanks!

Alex