[Python-3000] Unicode and OS strings
Stephen J. Turnbull
stephen at xemacs.org
Wed Sep 19 18:12:51 CEST 2007
Victor Stinner writes:
> On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote:
> > What should happen when a command line argument or an environment
> > variable is not decodable using the system encoding (on Unix where
> > from the OS point of view it is an array of bytes)?
> On Linux, filenames are *byte* strings and not *character* strings. I always
> have this problem with Python 2.x. I converted filenames (argv[x]) to Unicode
> to be able to format error messages in full Unicode... but it's not possible.
> Linux allows invalid UTF-8 filenames even on a full UTF-8 installation (Ubuntu),
> see Marcin's examples.
This should be solved by providing library facilities to handle these
conditions. Users and programmers may "know" that file names are
actually raw bytes obeying a set of restrictions unique to file names,
but they expect to be able to *use* them as characters, and 99.44% of
the time that just works. Even for the Japanese, who have
over 1500 years' experience in creating unusable writing systems.<wink>
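As a rough illustration of the kind of facility I mean, here is a minimal
sketch for the error-message case Victor describes. The name display_name
is hypothetical, and UTF-8 as the default encoding is an assumption. It
relies on the "backslashreplace" error handler, which current Python 3
accepts when decoding; that handler is not part of any proposal in this
thread.

```python
def display_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Return text that is safe to put in an error message.

    Hypothetical helper: bytes that do not decode in `encoding` are shown
    as \\xNN escapes instead of raising UnicodeDecodeError.
    """
    return raw.decode(encoding, errors="backslashreplace")


# A well-formed UTF-8 name decodes to the characters the user expects.
good = display_name(b"caf\xc3\xa9.txt")

# A Latin-1 name on a UTF-8 system is still printable, just escaped.
bad = display_name(b"caf\xe9.txt")
```

The result is for display only: the escaping is lossy in the sense that
you should not hand the returned string back to the OS as a file name.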
> So I propose to keep sys.argv as a byte string array. If you try to create
> Unicode strings, you will be unable to write a program to convert a filesystem
> with "broken" filenames (see the convmv program, for example) or open a file with
> broken "filename" (broken: invalid byte sequence for UTF/JIS/Big5/...
This is simply not true. Any of the proposals (Martin's, Marcin's,
James's, mine) will make this *possible*. It's just less convenient
for the programmer who wishes to deal with such situations. This
inconvenience is IMO more than balanced by the convenience for the
programmer who lives his life in ASCII or whose users just don't do
stuff like that, or who's writing a one-off script and doesn't care.
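To make "possible, just less convenient" concrete, here is a sketch of
one way a broken name can live in a character string and still be
recovered byte for byte. It uses the "surrogateescape" error handler
found in current Python 3, which I am using purely as an illustration;
it should not be read as the mechanism of any particular proposal named
above. The helper names to_text and to_bytes are hypothetical, and UTF-8
is an assumed system encoding.

```python
def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF)
    # rather than raising an exception.
    return raw.decode(encoding, errors="surrogateescape")


def to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    # The lone surrogates are turned back into the original bytes.
    return text.encode(encoding, errors="surrogateescape")


broken = b"caf\xe9.txt"      # Latin-1 bytes, invalid as UTF-8
as_text = to_text(broken)    # usable as a str inside the program
restored = to_bytes(as_text)
assert restored == broken    # nothing was lost on the way
```

The inconvenience shows up when such a string is printed or encoded
strictly: the lone surrogate will raise an error, so a program like
convmv has to ask for the bytes back explicitly.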
N.B. You don't need to go farther than your favorite rootkit to find
broken filenames such as "^J" (linefeed). This doesn't cause problems
specific to Unicode, of course, but it does demonstrate that a
library designed to help with weird file names has broader
applicability than just translation to Unicode strings.
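For instance, such a library might offer a check along the following
lines. This is only a sketch: has_control_chars is a hypothetical name,
and treating Unicode category "Cc" as the definition of "weird" is my
assumption, not something anyone has specified.

```python
import unicodedata


def has_control_chars(name: str) -> bool:
    """True if the name contains any control character, such as linefeed."""
    return any(unicodedata.category(ch) == "Cc" for ch in name)


suspicious = has_control_chars("evil\nname")   # embedded linefeed
ordinary = has_control_chars("README.txt")
```

Note that this check has nothing to do with the encoding question: the
linefeed name decodes perfectly well and is still a nuisance.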
 "99.44%" is an expression of "very pure" derived from an
advertising campaign for soap. Here it's an exaggeration, I guess,
but nobody knows how much.