[Python-Dev] PEP 277 (unicode filenames): please review
Martin v. Loewis
13 Aug 2002 00:50:51 +0200
Guido van Rossum <firstname.lastname@example.org> writes:
> But shouldn't it return Unicode whenever there are filenames in the
> directory that can't be represented as ASCII?
Unfortunately, on Windows, there is no way to find out: if you use the
ANSI function (which covers not only ASCII but the user's full code
page), and you have a file name not representable in that code page,
the system returns a file name that contains question marks.
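The question-mark behaviour can be imitated in pure Python. This is only a sketch under stated assumptions: it uses the cp1252 codec as a stand-in for the user's ANSI code page, it mimics the lossy conversion with the "replace" error handler rather than calling any Win32 function, and it is written for current Python 3, where str is the Unicode type and bytes is the byte string. The helper name ansi_view is hypothetical.

```python
# Stand-in for the user's ANSI code page (an assumption; real systems vary).
ANSI_CODEPAGE = "cp1252"


def ansi_view(name):
    """Imitate what an ANSI directory listing might report for a Unicode name.

    Characters outside the code page are replaced by '?', so the original
    name cannot be recovered from the result.
    """
    return name.encode(ANSI_CODEPAGE, errors="replace")


cyrillic = "\u0444\u0430\u0439\u043b.txt"   # four Cyrillic letters + ".txt"
accented = "caf\u00e9.txt"                  # representable in cp1252

print(ansi_view(cyrillic))   # the Cyrillic letters are lost
print(ansi_view(accented))   # survives as a non-ASCII byte string
```

Once the name has been reduced to question marks, the caller cannot tell it apart from a file whose name really consists of question marks, which is why the ANSI result alone cannot drive the decision.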
Of course, you could always use the Win32 Wide API (Unicode) function,
and convert the pure-ASCII strings into byte strings. That gives a
number of options:
- always return Unicode for Unicode directory argument,
- return Unicode only for non-ASCII, and only for Unicode argument,
- return Unicode only for non-ASCII, regardless of Unicode argument,
- return Unicode only for non-MBCS (again, either depending or not
  depending on whether the argument is Unicode).
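The options above can be compared side by side with a small sketch. It assumes the names come from the Wide API as Unicode and are then converted per policy; the policy labels, the function listdir_result, and the use of cp1252 in place of "mbcs" are all hypothetical choices made for illustration, and only the argument-independent variant of the MBCS option is shown. It is written for Python 3 (str is Unicode, bytes is the byte string).

```python
ANSI_CODEPAGE = "cp1252"  # stand-in for the "mbcs" codec (an assumption)


def encodes_as(name, codec):
    """Return True if the Unicode name can be encoded losslessly in codec."""
    try:
        name.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def listdir_result(name, policy, arg_is_unicode):
    """Return one directory entry as str (Unicode) or bytes under a policy."""
    if policy == "unicode-arg":                 # option 1
        keep_unicode = arg_is_unicode
    elif policy == "non-ascii-and-unicode-arg": # option 2
        keep_unicode = arg_is_unicode and not encodes_as(name, "ascii")
    elif policy == "non-ascii":                 # option 3
        keep_unicode = not encodes_as(name, "ascii")
    elif policy == "non-mbcs":                  # option 4, argument ignored
        keep_unicode = not encodes_as(name, ANSI_CODEPAGE)
    else:
        raise ValueError("unknown policy: %r" % (policy,))
    if keep_unicode:
        return name
    # Falling back to bytes is lossy for names outside the code page.
    return name.encode(ANSI_CODEPAGE, errors="replace")


for policy in ("non-ascii", "non-mbcs"):
    print(policy, repr(listdir_result("caf\u00e9.txt", policy, False)))
```

For a name such as "caf\u00e9.txt", which is outside ASCII but inside the code page, the third option yields a Unicode string while the MBCS option keeps the byte string.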
In the third case, if you have a non-representable file name, you
currently get a string like "??????.txt", whereas under that option
you would get u"\uabcd\uefgh...txt". What might be worse: if the file
name is representable in "mbcs", yet outside ASCII, you currently get
a "good" byte string, but you would get a Unicode string under option
three.
So the MBCS options sound better. Unfortunately, testing whether a
string encodes as MBCS might be expensive.
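One way to keep the MBCS test cheap in the common case is to check for pure ASCII first and attempt the encoding only for the remaining names. The sketch below assumes the code page is an ASCII superset (true for cp1252, used here in place of "mbcs"); fits_code_page is a hypothetical helper, and str.isascii() requires Python 3.7 or later.

```python
ANSI_CODEPAGE = "cp1252"  # stand-in for "mbcs" (an assumption)


def fits_code_page(name, codec=ANSI_CODEPAGE):
    """Return True if name can be encoded in codec without loss."""
    # Fast path: pure-ASCII names need no trial encoding, provided the
    # code page is an ASCII superset.
    if name.isascii():
        return True
    # Slow path: actually attempt the conversion.
    try:
        name.encode(codec)
    except UnicodeEncodeError:
        return False
    return True
```

Whether this is fast enough in practice would have to be measured; the sketch only shows that the expensive step can be skipped for ASCII names.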
> Hm, I don't know if I'd like os.listdir() to have an encoding
> argument. Sounds like the wrong solution somehow.
I don't like that, either.
> > Oh yes, the same reasoning would hold for readlink(), getcwd()
> > and any other call that returns filenames.
For readlink, if you trust FileSystemDefaultEncoding, you could return
a Unicode object if you find non-ASCII in the link contents.
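The readlink idea might look roughly like this. It is a sketch, not the PEP's implementation: readlink_result is a hypothetical helper that takes the raw link contents as bytes, and sys.getfilesystemencoding() is used as the Python-level view of the file system default encoding. It is written for Python 3.7 or later (bytes.isascii()).

```python
import sys


def readlink_result(raw_target, fs_encoding=None):
    """Return bytes for a pure-ASCII link target, else a decoded Unicode str.

    raw_target is the link contents as read from the file system.
    """
    if fs_encoding is None:
        fs_encoding = sys.getfilesystemencoding()
    if raw_target.isascii():
        return raw_target
    # This step is only correct if the link was written in fs_encoding,
    # i.e. if the file system default encoding can be trusted.
    return raw_target.decode(fs_encoding)


print(repr(readlink_result(b"target.txt", "utf-8")))
print(repr(readlink_result("caf\u00e9.txt".encode("utf-8"), "utf-8")))
```

If the link contents were written in some other encoding, the decode step either raises UnicodeDecodeError or silently produces the wrong characters, which is the sense in which the encoding has to be trusted.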
For getcwd, you again have the issue of reliably detecting non-ASCII
if you use the ANSI function; if you use the Wide function, you again
have the choice of returning Unicode only if non-ASCII, or only if