[Python-Dev] PEP 277 (unicode filenames): please review

Martin v. Loewis martin@v.loewis.de
13 Aug 2002 00:50:51 +0200


Guido van Rossum <guido@python.org> writes:

> But shouldn't it return Unicode whenever there are filenames in the
> directory that can't represented as ASCII?

Unfortunately, on Windows, there is no way to find out: If you use the
ANSI function (which not only covers ASCII, but the full user's code
page), and you have a file name not representable in this code page,
the system returns a file name that contains question marks.

Of course, you could always use the Win32 Wide API (unicode) function,
and convert the pure-ASCII strings into byte strings. That gives a
number of options:
- always return Unicode for Unicode directory argument,
- return Unicode only for non-ASCII, and only for Unicode argument,
- return Unicode only for non-ASCII, regardless of Unicode argument,
- return Unicode only for non-MBCS (again depending or not depending
  on whether the argument is Unicode).

In the third case, if you have a non-representable file name, you
currently get a string like "??????.txt", whereas you then get
u"\uabcd\uefgh...txt". What might be worse: If the file name is
representable in "mbcs", yet outside ASCII, you currently get a "good"
byte string, and you get a Unicode string under option three.

So the MBCS options sound better. Unfortunately, testing whether a
string encodes as MBCS might be expensive.

> Hm, I don't know if I'd like os.listdir() to have an encoding
> argument.  Sounds like the wrong solution somehow.

I don't like that, either.

> > Oh yes, the same reasoning would hold for readlink(), getcwd() 
> > and any other call that returns filenames.
> 
> Ditto.

For readlink, if you trust FileSystemDefaultEncoding, you could return
a Unicode object if you find non-ASCII in the link contents.

For getcwd, you again have the issue of reliably detecting non-ASCII
if you use the ANSI function; if you use the Wide function, you again
have the choice of returning Unicode only if non-ASCII, or only if
non-MBCS.

Regards,
Martin