[Python-Dev] PEP 277 (unicode filenames): please review

Guido van Rossum guido@python.org
Mon, 12 Aug 2002 20:15:40 -0400

> Unfortunately, on Windows, there is no way to find out: If you use the
> ANSI function (which covers not only ASCII, but the user's full code
> page), and you have a file name not representable in this code page,
> the system returns a file name that contains question marks.
> Of course, you could always use the Win32 Wide API (Unicode) function,
> and convert the pure-ASCII strings into byte strings. That gives a
> number of options:
> - always return Unicode for Unicode directory argument,
> - return Unicode only for non-ASCII, and only for Unicode argument,
> - return Unicode only for non-ASCII, regardless of Unicode argument,
> - return Unicode only for non-MBCS (again, either depending or not
>   depending on whether the argument is Unicode).
> In the third case, if you have a non-representable file name, you
> currently get a string like "??????.txt", whereas you then get
> u"\uabcd\uefgh...txt". What might be worse: If the file name is
> representable in "mbcs", yet outside ASCII, you currently get a "good"
> byte string, and you get a Unicode string under option three.

Why is getting Unicode worse than getting MBCS?  #3 looks right to me...

> So the MBCS options sound better. Unfortunately, testing whether a
> string encodes as MBCS might be expensive.

I still don't fully understand MBCS.  I know there's a variable
assignment of codes to the upper half of the 8-bit space, based on a
user setting.  But is that always a simple mapping to 128 non-ASCII
characters, or are there multi-byte codes that expand the total
character set to more than 256?
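
One way to probe the question is to measure how many bytes a single
character takes in different Windows code pages.  This is only a sketch
in present-day Python: the "mbcs" codec exists only on Windows, so
cp1252 and cp932 are used as stand-ins, and the helper name
encoded_width is hypothetical.

```python
# Stand-in code pages; "mbcs" itself is a Windows-only codec alias
# for whatever the user's ANSI code page happens to be.
samples = {
    "cp1252": "\u00e9",  # Western European code page: e with acute accent
    "cp932": "\u3042",   # Japanese code page: hiragana letter A
}


def encoded_width(char, codepage):
    """Return how many bytes one character occupies in the given code page."""
    return len(char.encode(codepage))


# Map each code page to the byte width of its sample character.
widths = {cp: encoded_width(ch, cp) for cp, ch in samples.items()}
print(widths)
```

A width greater than one for some code page would indicate that it is
not a plain 128-character extension of ASCII.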

> For readlink, if you trust FileSystemDefaultEncoding, you could return
> a Unicode object if you find non-ASCII in the link contents.

What is FileSystemDefaultEncoding and when can you trust it?

> For getcwd, you again have the issue of reliably detecting non-ASCII
> if you use the ANSI function; if you use the Wide function, you again
> have the choice of returning Unicode only if non-ASCII, or only if
> non-MBCS.

Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
app has to deal with MBCS the better, it seems to me.
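
The two choices mentioned for getcwd (Unicode only if non-ASCII, or
only if non-MBCS) can be sketched as one function that takes the name
as returned by the Wide API and narrows it when possible.  This is an
illustration under stated assumptions, not the PEP's implementation:
it is written in present-day Python, where bytes stands in for the
byte string, the helper name narrow_if_encodable is hypothetical, and
cp1252 stands in for "mbcs".

```python
def narrow_if_encodable(name, encoding="ascii"):
    """Return a byte string if `name` fits in `encoding`, else `name` unchanged.

    encoding="ascii" models "Unicode only if non-ASCII"; passing a code
    page such as "cp1252" models "Unicode only if non-MBCS".
    """
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        # Not representable: keep the Unicode name instead of "?????".
        return name


print(narrow_if_encodable("readme.txt"))
print(narrow_if_encodable("r\u00e9sum\u00e9.txt"))
print(narrow_if_encodable("r\u00e9sum\u00e9.txt", "cp1252"))
```

Under the ASCII policy the caller only ever sees pure-ASCII byte
strings or Unicode; under the code-page policy it may also see
non-ASCII bytes whose meaning depends on the user's code page.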

--Guido van Rossum (home page: http://www.python.org/~guido/)