[Python-Dev] PEP 277 (unicode filenames): please review

Guido van Rossum guido@python.org
Tue, 13 Aug 2002 05:41:54 -0400

> > Why is getting Unicode worse than getting MBCS?  #3 looks right to me...
> If people do
> out = open("names.txt","w")
> for f in os.listdir("."):
>   print >>out, f
> then this will print all filenames in mbcs. Under your proposed
> change, it will raise a UnicodeError.

OK, you've convinced me.  I guess the best compromise then is 8-bit
in, MBCS out, and Unicode in, Unicode out.
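That compromise amounts to: the type of the argument to os.listdir()
determines the type of the results.  A minimal sketch in later-Python
spelling (Python 3, where the 8-bit/Unicode split became bytes vs. str):

```python
import os
import tempfile

# Sketch of the rule: the type of the argument to os.listdir()
# determines the type of the results (modern Python 3 spelling,
# where the 8-bit/Unicode split became bytes vs. str).
d = tempfile.mkdtemp()
open(os.path.join(d, "names.txt"), "w").close()

str_names = os.listdir(d)                  # str in  -> str (Unicode) out
bytes_names = os.listdir(os.fsencode(d))   # bytes in -> bytes out
```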

> > I still don't fully understand MBCS.  I know there's a variable
> > assignment of codes to the upper half of the 8-bit space, based on a
> > user setting.  But is that always a simply mapping to 128 non-ASCII
> > characters, or are there multi-byte codes that expand the total
> > character set to more than 256?
> Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
> code page", CP_ACP, which varies with the localization. They currently
> use:
> code page region                 encoding style
> 1250      Central Europe         8-bit
> 1251      Cyrillic               8-bit
> 1252      Western Europe         8-bit
> 1253      Greek                  8-bit
> 1254      Turkish                8-bit
> 1255      Hebrew                 8-bit
> 1256      Arabic                 8-bit
> 1257      Baltic                 8-bit
> 1258      Vietnamese             8-bit
> 874       Thai                   multi-byte
> 932       Japan                  Shift-JIS, multi-byte
> 936       Simplified Chinese     GB2312, multi-byte
> 949       Korea                  multi-byte
> 950       Traditional Chinese    BIG5, multi-byte
> The multi-byte codes fall in two categories: those that use bytes <128
> for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
> latter ones restrict themselves to bytes >=128 for multi-byte
> characters (I believe this is what the Shift in Shift-JIS tries to
> indicate).
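A concrete instance of the first category, using Python's big5 codec
(the character is just a well-known example): the trail byte of a
two-byte character can land in the ASCII range, so a naive
byte-oriented scan can mistake it for an ASCII character.

```python
# In Big5 (code page 950), U+4E00 ("一") encodes as the two bytes
# 0xA4 0x40 -- and the trail byte 0x40 is ASCII "@", i.e. a byte
# < 128 occurring inside a multi-byte character.
encoded = "\u4e00".encode("big5")
```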

Aha!  So MBCS is not an encoding: it's an indirection for a variety of
encodings.  (Is there a way to find out what the encoding is?)
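(There is, at least in later Pythons: locale.getpreferredencoding(),
added in Python 2.3, names the encoding that "mbcs" stands in for.
A sketch:)

```python
import locale

# The concrete encoding behind "mbcs" can be asked for at runtime:
# on Windows this reports the ANSI code page ("cp1252", "cp932", ...);
# on Unix it reports the locale's encoding.
enc = locale.getpreferredencoding()
```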

> > > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > > a Unicode object if you find non-ASCII in the link contents.
> > 
> > What is FileSystemDefaultEncoding and when can you trust it?
> It's a global variable (really called Py_FileSystemDefaultEncoding),
> introduced by Mark Hammond, and should be set to the encoding that the
> operating system uses to encode file names, on the file system API.
> On Windows, this is reliably CP_ACP/"mbcs".

Do you mean that the condition on

#if defined(HAVE_LANGINFO_H) && defined(CODESET)

is reliably false on Windows?  Otherwise _locale.setlocale() could set
it to something else.

> On Unix, it is the locale's encoding by convention, which is set
> only if setlocale(LC_CTYPE,"") was called. Some Unix users may not
> follow the convention, or may have file names which cannot be
> represented in their locale's encoding.
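That variable later grew a Python-level accessor,
sys.getfilesystemencoding() (added after this discussion, in Python
2.3); on Unix it reflects the locale only once setlocale(LC_CTYPE, "")
has been called, per the convention above.  A sketch:

```python
import locale
import sys

# Unix convention: the filename encoding is the locale's encoding,
# which takes effect only after setlocale(LC_CTYPE, "").
locale.setlocale(locale.LC_CTYPE, "")

# Python-level view of Py_FileSystemDefaultEncoding (Python 2.3+).
fs_enc = sys.getfilesystemencoding()
```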

So as long as they use 8-bit it's not our problem, right.  Another
reason to avoid producing Unicode without a clue that the app expects
Unicode (alas).  (BTW I find a Unicode argument to os.listdir() a
sufficient clue.  IOW os.listdir(u".") should return Unicode.)

> > Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
> > app has to deal with MBCS the better, it seems to me.
> Ok, I'll update the PEP.

To what?  (It would be bad if I convinced you at the same time you
convinced me of the opposite. :-)

> You may have been under the impression that MBCS is only relevant in
> Far East, so let me stress this point: It applies to all windows
> versions, e.g. a user of a French installation who has a file named
> C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA (bug #509117)
> currently gets a byte string when listing C:\Docs\Boulot, but will
> get a Unicode string under the modified PEP 277.

No, I was aware of that part.  I guess they should get MBCS on
os.listdir('C:\\Docs\\Boulot') but Unicode on
os.listdir(u'C:\\Docs\\Boulot').
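The scenario from that bug report can be reproduced on any platform
whose filesystem encoding can represent the name (the accented name
below is the one from the report; Python 3 spelling):

```python
import os
import tempfile

# A directory holding an accented name, listed both ways.  Assumes
# the filesystem encoding can represent "é" (true for UTF-8 locales
# and for cp1252).
d = tempfile.mkdtemp()
os.mkdir(os.path.join(d, "S\u00e9minaire"))

unicode_names = os.listdir(d)             # str in  -> str out
byte_names = os.listdir(os.fsencode(d))   # bytes in -> bytes out
```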

--Guido van Rossum (home page: http://www.python.org/~guido/)