[Python-Dev] PEP 277 (unicode filenames): please review

13 Aug 2002 08:51:28 +0200

Guido van Rossum <guido@python.org> writes:

> Why is getting Unicode worse than getting MBCS?  #3 looks right to me...

If people do

out = open("names.txt", "w")
for f in os.listdir("."):
  print >>out, f

then this will write all filenames in mbcs. Under your proposed
change, it will raise a UnicodeError as soon as a file name contains
a non-ASCII character.
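
The failure can be reproduced in isolation. This is a minimal sketch in
present-day Python syntax, not the Python 2.2 code path itself: it assumes
that writing a Unicode object to a byte-oriented file goes through an
implicit ASCII encode, and spells that encode out explicitly. The helper
name to_file_bytes is made up for illustration.

```python
def to_file_bytes(name):
    """Encode a Unicode file name the way an implicit ASCII conversion would."""
    return name.encode("ascii")

# A pure-ASCII name survives the conversion.
ascii_result = to_file_bytes("names.txt")

# A name with a non-ASCII character does not.
try:
    to_file_bytes("S\u00e9minaire")
    failed = False
except UnicodeError:  # UnicodeEncodeError is a subclass of UnicodeError
    failed = True
```

So a single accented file name in the directory is enough to abort the
whole loop.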

> I still don't fully understand MBCS.  I know there's a variable
> assignment of codes to the upper half of the 8-bit space, based on a
> user setting.  But is that always a simple mapping to 128 non-ASCII
> characters, or are there multi-byte codes that expand the total
> character set to more than 256?

Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
code page", CP_ACP, which varies with the localization. They currently
use:

code page region                 encoding style
1250      Central Europe         8-bit
1251      Cyrillic               8-bit
1252      Western Europe         8-bit
1253      Greek                  8-bit
1254      Turkish                8-bit
1255      Hebrew                 8-bit
1256      Arabic                 8-bit
1257      Baltic                 8-bit
1258      Vietnamese             8-bit

874       Thai                   multi-byte
932       Japan                  Shift-JIS, multi-byte
936       Simplified Chinese     GB2312, multi-byte
949       Korea                  multi-byte
950       Traditional Chinese    BIG5, multi-byte

The multi-byte codes fall in two categories: those that use bytes <128
for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
latter ones restrict themselves to bytes >=128 for multi-byte
characters (I believe this is what the Shift in Shift-JIS tries to
indicate).
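
The difference between the 8-bit and the multi-byte rows of the table can
be checked by counting encoded bytes. This is only a sketch: it assumes
that the codecs bundled with Python under the names cp1252, cp932 and
cp950 are acceptable stand-ins for the Windows code pages of the same
number, and encoded_length is a hypothetical helper.

```python
def encoded_length(text, code_page):
    """Return how many bytes `text` occupies in the given code page."""
    return len(text.encode(code_page))

samples = [
    ("\u00e9", "cp1252"),  # e with acute accent, Western Europe
    ("\u65e5", "cp932"),   # a kanji, Japan
    ("\u4e2d", "cp950"),   # a Chinese character, Traditional Chinese
    ("a", "cp932"),        # plain ASCII in a multi-byte code page
]
lengths = [encoded_length(text, cp) for text, cp in samples]
```

In the 8-bit code page the accented letter takes one byte; in the
multi-byte code pages the ideographs take two, while ASCII characters
still take one.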

> > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > a Unicode object if you find non-ASCII in the link contents.
>
> What is FileSystemDefaultEncoding and when can you trust it?

It's a global variable (really called Py_FileSystemDefaultEncoding),
introduced by Mark Hammond, and should be set to the encoding that the
operating system uses for file names in the file system API.

On Windows, this is reliably CP_ACP/"mbcs". On Unix, it is the
locale's encoding by convention, which is set only if
setlocale(LC_CTYPE,"") was called. Some Unix users may not follow the
convention, or may have file names which cannot be represented in
their locale's encoding.
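
For illustration, the two values can be inspected from Python code. This
sketch relies on sys.getfilesystemencoding(), which later Python releases
provide as the Python-level view of this variable, and on the locale
module; describe_encodings is a hypothetical helper, and the values it
reports depend entirely on the platform and the user's settings.

```python
import locale
import sys

def describe_encodings():
    """Report the file system encoding and the locale's encoding."""
    try:
        # Adopt the user's locale settings, as the convention above requires.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale settings; keep whatever locale is active
    return {
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }

info = describe_encodings()
```

Neither value guarantees that every existing file name is actually
encoded that way; that is the caveat about Unix users above.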

> Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
> app has to deal with MBCS the better, it seems to me.

Ok, I'll update the PEP.

You may have been under the impression that MBCS is only relevant in
the Far East, so let me stress this point: it applies to all Windows
versions. For example, a user of a French installation who has a file
named C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA (bug #509117)
currently gets a byte string when listing C:\Docs\Boulot, but will
get a Unicode string under the modified PEP 277.
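
The "Unicode only if non-ASCII" rule can be sketched as follows. This is
an assumption about how the rule might look, written in present-day
Python, and not the implementation in the PEP: it supposes the name has
already been obtained from the wide API as a Unicode string, and
listing_entry is a hypothetical helper.

```python
def listing_entry(wide_name):
    """Return a byte string for pure-ASCII names, else the Unicode name."""
    try:
        return wide_name.encode("ascii")
    except UnicodeEncodeError:
        return wide_name

plain = listing_entry("Boulot")
accented = listing_entry("S\u00e9minaireLORIA-jan2002")
```

Here plain comes back as a byte string and accented as a Unicode string,
so only directories containing non-ASCII names change behaviour.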

Regards,
Martin