[Python-Dev] PEP 277 (unicode filenames): please review
Martin v. Loewis
martin@v.loewis.de
13 Aug 2002 08:51:28 +0200
Guido van Rossum <guido@python.org> writes:
> Why is getting Unicode worse than getting MBCS? #3 looks right to me...
If people do
    import os
    out = open("names.txt", "w")
    for f in os.listdir("."):
        print >>out, f
then this will print all filenames in mbcs. Under your proposed
change, it will raise a UnicodeError.
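The failure mode can be reproduced in a current Python 3 interpreter. This is only a sketch: the helper write_names, the sample file names, and the in-memory stream are illustrative assumptions, and the "ascii" stream stands in for a file whose default encoding cannot represent the name.

```python
import io

def write_names(names, encoding):
    """Write each name on its own line to an in-memory stream using `encoding`."""
    raw = io.BytesIO()
    out = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    for name in names:
        out.write(name + "\n")
    out.flush()
    return raw.getvalue()

# Hypothetical directory listing containing one non-ASCII name.
names = ["readme.txt", "S\u00e9minaire"]

try:
    write_names(names, "ascii")
    failed = False
except UnicodeError:
    # The non-ASCII name cannot be encoded, so the write fails.
    failed = True

# With an explicit 8-bit code page the same names are written as bytes.
encoded = write_names(names, "cp1252")
```

Choosing the encoding explicitly at the point of output is what lets the same loop work for both byte-oriented and Unicode file names.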
> I still don't fully understand MBCS. I know there's a variable
> assignment of codes to the upper half of the 8-bit space, based on a
> user setting. But is that always a simple mapping to 128 non-ASCII
> characters, or are there multi-byte codes that expand the total
> character set to more than 256?
Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
code page", CP_ACP, which varies with the localization. They currently
use:
code page  region               encoding style
1250       Central Europe       8-bit
1251       Cyrillic             8-bit
1252       Western Europe       8-bit
1253       Greek                8-bit
1254       Turkish              8-bit
1255       Hebrew               8-bit
1256       Arabic               8-bit
1257       Baltic               8-bit
1258       Vietnamese           8-bit
874        Thai                 multi-byte
932        Japan                Shift-JIS, multi-byte
936        Simplified Chinese   GB2312, multi-byte
949        Korea                multi-byte
950        Traditional Chinese  BIG5, multi-byte
The multi-byte codes fall in two categories: those that use bytes <128
for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
latter ones restrict themselves to bytes >=128 for multi-byte
characters (I believe this is what the Shift in Shift-JIS tries to
indicate).
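The difference between an 8-bit and a multi-byte code page can be observed with Python's bundled codecs. The sketch below assumes a current Python 3 interpreter; encoded_lengths is a hypothetical helper, and the codec names "cp1252", "cp932" and "cp950" are Python's names for the corresponding Windows code pages.

```python
def encoded_lengths(text, code_page):
    """Return the number of bytes each character of `text` occupies in `code_page`."""
    return [len(ch.encode(code_page)) for ch in text]

# An 8-bit code page: every character is one byte.
western = encoded_lengths("\u00e9", "cp1252")        # é

# Multi-byte code pages: ASCII stays one byte, CJK characters take two.
japanese = encoded_lengths("\u65e5\u672c", "cp932")  # 日本
chinese = encoded_lengths("a\u65e5", "cp950")        # a日
```

Here western is [1], japanese is [2, 2] and chinese is [1, 2], which is why code that assumes one byte per character breaks on the multi-byte code pages.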
> > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > a Unicode object if you find non-ASCII in the link contents.
>
> What is FileSystemDefaultEncoding and when can you trust it?
It's a global variable (really called Py_FileSystemDefaultEncoding),
introduced by Mark Hammond, and should be set to the encoding that the
operating system uses for file names in its file system API.
On Windows, this is reliably CP_ACP/"mbcs". On Unix, it is the
locale's encoding by convention, which is set only if
setlocale(LC_CTYPE,"") was called. Some Unix users may not follow the
convention, or may have file names which cannot be represented in
their locale's encoding.
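At the Python level, the value can be inspected roughly as follows. This is a sketch for a current Python 3 interpreter, not the C-level variable itself; filename_encoding_report and decodable are hypothetical helpers, and the sample byte string is an assumed Latin-1 style file name.

```python
import locale
import sys

def filename_encoding_report():
    """Adopt the user's locale, then report the encodings Python will use."""
    try:
        # Equivalent of the C call setlocale(LC_CTYPE, "").
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed; keep the default.
        pass
    return {
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
    }

def decodable(raw_name, encoding):
    """Return True if the raw file name bytes are valid in `encoding`."""
    try:
        raw_name.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

report = filename_encoding_report()
```

A name such as b"S\xe9minaire" is decodable as "cp1252" but not as "utf-8", which illustrates the Unix caveat: a file name written under one convention may not be representable under the encoding the current locale announces.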
> Wide + Unicode (if non-ASCII) sounds good to me. The fewer places an
> app has to deal with MBCS the better, it seems to me.
Ok, I'll update the PEP.
You may have been under the impression that MBCS is only relevant in
the Far East, so let me stress this point: It applies to all Windows
versions, e.g. a user of a French installation who has a file named
C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA (bug #509117)
currently gets a byte string when listing C:\Docs\Boulot, but will
get a Unicode string under the modified PEP 277.
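For comparison, current Python 3 lets the caller choose between the two result types through the type of the argument. The sketch below assumes Python 3 behaviour, which differs from the interpreter discussed in this thread; list_both_ways is a hypothetical helper.

```python
import os

def list_both_ways(directory):
    """List `directory` once as text names and once as raw byte names."""
    as_text = os.listdir(directory)                # str argument: str results
    as_bytes = os.listdir(os.fsencode(directory))  # bytes argument: bytes results
    return as_text, as_bytes
```

Calling list_both_ways(".") returns the same entries twice, first as Unicode strings and then as byte strings in the file system encoding.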
Regards,
Martin