[Python-Dev] PEP 277 (unicode filenames): please review

Guido van Rossum guido@python.org
Tue, 13 Aug 2002 05:41:54 -0400

> > Why is getting Unicode worse than getting MBCS?  #3 looks right to me...
> If people do
> out = open("names.txt","w")
> for f in os.listdir("."):
>   print >>out, f
> then this will print all filenames in mbcs. Under your proposed
> change, it will raise a UnicodeError.

OK, you've convinced me.  I guess the best compromise then is 8-bit
in, MBCS out, and Unicode in, Unicode out.
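That compromise amounts to: the type of the argument to os.listdir()
determines the type of the results.  A minimal sketch in later-Python
spelling (Python 3, where the 8-bit/Unicode split became bytes vs. str):

```python
import os
import tempfile

# Sketch of the rule: the type of the argument to os.listdir()
# determines the type of the results (modern Python 3 spelling,
# where the 8-bit/Unicode split became bytes vs. str).
d = tempfile.mkdtemp()
open(os.path.join(d, "names.txt"), "w").close()

str_names = os.listdir(d)                  # str in  -> str (Unicode) out
bytes_names = os.listdir(os.fsencode(d))   # bytes in -> bytes out
```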

> > I still don't fully understand MBCS.  I know there's a variable
> > assignment of codes to the upper half of the 8-bit space, based on a
> > user setting.  But is that always a simply mapping to 128 non-ASCII
> > characters, or are there multi-byte codes that expand the total
> > character set to more than 256?
> Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
> code page", CP_ACP, which varies with the localization. They currently
> use:
> code page region                 encoding style
> 1250      Central Europe         8-bit
> 1251      Cyrillic               8-bit
> 1252      Western Europe         8-bit
> 1253      Greek                  8-bit
> 1254      Turkish                8-bit
> 1255      Hebrew                 8-bit
> 1256      Arabic                 8-bit
> 1257      Baltic                 8-bit
> 1258      Vietnamese             8-bit
> 874       Thai                   multi-byte
> 932       Japan                  Shift-JIS, multi-byte
> 936       Simplified Chinese     GB2312, multi-byte
> 949       Korea                  multi-byte
> 950       Traditional Chinese    BIG5, multi-byte
> The multi-byte codes fall in two categories: those that use bytes <128
> for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
> latter ones restrict themselves to bytes >=128 for multi-byte
> characters (I believe this is what the Shift in Shift-JIS tries to
> indicate).
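A concrete instance of the first category, using Python's big5 codec
(the character is just a well-known example): the trail byte of a
two-byte character can land in the ASCII range, so a naive
byte-oriented scan can mistake it for an ASCII character.

```python
# In Big5 (code page 950), U+4E00 ("一") encodes as the two bytes
# 0xA4 0x40 -- and the trail byte 0x40 is ASCII "@", i.e. a byte
# < 128 occurring inside a multi-byte character.
encoded = "\u4e00".encode("big5")
```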

Aha!  So MBCS is not an encoding: it's an indirection for a variety of
encodings.  (Is there a way to find out what the encoding is?)
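(There is, at least in later Pythons: locale.getpreferredencoding(),
added in Python 2.3, names the encoding that "mbcs" stands in for.
A sketch:)

```python
import locale

# The concrete encoding behind "mbcs" can be asked for at runtime:
# on Windows this reports the ANSI code page ("cp1252", "cp932", ...);
# on Unix it reports the locale's encoding.
enc = locale.getpreferredencoding()
```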

> > > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > > a Unicode object if you find non-ASCII in the link contents.
> > 
> > What is FileSystemDefaultEncoding and when can you trust it?
> It's a global variable (really called Py_FileSystemDefaultEncoding),
> introduced by Mark Hammond, and should be set to the encoding that the
> operating system uses to encode file names, on the file system API.
> On Windows, this is reliably CP_ACP/"mbcs".

Do you mean that the condition on

#if defined(HAVE_LANGINFO_H) && defined(CODESET)

is reliably false on Windows?  Otherwise _locale.setlocale() could set
it to something else.

> On Unix, it is the locale's encoding by convention, which is set
> only if setlocale(LC_CTYPE,"") was called. Some Unix users may not
> follow the convention, or may have file names which cannot be
> represented in their locale's encoding.
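That variable later grew a Python-level accessor,
sys.getfilesystemencoding() (added after this discussion, in Python
2.3); on Unix it reflects the locale only once setlocale(LC_CTYPE, "")
has been called, per the convention above.  A sketch:

```python
import locale
import sys

# Unix convention: the filename encoding is the locale's encoding,
# which takes effect only after setlocale(LC_CTYPE, "").
locale.setlocale(locale.LC_CTYPE, "")

# Python-level view of Py_FileSystemDefaultEncoding (Python 2.3+).
fs_enc = sys.getfilesystemencoding()
```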

So as long as they use 8-bit it's not our problem, right.  Another
reason to avoid producing Unicode without a clue that the app expects
Unicode (alas).  (BTW I find a Unicode argument to os.listdir() a
sufficient clue.  IOW os.listdir(u".") should return Unicode.)

> > Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
> > app has to deal with MBCS the better, it seems to me.
> Ok, I'll update the PEP.

To what?  (It would be bad if I convinced you at the same time you
convinced me of the opposite. :-)

> You may have been under the impression that MBCS is only relevant in
> Far East, so let me stress this point: It applies to all windows
> versions, e.g. a user of a French installation who has a file named
> C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA (bug #509117)
> currently gets a byte string when listing C:\Docs\Boulot, but will
> get a Unicode string under the modified PEP 277.

No, I was aware of that part.  I guess they should get MBCS on
os.listdir('C:\\Docs\\Boulot') but Unicode on
os.listdir(u'C:\\Docs\\Boulot').
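The scenario from that bug report can be reproduced on any platform
whose filesystem encoding can represent the name (the accented name
below is the one from the report; Python 3 spelling):

```python
import os
import tempfile

# A directory holding an accented name, listed both ways.  Assumes
# the filesystem encoding can represent "é" (true for UTF-8 locales
# and for cp1252).
d = tempfile.mkdtemp()
os.mkdir(os.path.join(d, "S\u00e9minaire"))

unicode_names = os.listdir(d)             # str in  -> str out
byte_names = os.listdir(os.fsencode(d))   # bytes in -> bytes out
```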

--Guido van Rossum (home page: http://www.python.org/~guido/)