[Python-Dev] Unicode Imports

Sat Sep 9 23:22:10 CEST 2006

Martin v. Löwis wrote:
> David Hopwood wrote:
> 
>>On Windows, file system pathnames can contain arbitrary Unicode characters
>>(well, almost). Despite the existence of "ANSI" filesystem APIs, and
>>regardless of what 'sys.getfilesystemencoding()' returns, the underlying
>>file system encoding for NTFS and FAT filesystems is UTF-16LE.
>>
>>Thus, either:
>> - the fact that sys.getfilesystemencoding() returns a non-Unicode encoding
>>   on Windows is a bug, or
>> - any program that relies on sys.getfilesystemencoding() being able to
>>   encode arbitrary Windows pathnames has a bug.
>>
>>We need to decide which of these is the case.
> 
> There is a third option:
> - the operating system has a bug

This behaviour is by design. If it is a bug, then it is a "won't ever fix --
no way, no how" bug, which Python must accommodate if it is to properly support
Unicode on Windows.

> It is actually this option that rules out the other two.
> sys.getfilesystemencoding() returns "mbcs" on Windows, which means
> CP_ACP. The file system encoding is an encoding that converts a
> file name into a byte string. Unfortunately, on Windows, there are
> file names which cannot be converted into a byte string in a standard
> manner. This is an operating system bug (or mis-design; they should
> have chosen UTF-8 as the byte encoding of file names, instead of
> making it depend on the system locale, but they of course did so
> for backwards compatibility with Windows 3.1 and 9x).

Although UTF-8 was invented (in September 1992) before the release of the
first version of NT supporting NTFS (NT 3.1, in July 1993), the decision to
use Unicode in NTFS and in Windows NT's file APIs had already been made by
the time UTF-8 existed.

(I believe OS/2 HPFS did not support Unicode, even though NTFS was otherwise
almost identical to it.)

At that time, the decision to use Unicode at all was quite forward-looking;
the final version of Unicode 1.0 had only been published in June 1992
(although it had been approved earlier; see <http://www.unicode.org/history/>).

UTF-8 was only officially added to the Unicode standard in an appendix of
Unicode 2.0 (published July 1996), and only given essentially equal status to
UTF-16 and UTF-32 in Unicode 3.0 (September 1999).

> As a side note: every encoding in Python is a Unicode encoding;
> so there aren't any "non-Unicode encodings".

It was clear from context that I meant "encoding capable of representing
all Unicode characters".
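
To make that distinction concrete, here is a minimal sketch. It assumes
cp1252 as a stand-in for whatever "ANSI" code page a given Windows system
uses, and the file name is made up for illustration. The helper name
encodable() is likewise hypothetical, not part of any library:

```python
def encodable(name, encoding):
    """Return True if every character of `name` can be encoded."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Hypothetical module file name containing three Japanese characters.
name = u"\u65e5\u672c\u8a9e.py"

# cp1252 stands in for a locale-dependent "ANSI" code page (an assumption);
# utf-8 and utf-16-le can each represent every Unicode character.
results = dict((enc, encodable(name, enc))
               for enc in ("cp1252", "utf-8", "utf-16-le"))
```

Here results["cp1252"] is False while the other two entries are True, which
is the sense of "non-Unicode encoding" I intended.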

> Programs that rely on sys.getfilesystemencoding() being able to
> represent arbitrary file names on Windows might have a bug;
> programs that rely on sys.getfilesystemencoding() being able
> to encode all elements of sys.path do not (at least not for
> Python 2.5 and earlier).

Elements of sys.path can be Unicode strings in Python 2.5, and should be
pathnames supported by the underlying OS. Where is it documented that there
is any further restriction on them? And why should there be any further
restriction on them?
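
As a rough illustration of which entries would be affected, the following
sketch lists the Unicode elements of a path list that a given encoding
cannot represent. The function name unencodable_entries() and the example
directories are hypothetical, and the strict encode used here is only an
approximation of what any particular import implementation does:

```python
import sys

def unencodable_entries(paths, encoding):
    """Return the Unicode entries of `paths` that `encoding` cannot encode."""
    unicode_type = type(u"")
    bad = []
    for entry in paths:
        if not isinstance(entry, unicode_type):
            continue  # byte strings and non-string entries are skipped
        try:
            entry.encode(encoding)
        except UnicodeEncodeError:
            bad.append(entry)
    return bad

# Made-up directories; cp1252 stands in for an "ANSI" code page.
example = [u"C:\\lib", u"C:\\\u65e5\u672c\u8a9e"]
rejected = unencodable_entries(example, "cp1252")
```

The same check can be run against the live interpreter with
unencodable_entries(sys.path, sys.getfilesystemencoding()).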

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>