[Python-Dev] fun with unicode, part 1

Neil Hodgson nhodgson@bigpond.net.au
Tue, 2 May 2000 18:22:36 +1000


> > I'd like to solve this problem, but I have some questions: what *IS*
> > the encoding used for filenames on Windows?  This may differ per
> > Windows version; perhaps it can differ per drive letter?  Or per
> > application or per thread?  On Windows NT, filenames are supposed to
> > be Unicode.  (I suppose also on Windows 2000?)  How do I open a file
> > with a given Unicode string for its name, in a C program?  I suppose
> > there's a Win32 API call for that which has a Unicode variant.

   It's decided by each file system.

   For FAT file systems, the OEM code page is used. The OEM code page
generally used in the United States is code page 437, which is different from
the code page Windows uses for display. I had to deal with this in a system
where people used fractions (1/4, 1/2 and 3/4) as part of names which had to
be converted into valid file names. For example, the single-character 1/4 is
0xBC for display but 0xAC when used in a file name.
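
   The display/file-name mismatch can be checked with Python's codecs. This
is only a sketch: it assumes the display code page is 1252 (the usual US
Windows one, not named in the text above) and that the OEM code page is 437.
The helper names byte_in and fmt are made up for the illustration.

```python
# Assumptions: cp1252 is the display (ANSI) code page, cp437 the OEM one.
DISPLAY_CODEPAGE = "cp1252"
OEM_CODEPAGE = "cp437"


def byte_in(codepage, char):
    """Return the single byte value of char in codepage, or None if the
    code page has no such character."""
    try:
        return char.encode(codepage)[0]
    except UnicodeEncodeError:
        return None


def fmt(value):
    """Format a byte value as hex, or 'n/a' when the character is missing."""
    return "n/a" if value is None else "0x%02X" % value


# The three fraction characters, written as escapes to keep this ASCII-only.
fractions = [("1/4", "\u00bc"), ("1/2", "\u00bd"), ("3/4", "\u00be")]

for label, char in fractions:
    print(label,
          "display:", fmt(byte_in(DISPLAY_CODEPAGE, char)),
          "file name:", fmt(byte_in(OEM_CODEPAGE, char)))
```

   A character that one code page has and the other lacks shows up as "n/a",
which is the case where a converter has to substitute something else.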

   In Japan, I think different manufacturers used different encodings with
NEC trying to maintain market control with their own encoding.

   VFAT stores both Unicode long file names and shortened aliases. However,
the Unicode variant is hard to get to from Windows 95/98.

   NTFS stores Unicode.

> > On Windows 95/98, the Unicode variants of the Win32 API calls don't
> > exist.  So what is the poor Python runtime to do there?

   Fail the call. All existing files can be opened because they have short
non-Unicode aliases. If a file with a Unicode name cannot be created
because the OS doesn't support it, then you should give up, just as you
should give up if you try to save a file with a name that includes a
character not allowed by the file system.
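
   As a rough sketch of "fail the call": encode the Unicode name into the
code page the file system expects and raise an error if that is impossible,
rather than silently substituting characters. The function name
encode_filename and the choice of OSError are assumptions made for this
illustration, not anything the Python runtime currently does.

```python
def encode_filename(name, codepage):
    """Encode a Unicode file name for a non-Unicode file system.

    Hypothetical helper: raises OSError instead of replacing characters
    that the target code page cannot represent.
    """
    try:
        return name.encode(codepage)
    except UnicodeEncodeError as exc:
        raise OSError(
            "file name %r cannot be represented in %s" % (name, codepage)
        ) from exc


# A plain ASCII name survives; a Japanese name does not fit in code page 437.
for candidate in ["report.txt", "\u65e5\u672c\u8a9e.txt"]:
    try:
        print(candidate.encode("unicode_escape").decode("ascii"),
              "->", encode_filename(candidate, "cp437"))
    except OSError as err:
        print("giving up:", err)
```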

> > Can Japanese people use Japanese characters in filenames on Windows
> > 95/98?

   Yes.

> > Let's assume they can.  Since the filesystem isn't Unicode
> > aware, the filenames must be encoded.  Which encoding is used?  Let's
> > assume they use Microsoft's multibyte encoding.  If they put such a
> > file on a floppy and ship it to Linköping, what will Fredrik see as
> > the filename?  (I.e., is the encoding fixed by the disk volume, or by
> > the operating system?)

   If Fredrik is running a non-Japanese version of Windows 9x, he will see
some 'random' western characters replacing the Japanese.
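
   The effect can be simulated in Python. This sketch assumes the Japanese
machine wrote the name in code page 932 (Microsoft's Shift-JIS variant) and
that the receiving machine reads the same bytes as code page 437; the real
code pages depend on how each copy of Windows is configured.

```python
# Assumed code pages: cp932 on the writing side, cp437 on the reading side.
WRITER_CODEPAGE = "cp932"
READER_CODEPAGE = "cp437"

# A Japanese file name, written with escapes to keep this listing ASCII-only.
original = "\u65e5\u672c\u8a9e.txt"

# The bytes that end up in the directory entry on the floppy.
stored = original.encode(WRITER_CODEPAGE)

# What a machine using a different code page makes of those same bytes.
seen = stored.decode(READER_CODEPAGE)

print("stored bytes:", stored.hex())
print("seen as:", seen.encode("unicode_escape").decode("ascii"))

# The bytes themselves are undamaged, so the name is recoverable by anyone
# who knows which code page wrote it.
recovered = seen.encode(READER_CODEPAGE).decode(WRITER_CODEPAGE)
```

   The ASCII part of the name (".txt") comes through unchanged, since both
code pages agree on that range; only the double-byte characters turn into
unrelated Western ones.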

   Neil