[Python-Dev] Import and unicode: part two

"Martin v. Löwis" martin at v.loewis.de
Mon Jan 24 21:28:58 CET 2011

On 24.01.2011 16:39, Victor Stinner wrote:
> On Monday 24 January 2011 11:35:22, Stephen J. Turnbull wrote:
>> ... VFAT-formatted file systems and Shift JIS file names ...
> I missed something: VFAT stores filenames as unicode (whereas FAT only 
> supports byte filenames). Well, VFAT stores filenames twice: as an 8+3 byte 
> string and as a 255-character unicode (UTF-16-LE) string.

Stephen may not have meant VFAT. Instead, he might have meant FAT32,
or, more likely, exFAT. VFAT is patented by Microsoft, so vendors of
devices using flash memory cards often don't support VFAT.

In any case, file names are encoded in the OEM code page even on VFAT.

> On which OS do you access this VFAT file system? On Windows, you have two 
> APIs: bytes (*A) and wide character (*W). If you use the wide character, there 
> is explicit encoding at all.

Right ("no explicit encoding"). However, this is actually where things
can go wrong: Windows needs to guess the file system's encoding, and will guess it
uses the OEM code page. If the device writing the file system uses a
different OEM code page than the Windows installation reading it, you
get mojibake.
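
To make the mismatch concrete, here is a minimal Python sketch. The
file name and the two code pages (932 on the writing device, 437 on
the reading system) are assumptions chosen for illustration, not
taken from any particular device.

```python
# Hypothetical scenario: a device stores a file name using code page 932
# (Shift JIS), and the reading system assumes OEM code page 437.
name = "日本語.TXT"

on_disk = name.encode("cp932")       # bytes the writing device stores
misread = on_disk.decode("cp437")    # reader guesses its own OEM code page
correct = on_disk.decode("cp932")    # reader uses the writer's code page

print(repr(misread))     # mojibake: one wrong character per byte
print(correct == name)   # the right code page recovers the name
```

The wrong guess never raises an error here, because code page 437
assigns a character to every byte value; the name just comes out
garbled.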

This will actually happen with the *A APIs as well: they do *not* give
you the file name from disk. Instead, Windows converts the OEM
characters on disk to Unicode, and then the Unicode characters to the
ANSI code page.
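
A rough model of that two-step conversion, as a Python sketch. The
helper name ansi_view and the code pages 437 (OEM) and 1252 (ANSI)
are assumptions for illustration, and errors="replace" is only a
stand-in: what Windows really does with characters that are missing
from the ANSI code page may differ.

```python
def ansi_view(disk_bytes, oem="cp437", ansi="cp1252"):
    """Hypothetical model: OEM bytes on disk -> Unicode -> ANSI bytes."""
    text = disk_bytes.decode(oem)                 # step 1: OEM -> Unicode
    return text.encode(ansi, errors="replace")    # step 2: Unicode -> ANSI

raw = "café".encode("cp437")
print(raw)              # bytes as stored on disk in this model
print(ansi_view(raw))   # different bytes reach the application

# A character that exists in the OEM code page but not in the ANSI one
# cannot survive the second step.
print(ansi_view("░".encode("cp437")))
```

So even when the OEM guess is right, the bytes the application sees
are not the bytes on disk.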

> Linux has two mount options to control unicode on 
> a VFAT filesystem: "codepage" for the byte filenames (use Shift JIS here) and 
> "iocharset" for the unicode filenames (I don't understand this option). 
> Anyway, both systems support unicode filenames.

Linux doesn't support "unicode file names". Instead, it can support
UTF-8. As Oleg explains: you need one encoding for the bytes on disk
(to know what they mean, when converted to Unicode), and one encoding
to then convert the "abstract" unicode to bytes again to present to
the application. This is similar to how *A works on Windows.

The iocharset is needed even if the file system is known to use UTF-16
(say, NTFS, VFAT, or Joliet).
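
The reason, sketched in Python: even when the on-disk encoding is
known, the application still receives bytes, so some encoding has to
be picked for them. The helper name is hypothetical, and the
iocharset parameter merely mimics the role of the mount option.

```python
def name_for_application(utf16_name, iocharset="utf-8"):
    """Hypothetical model: UTF-16 name on disk -> bytes for the application."""
    text = utf16_name.decode("utf-16-le")              # on-disk form is known
    return text.encode(iocharset, errors="replace")    # iocharset picks the bytes

stored = "Löwis.txt".encode("utf-16-le")
print(name_for_application(stored))                         # UTF-8 bytes
print(name_for_application(stored, iocharset="iso8859-1"))  # Latin-1 bytes
```

The same stored name yields different byte strings depending on the
second encoding, which is why it has to be specified.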

