[Pythonmac-SIG] Filename encodings on the Mac

Mark Day mday@mac.com
Sat, 07 Jul 2001 18:15:44 -0700


on 7/4/01 3:08 PM, Jack Jansen at jack@oratrix.nl wrote:

> In the process of getting better unicode support I'm now looking at
> converting unicode strings to 8-bit strings correctly for filenames.
> 
> Python has hooks to do this, but I must say I'm rather unsure how
> MacOS handles encodings. If I create a disk on a macroman system and
> then bring it to a macgreek system, what will happen to my filenames?
> Is it only the current script/language setting that influences how
> filenames are interpreted, or does the disk (filename?) actually store
> the fact that it was created on a macroman system?

HFS volumes store filenames as a sequence of one to 31 bytes.  There is no
encoding stored with the name.  How those bytes are interpreted is up to
the application.
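
A rough Python sketch of that point, using the mac_roman and mac_greek
codecs from Python's standard library.  The helper hfs_name_bytes is
hypothetical and only models the 31-byte limit; it is not a File Manager
call.

```python
# The same stored bytes read back under two different Mac encodings.

def hfs_name_bytes(name, encoding):
    """Hypothetical helper: encode a name as the raw bytes HFS would store."""
    raw = name.encode(encoding)
    if not 1 <= len(raw) <= 31:
        raise ValueError("HFS names are 1 to 31 bytes long")
    return raw

# Name typed on a MacRoman system.
raw = hfs_name_bytes("Ångström", "mac_roman")

# Nothing in `raw` records which encoding produced it, so whoever reads
# the bytes has to pick one.
as_roman = raw.decode("mac_roman")
as_greek = raw.decode("mac_greek")

# ascii() keeps the output printable on any terminal.
print(ascii(as_roman))
print(ascii(as_greek))
print(as_roman == as_greek)
```

The bytes are identical in both cases; only the interpretation changes.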

HFS Plus volumes store filenames as a sequence of one to 255 16-bit Unicode
characters.  The catalog record contains a text encoding hint that suggests
a Mac text encoding that will preserve the filename bytes when using the HFS
APIs (e.g., HCreate and PBGetCatInfo).
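
To make the role of the hint concrete, here is a small Python model.  The
class CatalogName and the function create_from_bytes are hypothetical
names invented for this sketch, and a Python codec name stands in for the
Mac text encoding; none of this is the real on-disk layout.

```python
from dataclasses import dataclass

@dataclass
class CatalogName:
    """Hypothetical stand-in for the name fields of an HFS Plus catalog record."""
    unicode_name: str    # stored on disk as 16-bit Unicode
    encoding_hint: str   # Python codec name standing in for the Mac text encoding

    def legacy_bytes(self):
        """Bytes a byte-oriented HFS-style call would hand back, per the hint."""
        return self.unicode_name.encode(self.encoding_hint)

def create_from_bytes(raw, default_encoding):
    """Model an HFS-style create: bytes in, Unicode plus hint stored."""
    name = raw.decode(default_encoding)
    units = len(name.encode("utf-16-be")) // 2   # count of 16-bit units
    if not 1 <= units <= 255:
        raise ValueError("HFS Plus names are 1 to 255 16-bit units long")
    return CatalogName(name, default_encoding)

original = "Ångström".encode("mac_roman")
record = create_from_bytes(original, "mac_roman")

# The hint is what lets the original bytes come back unchanged.
print(record.legacy_bytes() == original)
```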

The File Manager maintains a default text encoding, which it uses whenever
it needs to convert input names to or from Unicode.  This
default encoding is usually the text encoding associated with the Finder's
view font, but Installer can change the default encoding when installing a
localized OS (so that the encodings stored on an HFS Plus volume are
correct).

Suppose you're running Mac OS 9 or earlier, and Finder's view font is a
MacRoman font (such as Geneva).  Create or rename a file or folder in the
Finder.  On an HFS volume, those characters you typed are converted to bytes
using the MacRoman encoding; those bytes are stored in the catalog.  On an
HFS Plus volume, those bytes get converted to Unicode using the MacRoman
encoding, and the text encoding hint is set to MacRoman.  Note that MacRoman
uses one byte per character, and all 256 byte values represent a valid
character in the encoding.

Now suppose you change the view font to Osaka.  The text encoding associated
with Osaka is MacJapanese, so the File Manager's default encoding is set to
MacJapanese.  When the Finder displays an HFS filename, the string of bytes
on disk is returned unchanged by PBGetCatInfo, and Finder displays it as
if it were in MacJapanese encoding (regardless of the encoding used when
the name was created or renamed!).  On an HFS Plus volume, the PBGetCatInfo
call converts the Unicode on disk to a string of bytes based on the text
encoding stored in that catalog record, and Finder displays those bytes as
if they are in MacJapanese.  Finder in Mac OS 9 and earlier does not use the
HFS Plus APIs except for copying files, so it doesn't know the text encoding
hint associated with the name.

Note that MacJapanese uses either one or two bytes per character, and that
some sequences of bytes do not result in valid MacJapanese characters.  As
it happens, plain old ASCII is a subset of all of the Mac encodings.  If you
created a name using MacRoman, and it contains non-ASCII characters (such as
accented characters), those non-ASCII characters may or may not result in
valid (meaningful) Japanese characters.  Typically the non-ASCII character
and the following (MacRoman) character are converted to a single (two-byte)
MacJapanese character.
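
The pairing effect can be tried in Python.  One loud assumption: Python
ships no MacJapanese codec, so shift_jis is used below as an approximation
of it.  The overall behavior (lead byte plus following byte becoming one
character) is the same idea, but individual characters may differ from
what a real Mac would show.

```python
# shift_jis is only a stand-in for MacJapanese (assumption of this sketch).
JAPANESE = "shift_jis"

raw = "Résumés".encode("mac_roman")   # 7 characters, 7 bytes
shown = raw.decode(JAPANESE)          # how a Japanese view font would read them

# Each accented byte (0x8E) pairs up with the "s" after it, so two
# MacRoman characters collapse into one two-byte character.
print(len(raw), len(shown))
print(ascii(shown))

# A name that ends in a lone lead byte is not a valid sequence at all.
try:
    "café".encode("mac_roman").decode(JAPANESE)
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)
```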

If you created the filename using MacJapanese, the MacJapanese characters
would be converted to a sequence of bytes on HFS, or to Unicode plus a
MacJapanese text encoding hint on HFS Plus.  If you switch to MacRoman, the
File Manager
returns the same sequence of bytes, and non-ASCII characters from the
MacJapanese name get displayed as two characters (the first one being
non-ASCII).
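
The reverse direction, again as a Python sketch with shift_jis assumed as
a stand-in for MacJapanese (Python has no MacJapanese codec):

```python
# shift_jis is only a stand-in for MacJapanese (assumption of this sketch).
JAPANESE = "shift_jis"

name = "日本語"                    # three two-byte characters
raw = name.encode(JAPANESE)
shown = raw.decode("mac_roman")    # MacRoman accepts every byte value

# Three characters went in, six come out.
print(len(name), len(raw), len(shown))

# The first character of each pair comes from a lead byte, so it is
# always non-ASCII.
lead_chars = shown[0::2]
print(all(ord(ch) > 127 for ch in lead_chars))
```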

The Mac OS X Finder uses the HFS Plus APIs (e.g., FSGetCatalogInfo), so it
uses Unicode when calling the File Manager or displaying names.  On HFS Plus
volumes, that Unicode is passed back and forth essentially unchanged
(Unicode allows the same string to be encoded using different sequences of
Unicode code points, and HFS Plus always stores names on disk in one
canonical form).  On HFS volumes, those Unicode strings are converted to a
(single) Mac encoding when being stored, and converted from a Mac encoding
to Unicode when coming back from the File Manager.
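
The parenthetical about canonical forms can be illustrated with Python's
unicodedata module.  Which form HFS Plus uses is not spelled out above;
NFD is assumed here purely as a stand-in for "one canonical form", and
the function canonical is a hypothetical helper.

```python
import unicodedata

composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Same visible character, different sequences of code points.
print(composed == decomposed)

def canonical(name):
    """Hypothetical helper: NFD stands in for the volume's canonical form."""
    return unicodedata.normalize("NFD", name)

# After normalizing, both spellings compare equal, which is why a name
# may come back "essentially" rather than exactly unchanged.
print(canonical(composed) == canonical(decomposed))
```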

Remember that a Unicode string can contain characters from multiple
languages, and it may be impossible to convert it to a string in any single
Mac encoding.
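
A quick way to see this from Python is to try a name against several Mac
codecs from the standard library.  The function encodings_that_fit and
the codec list are choices made for this sketch; a real Mac has more
encodings than the three tried here.

```python
MAC_CODECS = ["mac_roman", "mac_greek", "mac_cyrillic"]

def encodings_that_fit(name, codecs=MAC_CODECS):
    """Return the codecs (from the given list) that can encode the whole name."""
    fits = []
    for codec in codecs:
        try:
            name.encode(codec)
        except UnicodeEncodeError:
            continue   # at least one character is missing from this encoding
        fits.append(codec)
    return fits

# Plain ASCII fits everywhere, a single-script name fits its own
# encoding, and a mixed Greek/Cyrillic name fits none of the three.
print(encodings_that_fit("readme"))
print(encodings_that_fit("Москва"))
print(encodings_that_fit("Αθήνα Москва"))
```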

Hope this helps.  If you have more questions (and I'm sure you will!), let
me know.

-Mark