[Pythonmac-SIG] Filename encodings on the Mac

Mon, 9 Jul 2001 16:59:44 -0700

On Sunday, July 8, 2001, at 01:56 PM, Jack Jansen wrote:

> But there's still a problem with the multibyte system fonts, I think.
> If MacPython knows there's no Python unicode codec for the current
> encoding it pretends that 8bit characters are MacRoman. So, passing a
> correct unicode Japanese filename to open() will cause it to fail if
> there are non-ascii characters in there: the Python unicode->macroman
> converter will complain that the characters are not available in the
> macroman set. Returning MacRoman is my guess, the alternative is
> returning "ascii", which will only allow 7bit characters. If people
> using multibyte systems (or single-byte systems for an encoding for
> which no Python unicode codec yet exists) feel that returning ascii
> would be a better idea: let me know. Or better, let's discuss this on
> the mailing list.

7-bit ASCII is a subset of all of the Mac encodings.  And ASCII is 
represented the same way in all Mac encodings (as the single byte values 
0-127).  So, if you can convert the Unicode to 7-bit ASCII, you can pass 
that ASCII into the open() call regardless of the current default 
encoding or font.

Using Unicode->MacRoman when the current encoding isn't MacRoman can 
lead to gibberish in the name, or errors from open().  Byte values 
128-255 mean different things in different Mac encodings.  In MacRoman, 
they are individual non-ASCII characters.  In other encodings, they may 
be invalid (the other encodings don't always use all possible byte 
values), they may be different characters than the MacRoman characters 
with the same byte value, or they may be a part of a two-byte character.
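
To make the byte-value point concrete, here is a small sketch in present-day Python 3 (not the MacPython being discussed). It uses the standard shift_jis codec as a stand-in for MacJapanese. That is an assumption on my part: the two encodings are closely related but not identical.

```python
# The same bytes read differently depending on which encoding you assume.
# shift_jis stands in for MacJapanese here (an assumption, see above).

raw = "\u65e5\u672c".encode("shift_jis")   # "Nihon": two double-byte characters

as_japanese = raw.decode("shift_jis")      # two characters, as intended
as_roman = raw.decode("mac_roman")         # each byte becomes its own character

print(len(raw), len(as_japanese), len(as_roman))

# Going the other way: a single MacRoman byte above 127 can be the first
# half of a two-byte character, so on its own it is not valid Japanese text.
e_acute = "\u00e9".encode("mac_roman")     # e with acute accent, one byte
try:
    e_acute.decode("shift_jis")
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

print(lone_byte_ok)
```

The first half shows the "different characters" and "part of a two-byte character" cases; the second half shows the "invalid" case.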

If you don't have a codec for the current encoding, and the Unicode 
won't convert to 7-bit ASCII, then you should probably raise an error 
(rather than generate the wrong filename).
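
The policy I'm suggesting could look roughly like the sketch below, again in present-day Python 3. The function name encode_filename and the encoding name "no_such_mac_encoding" are made up for illustration; how MacPython actually determines the current encoding is not shown.

```python
import codecs


def encode_filename(name: str, current_encoding: str) -> bytes:
    """Encode a Unicode filename for a system whose default text encoding
    is current_encoding (hypothetical helper, for illustration only)."""
    try:
        codecs.lookup(current_encoding)
    except LookupError:
        # No codec for this encoding. Only 7-bit ASCII is safe, because it
        # is represented the same way in every Mac encoding.
        try:
            return name.encode("ascii")
        except UnicodeEncodeError as exc:
            # Refuse rather than guess MacRoman and create a wrong name.
            raise ValueError(
                "no codec for %r and filename is not pure ASCII"
                % current_encoding
            ) from exc
    # A codec exists: use it. Characters the encoding cannot represent
    # still raise UnicodeEncodeError, which is the right outcome.
    return name.encode(current_encoding)


print(encode_filename("report.txt", "no_such_mac_encoding"))
print(encode_filename("caf\u00e9", "mac_roman"))
```

The point of the design is that a failure is visible to the caller, while a silently wrong filename is not.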

> It must be possible to create a multibyte MacJapanese <-> Unicode
> codec with the Python unicode infrastructure: after all there's a
> utf-8 codec too, and that's also a multibyte encoding. But I'm
> completely out of my water here. If someone wants to create one and
> contribute it I'll gladly try and have it incorporated in the standard
> distribution, and I can put people into contact with the Python
> unicode gurus, but that's about as much as I can promise.

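Yeah, it should be possible.  Note, though, that a UTF-8 <-> UTF-16 
conversion can be a very simple algorithmic conversion that doesn't 
require you to actually understand the characters being converted.  A 
MacJapanese <-> Unicode codec, by contrast, needs mapping tables, so the 
existence of a UTF-8 codec doesn't mean a MacJapanese one is equally 
easy.  And if I remember correctly, anything that can be represented in 
UTF-16 can be represented in UTF-8, and vice versa.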
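
As a rough illustration of what "algorithmic" means here, the sketch below converts UTF-8 bytes to UTF-16 code units with nothing but bit arithmetic and no character tables. It is present-day Python 3, the function name is made up, and it assumes well-formed input (a real converter would also have to reject malformed sequences).

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Convert well-formed UTF-8 bytes to a list of UTF-16 code units.
    Illustrative only: performs no validation of the input."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:          # 0xxxxxxx: one byte
            cp, extra = lead, 0
        elif lead < 0xE0:        # 110xxxxx: one continuation byte follows
            cp, extra = lead & 0x1F, 1
        elif lead < 0xF0:        # 1110xxxx: two continuation bytes follow
            cp, extra = lead & 0x0F, 2
        else:                    # 11110xxx: three continuation bytes follow
            cp, extra = lead & 0x07, 3
        # Each continuation byte (10xxxxxx) contributes six more bits.
        for cont in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)
        i += 1 + extra
        if cp < 0x10000:
            units.append(cp)     # fits in a single 16-bit unit
        else:
            cp -= 0x10000        # split into a surrogate pair
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units


sample = "A\u00e9\u65e5\U0001F600"   # 1-, 2-, 3- and 4-byte UTF-8 sequences
print([hex(u) for u in utf8_to_utf16_units(sample.encode("utf-8"))])
```

Nothing in it knows what any character is, which is exactly what a MacJapanese converter cannot get away with.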

Unfortunately, I only know enough about Unicode to be dangerous (and to 
call the Mac OS Unicode Converter from inside the File Manager).  I 
could probably supply some code snippets to show one way to call the 
Unicode Converter.  And I could probably put people in contact with the 
Apple folks who do the Unicode Converter and Text Encoding Converter 
(maybe even facilitate getting some tables for the conversions).  But I 
haven't looked into the Python Unicode stuff at all.

-Mark