[Python-Dev] test_unicode_file failing on Mac OS X

Martin v. Löwis martin at v.loewis.de
Sun Dec 7 12:56:54 EST 2003

Jack Jansen <Jack.Jansen at cwi.nl> writes:

> This is probably related to the two flavors of unicode there are, one
> which prefers to have all accents separately from the letters as much
> as possible and one which prefers the reverse. I keep forgetting the
> names of the two, they're somewhat silly.

OS X uses what is called the "decomposed normal form" (Normalization
Form D, NFD), splitting precomposed characters into the base character
followed by the combining accent.

Python supports either form, but will use precomposed characters more
often than not.

> And while there are algorithms to convert the combined form of unicode
> to the uncombined form and vice versa there are no Python codecs to do
> this. 

Not as a codec, but as unicodedata.normalize. If you do

unicodedata.normalize("NFD", composed_string)

you get the string that OS X wants you to use.
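
As a minimal sketch of what that call does (assuming Python 3 string
literals; the variable names are only illustrative), note that the form
name is the first argument and the string the second:

```python
import unicodedata

# Precomposed form: a single code point, LATIN SMALL LETTER E WITH ACUTE.
composed = "\u00e9"

# NFD splits it into the base letter plus COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", composed)

# Show the code points of the decomposed string.
print([hex(ord(c)) for c in decomposed])

# NFC goes the other way and recombines the pair.
recomposed = unicodedata.normalize("NFC", decomposed)
```

The two strings render identically but compare unequal as Python
strings, which is why a plain equality check on file names can fail.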

On Windows, the story is mostly the reverse. NTFS/Win32 does not
perform any normalization, so you can actually store both the
precomposed and the decomposed name as separate entries in the same
directory (which is confusing). The platform codecs always generate
the precomposed form, though, so you are more likely to find the
precomposed form on disk.

For the test, it would be best to compare normal forms, and have the
test pass if the normal forms (NFD) are equal.
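
A rough sketch of such a comparison, assuming Python 3 (the helper
same_name_nfd and the sample file names are hypothetical, not taken
from test_unicode_file itself):

```python
import unicodedata

def same_name_nfd(a, b):
    """Return True if two names are equal after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Hypothetical names: what the test created vs. what the OS reports back.
expected = "caf\u00e9.txt"   # precomposed e-acute
reported = "cafe\u0301.txt"  # decomposed: 'e' + combining acute accent

# A plain comparison fails, the normalized comparison succeeds.
plain_equal = (expected == reported)
normalized_equal = same_name_nfd(expected, reported)
```

Normalizing both sides means the check does not depend on which form
the file system happens to return.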

