Python Unicode handling wins again -- mostly
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Fri Nov 29 19:44:13 EST 2013
There's a recent blog post complaining about the lousy support for
Unicode text in most programming languages:
http://mortoray.com/2013/11/27/the-string-type-is-broken/
The author, Mortoray, gives nine basic tests to understand how well the
string type in a language works. The first four involve "user-perceived
characters", also known as grapheme clusters.
(1) Does the decomposed string "noe\u0308l" print correctly? Notice that
the accented letter ë has been decomposed into a pair of code points,
U+0065 (LATIN SMALL LETTER E) and U+0308 (COMBINING DIAERESIS).
Python 3.3 passes this test:
py> print("noe\u0308l")
noël
although I expect that depends on the terminal you are running in.
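If the terminal makes it hard to tell what is actually in the string, one way to check is to list the code points by name. This is only an illustrative sketch; the variable names are mine, not from the blog post.

```python
import unicodedata

decomposed = "noe\u0308l"

# One name per code point: the base letter e and the combining
# diaeresis show up as two separate entries.
names = [unicodedata.name(c) for c in decomposed]
for name in names:
    print(name)
```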
(2) If you reverse that string, does it give "lëon"? The implication of
this question is that strings should operate on grapheme clusters rather
than code points. Python fails this test:
py> print("noe\u0308l"[::-1])
l̈eon
Some terminals may display the umlaut over the l, or following the l.
I'm not completely sure it is fair to expect a string type to operate on
grapheme clusters (a base character plus any combining characters
attached to it) as the author expects. I think that is going above and
beyond what a basic string type should be expected to do. That said, I
would expect a solid Unicode implementation to include support for
grapheme clusters somewhere, and in that regard Python is lacking
functionality.
(3) What are the first three characters? The author suggests that the
answer should be "noë", in which case Python fails again:
py> print("noe\u0308l"[:3])
noe
but again I'm not convinced that slicing should operate across decomposed
strings in this way. Surely the point of decomposing the string like that
is in order to count the base character e and the accent "\u0308"
separately?
(4) Likewise, what is the length of the decomposed string? The author
expects 4, but Python gives 5:
py> len("noe\u0308l")
5
So far, Python passes only one of the four tests, but I'm not convinced
that the three failed tests are fair for a string type. If strings
operated on grapheme clusters, these would be good tests, but it is not a
given that strings should.
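For what it's worth, the answers the author expects for tests 2 to 4 can be approximated in a few lines. The sketch below glues each combining mark onto the preceding base character, using unicodedata.combining. The function name simple_clusters is hypothetical, and this is a deliberate simplification: it is not the full grapheme cluster algorithm from the Unicode standard, and it will get things like Hangul jamo and emoji sequences wrong.

```python
import unicodedata

def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    A rough approximation only, not the full Unicode segmentation rules.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

decomposed = "noe\u0308l"
clusters = simple_clusters(decomposed)

reversed_text = "".join(reversed(clusters))   # test 2
first_three = "".join(clusters[:3])           # test 3
cluster_count = len(clusters)                 # test 4
```

With this helper the reversed string keeps the diaeresis on the e, the first three "characters" include the accent, and the count is 4 rather than 5.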
The next few tests have to do with characters in the Supplementary
Multilingual Plane, and this is where Python 3.3 shines. (In older
versions, wide builds would also pass, but narrow builds would fail.)
(5) What is the length of "😸😾"?
Both characters, U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E
(POUTING CAT FACE), are outside the Basic Multilingual Plane, which means
neither fits in a single 16-bit code unit; UTF-16 needs a surrogate pair
for each. Most programming languages using UTF-16 encodings internally
(including JavaScript and Java) count code units rather than characters
and so fail this test. Python 3.3 passes:
py> s = '😸😾'
py> len(s)
2
(Older versions of Python distinguished between *narrow builds*, which
used UTF-16 internally and *wide builds*, which used UTF-32. Narrow
builds would also fail this test.)
This makes Python one of a very few programming languages which can
easily handle so-called "astral characters" from the supplementary
planes while still having O(1) indexing operations.
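To see where the UTF-16 languages get their answer from, you can compare the number of code points with the number of 16-bit code units in the encoded form. This is just a sketch run from Python; it says nothing about how any particular language implements its strings.

```python
s = "\U0001F638\U0001F63E"   # the two cat faces from test 5

code_points = len(s)
# Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte order mark.
utf16_units = len(s.encode("utf-16-le")) // 2
utf8_bytes = len(s.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)
```

A language that reports string length in UTF-16 code units will count each cat face twice.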
(6) What is the substring after the first character? The right answer is
a single character POUTING CAT FACE, and Python gets that correct:
py> import unicodedata
py> unicodedata.name(s[1:])
'POUTING CAT FACE'
UTF-16 languages invariably end up with broken, invalid strings
containing half of a surrogate pair.
(7) What is the reverse of the string?
Python passes this test too:
py> print(s[::-1])
😾😸
py> for c in s[::-1]:
...     unicodedata.name(c)
...
'POUTING CAT FACE'
'GRINNING CAT FACE WITH SMILING EYES'
UTF-16 based languages typically break, again getting invalid strings
containing the halves of each surrogate pair in the wrong order.
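The breakage in tests 6 and 7 can be simulated from Python by doing the slicing and reversing on the UTF-16 code units instead of on the string. The helper name decodes_cleanly is hypothetical, and this only imitates naive code-unit indexing; real UTF-16 languages differ in how they report or hide the damage.

```python
s = "\U0001F638\U0001F63E"
units = s.encode("utf-16-le")   # four 16-bit code units, eight bytes

def decodes_cleanly(data):
    """Return True if data is valid UTF-16 (little-endian)."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

# Test 6 done naively: drop the first code unit, not the first character.
after_first_unit = units[2:]

# Test 7 done naively: reverse the code units one by one.
pairs = [units[i:i + 2] for i in range(0, len(units), 2)]
reversed_units = b"".join(reversed(pairs))

original_ok = decodes_cleanly(units)
slice_ok = decodes_cleanly(after_first_unit)
reverse_ok = decodes_cleanly(reversed_units)
```

The untouched units decode fine, but both the naive slice and the naive reversal leave lone or misordered surrogates that a strict decoder rejects.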
The next test involves ligatures. Ligatures are pairs, or triples, of
characters which have been moved closer together in order to look better.
Normally you would expect the type-setter to handle ligatures by
adjusting the spacing between characters, but there are a few pairs (such
as "fi" <=> "ﬁ") where type designers provided them as custom-designed
single characters, and Unicode includes them as legacy characters.
(8) What's the uppercase of "baffle" spelled with an ffl ligature?
Like most other languages, Python 3.2 fails:
py> 'baﬄe'.upper()
'BAﬄE'
but Python 3.3 passes:
py> 'baﬄe'.upper()
'BAFFLE'
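Because the ligature is easy to lose when copying and pasting, here is the same test written with an escape sequence. It assumes Python 3.3 or later; the variable names are only for illustration.

```python
import unicodedata

ligature = "\ufb04"
word = "ba" + ligature + "e"

name = unicodedata.name(ligature)
upper = word.upper()

# Uppercasing expands the single ligature into three letters,
# so the result is longer than the input.
print(name, len(word), upper, len(upper))
```

Note that upper() changing the length of a string is exactly what you want here, even if it is surprising at first.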
Lastly, Mortoray returns to noël, and compares the composed and
decomposed versions of the string:
(9) Does "noël" equal "noe\u0308l"?
Python (correctly, in my opinion) reports that they do not:
py> "noël" == "noe\u0308l"
False
Again, one might argue about whether a string type should report these
as equal or not, but I believe Python is doing the right thing here. As
the author points out, any decent Unicode-aware language should at least
offer the ability to convert between normalisation forms, and Python
passes this test:
py> unicodedata.normalize("NFD", "noël") == "noe\u0308l"
True
py> "noël" == unicodedata.normalize("NFC", "noe\u0308l")
True
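If you do want the composed and decomposed spellings to compare equal, it is easy enough to wrap the normalisation step in a helper. The name nfc_equal is hypothetical, and choosing NFC rather than NFD is arbitrary here; what matters is that both sides go through the same form.

```python
import unicodedata

def nfc_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "no\u00ebl"       # single code point for the accented e
decomposed = "noe\u0308l"    # e followed by COMBINING DIAERESIS

plain_equal = composed == decomposed
normalised_equal = nfc_equal(composed, decomposed)
```

The plain comparison still says the strings differ, while the normalised comparison treats them as the same text.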
Out of the nine tests, Python 3.3 passes six, with three tests being
failures or dubious. If you believe that the native string type should
operate on code-points, then you'll think that Python does the right
thing. If you think it should operate on grapheme clusters, as the author
of the blog post does, then you'll think Python fails those three tests.
A call to arms
==============
As the Unicode Consortium itself acknowledges, sometimes you want to
operate on an array of code points, and sometimes on an array of
graphemes ("user-perceived characters"). Python 3.3 is now halfway there,
having excellent support for code-points across the entire Unicode
character set, not just the BMP.
The next step is to provide either a data type, or a library, for working
on grapheme clusters. The Unicode Consortium provides a detailed
discussion of this issue here:
http://www.unicode.org/reports/tr29/
If anyone is looking for a meaty project to work on, providing support
for grapheme clusters could be it. And if not, hopefully you've learned
something about Unicode and the limitations of Python's Unicode support.
--
Steven