Python Unicode handling wins again -- mostly

Fri Nov 29 19:44:13 EST 2013

There's a recent blog post complaining about the lousy support for 
Unicode text in most programming languages:

http://mortoray.com/2013/11/27/the-string-type-is-broken/

The author, Mortoray, gives nine basic tests to understand how well the 
string type in a language works. The first four involve "user-perceived 
characters", also known as grapheme clusters.

(1) Does the decomposed string "noe\u0308l" print correctly? Notice that 
the accented letter ë has been decomposed into a pair of code points, 
U+0065 (LATIN SMALL LETTER E) and U+0308 (COMBINING DIAERESIS).

Python 3.3 passes this test:

py> print("noe\u0308l")
noël

although I expect that depends on the terminal you are running in.

(2) If you reverse that string, does it give "lëon"? The implication of 
this question is that strings should operate on grapheme clusters rather 
than code points. Python fails this test:

py> print("noe\u0308l"[::-1])
l̈eon

Some terminals may display the umlaut over the l, or following the l.

I'm not completely sure it is fair to expect a string type to operate on 
grapheme clusters (a base character plus any combining marks attached to 
it) as the author expects. I think that is going above and beyond what a 
basic string type should be expected to do. I would, however, expect a 
solid Unicode implementation to include support for grapheme clusters 
somewhere, and in that regard Python is lacking functionality.

(3) What are the first three characters? The author suggests that the 
answer should be "noë", in which case Python fails again:

py> print("noe\u0308l"[:3])
noe

but again I'm not convinced that slicing should operate across decomposed 
strings in this way. Surely the point of decomposing the string like that 
is in order to count the base character e and the accent "\u0308" 
separately?

(4) Likewise, what is the length of the decomposed string? The author 
expects 4, but Python gives 5:

py> len("noe\u0308l")
5

So far, Python passes only one of the four tests, but I'm not convinced 
that the three failed tests are fair for a string type. If strings 
operated on grapheme clusters, these would be good tests, but it is not a 
given that strings should.
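
To make the difference concrete, here is a rough sketch of what 
cluster-aware length, slicing and reversal could look like. It is only an 
approximation: the helper name simple_clusters is my own invention, and it 
merely glues combining marks (as reported by unicodedata.combining) onto 
the preceding character. Real grapheme cluster segmentation, as described 
in UAX #29, has many more rules than this.

```python
import unicodedata

def simple_clusters(text):
    """Split text into base characters with their trailing combining marks.

    Hypothetical helper: handles only characters with a non-zero canonical
    combining class, which is a small subset of the UAX #29 rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Attach the combining mark to the cluster it modifies.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "noe\u0308l"
clusters = simple_clusters(s)
print(len(clusters))                                 # 4
print("".join(clusters[:3]) == "noe\u0308")          # True
print("".join(reversed(clusters)) == "le\u0308on")   # True
```

With this, the decomposed string has length 4, the first three clusters 
are "noë", and the reversal is "lëon", which is what tests 2 to 4 ask for.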

The next few tests have to do with characters in the Supplementary 
Multilingual Plane, and this is where Python 3.3 shines. (In older 
versions, wide builds would also pass, but narrow builds would fail.)

(5) What is the length of "😸😾"?

Both characters, U+1F638 (GRINNING CAT FACE WITH SMILING EYES) and U+1F63E 
(POUTING CAT FACE), are outside the Basic Multilingual Plane, which means 
each needs two 16-bit code units (a surrogate pair) in UTF-16. Most 
programming languages using UTF-16 encodings internally (including 
JavaScript and Java) count code units and so fail this test. Python 3.3 
passes:

py> s = '😸😾'
py> len(s)
2

(Older versions of Python distinguished between *narrow builds*, which 
used UTF-16 internally, and *wide builds*, which used UTF-32. Narrow 
builds would also fail this test.)

This makes Python one of a very few programming languages which can 
easily handle so-called "astral characters" from the supplementary 
planes while still having O(1) indexing operations.
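
For comparison, you can see what a UTF-16 language would count by encoding 
the string and counting 16-bit code units. This is just an illustration; 
utf16_length is a made-up helper, not something in the standard library.

```python
def utf16_length(text):
    """Number of 16-bit code units needed to encode text as UTF-16.

    Hypothetical helper, for illustration only.
    """
    # Each UTF-16 code unit is two bytes; the -le variant adds no BOM.
    return len(text.encode("utf-16-le")) // 2

s = "\U0001F638\U0001F63E"
print(len(s))            # 2 on Python 3.3 or later
print(utf16_length(s))   # 4
```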

(6) What is the substring after the first character? The right answer is 
a single character, POUTING CAT FACE, and Python gets that correct:

py> import unicodedata
py> unicodedata.name(s[1:])
'POUTING CAT FACE'

UTF-16 languages invariably end up with broken, invalid strings 
containing half of a surrogate pair.

(7) What is the reverse of the string? 

Python passes this test too:

py> print(s[::-1])
😾😸
py> for c in s[::-1]:
...     unicodedata.name(c)
...
'POUTING CAT FACE'
'GRINNING CAT FACE WITH SMILING EYES'

UTF-16-based languages typically break, again getting invalid strings 
containing surrogates in the wrong order.
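
The breakage can be simulated in Python by reversing the string one 16-bit 
code unit at a time, which is roughly what a naive reverse does in a 
UTF-16 language. The helper name reverse_utf16_units is hypothetical, and 
this is only a sketch of the failure mode, not a claim about how any 
particular language implements reversal.

```python
import struct

def reverse_utf16_units(text):
    """Return the UTF-16-LE bytes of text with the code units reversed.

    Hypothetical helper: mimics reversing a string code unit by code unit.
    """
    data = text.encode("utf-16-le")
    count = len(data) // 2
    units = struct.unpack("<%dH" % count, data)
    return struct.pack("<%dH" % count, *reversed(units))

s = "\U0001F638\U0001F63E"
broken = reverse_utf16_units(s)

try:
    broken.decode("utf-16-le")
    valid = True
except UnicodeDecodeError:
    # Each low surrogate now comes before its high surrogate.
    valid = False

print(valid)   # False
```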

The next test involves ligatures. Ligatures are pairs, or triples, of 
characters which have been moved closer together in order to look better. 
Normally you would expect the type-setter to handle ligatures by 
adjusting the spacing between characters, but there are a few pairs (such 
as "fi" <=> "ﬁ") where type designers provided them as custom-designed 
single characters, and Unicode includes them as legacy characters.

(8) What's the uppercase of "baffle" spelled with an ffl ligature?

Like most other languages, Python 3.2 fails:

py> 'baﬄe'.upper()
'BAﬄE'

but Python 3.3 passes:

py> 'baﬄe'.upper()
'BAFFLE'
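
One consequence worth noticing is that case conversion can change the 
length of a string, so code which assumes len(s) == len(s.upper()) is not 
safe. A small check, assuming Python 3.3 or later:

```python
s = "ba\uFB04e"            # U+FB04 is LATIN SMALL LIGATURE FFL
upper = s.upper()

print(upper)               # BAFFLE
print(len(s), len(upper))  # 4 6
```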

Lastly, Mortoray returns to noël, and compares the composed and 
decomposed versions of the string:

(9) Does "noël" equal "noe\u0308l"?

Python (correctly, in my opinion) reports that they do not:

py> "noël" == "noe\u0308l"
False

Again, one might argue whether a string type should report these as equal 
or not, but I believe Python is doing the right thing here. As the author 
points out, any decent Unicode-aware language should at least offer the 
ability to convert between normalisation forms, and Python passes this 
test:

py> unicodedata.normalize("NFD", "noël") == "noe\u0308l"
True
py> "noël" == unicodedata.normalize("NFC", "noe\u0308l")
True
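
If you do want comparisons that ignore the difference between composed and 
decomposed forms, one possible approach is to normalise both sides before 
comparing. The function name same_text is just for illustration, and the 
choice of NFC as the default is an assumption; NFD would work equally well 
for this purpose.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after converting both to one normalisation form.

    Hypothetical helper, not part of the standard library.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "no\u00ebl"      # precomposed U+00EB
decomposed = "noe\u0308l"   # e followed by U+0308 COMBINING DIAERESIS

print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True
```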

Out of the nine tests, Python 3.3 passes six, with three tests being 
failures or dubious. If you believe that the native string type should 
operate on code points, then you'll think that Python does the right 
thing. If you think it should operate on grapheme clusters, as the author 
of the blog post does, then you'll think Python fails those three tests.

A call to arms
==============

As the Unicode Consortium itself acknowledges, sometimes you want to 
operate on an array of code points, and sometimes on an array of 
graphemes ("user-perceived characters"). Python 3.3 is now halfway there, 
having excellent support for code points across the entire Unicode 
character set, not just the BMP.

The next step is to provide either a data type, or a library, for working 
on grapheme clusters. The Unicode Consortium provides a detailed 
discussion of this issue here:

http://www.unicode.org/reports/tr29/

If anyone is looking for a meaty project to work on, providing support 
for grapheme clusters could be it. And if not, hopefully you've learned 
something about Unicode and the limitations of Python's Unicode support.

-- 
Steven