[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Tom Christiansen report at bugs.python.org
Sat Aug 27 13:51:49 CEST 2011

Tom Christiansen <tchrist at perl.com> added the comment:

Guido van Rossum <report at bugs.python.org> wrote
   on Sat, 27 Aug 2011 03:26:21 -0000: 

> To me, making (default) iteration deviate from indexing is anathema.

So long as there's a way to iterate through a string some other way
than by code unit, that's fine.  However, the Java way of 16-bit code
units is so annoying because there often aren't code point APIs, and 
so you get a lot of niggling errors creeping in.  This is part of why
I strongly prefer wide builds, so that code point and code unit are the
same thing again.
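
To make the code-unit problem concrete, here's a sketch in Python of
the kind of iterator Guido describes below: walking a sequence of
16-bit code units and joining surrogate pairs back into code points.
The function name is mine, not any stdlib API, and it assumes
well-formed input (no lone surrogates):

```python
def iter_code_points(units):
    """Yield code points from an iterable of 16-bit UTF-16 code units,
    joining surrogate pairs as needed.  Assumes well-formed input."""
    units = iter(units)
    for u in units:
        if 0xD800 <= u <= 0xDBFF:           # high (lead) surrogate
            low = next(units)               # assume its trail follows
            yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
        else:
            yield u
```

For example, U+1F4A9 is the surrogate pair D83D DCA9 in UTF-16, so
iterating [0xD83D, 0xDCA9, 0x41] by code point yields just two values,
0x1F4A9 and 0x41, where code-unit iteration would give you three.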

> However, there is nothing wrong with providing a library function that
> takes a string and returns an iterator that iterates over code points,
> joining surrogate pairs as needed. You could even have one that
> iterates over characters (I think Tom calls them graphemes), if that
> is well-defined and useful.

"Character" can sometimes be a confusing term when it means something
different to us programmers as it does to users.  Code point to mean the
integer is a lot clearer to us but to no one else.  At work I often just
give in and go along with the crowd and say character for the number that
sits in a char or wchar_t or Character variable, even though of course
that's a code point.  I only rebel when they start calling code units 
characters, which (inexperienced) Java people tend to do, because that
leads to surrogate splitting and related errors.

By grapheme I mean something the user perceives as a single character.  In
full Unicodese, this is an extended grapheme cluster.  These are code point
sequences that start with a grapheme base and have zero or more grapheme
extenders following it.  For our purposes, that's *mostly* like saying you
have a non-Mark followed by any number of Mark code points, the main
exception being that a CR followed by a LF also counts as a single grapheme
in Unicode.
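
That "base plus extenders" description can be sketched in a few lines
of Python using only the stdlib.  This is a rough approximation, not
the full UAX #29 algorithm (no Hangul jamo, ZWJ emoji sequences, and
so on), but it's enough to show the idea:

```python
import unicodedata

def clusters(s):
    """Very rough grapheme split: a base code point followed by any
    combining marks, plus the CR LF special case.  Not full UAX #29,
    just enough to illustrate base-plus-extenders."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch                   # extender: attach to base
        elif out and out[-1] == "\r" and ch == "\n":
            out[-1] += ch                   # CR LF is one grapheme
        else:
            out.append(ch)                  # new grapheme base
    return out
```

With this, "contro\x{302}le\x{301}e" splits into 9 graphemes even
though it is 11 code points long.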

If you are in an editor and wanted to swap two "characters", the one 
under the user's cursor and the one next to it, you have to deal with
graphemes not individual code points, or else you'd get the wrong answer.
Imagine swapping the last two characters of the first string below,
or the first two characters of second one:

    contrôlée    contro\x{302}le\x{301}e
    élève        e\x{301}le\x{300}ve        
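
Here's that failure mode spelled out in Python (the string and the
indices are hardcoded for this one example).  Swapping the first two
*code points* of the decomposed "élève" strands the acute accent at
the front of the string with no base character under it; swapping the
first two *graphemes* keeps the accent glued to its e:

```python
import unicodedata

s = "e\u0301le\u0300ve"                 # élève, fully decomposed

# Swap the first two code points: the COMBINING ACUTE ACCENT ends up
# first in the string, ahead of any base character.  Wrong answer.
naive = s[1] + s[0] + s[2:]
assert unicodedata.combining(naive[0]) != 0

# Swap the first two graphemes ("e\u0301" and "l") by hand instead:
right = s[2] + s[0:2] + s[3:]           # "le\u0301e\u0300ve"
```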

While you can sometimes fake a correct answer by considering things
in NFC not NFD, that doesn't work in the general case, as there
are only a few compatibility glyphs for round-tripping for legacy
encodings (like ISO 8859-1) compared with infinitely many combinations
of combining marks.  Particularly in mathematics and in phonetics, 
you often end up using marks on characters for which no pre-combined
variant glyph exists.  Here's the IPA for a couple of Spanish words
with their tight (phonetic, not phonemic) transcriptions:

        anécdota    [a̠ˈne̞ɣ̞ð̞o̞t̪a̠]
        rincón      [rĩŋˈkõ̞n]

        ane\x{301}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinco\x{301}n      [ri\x{303}\x{14B}\x{2C8}ko\x{31E}\x{303}n]

        an\x{E9}cdota    [a\x{320}\x{2C8}ne\x{31E}\x{263}\x{31E}\x{F0}\x{31E}o\x{31E}t\x{32A}a\x{320}]
        rinc\x{F3}n      [r\x{129}\x{14B}\x{2C8}k\x{F5}\x{31E}n]

So combining marks don't "just go away" in NFC, and you really do have to
deal with them.  Notice that to get the tabs right (your favorite subject :),
you have to deal with print widths, which is another place that you get
into trouble if you only count code points.
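
You can check this for yourself with Python's stdlib unicodedata
module: an e with an acute has a precomposed form, so NFC folds it to
one code point, but an e with COMBINING DOWN TACK BELOW (U+031E, used
in the transcriptions above) has no precomposed equivalent, so the
mark survives normalization:

```python
import unicodedata

# "é" has a precomposed form, so NFC collapses it to one code point...
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# ...but e + COMBINING DOWN TACK BELOW has no precomposed form,
# so NFC leaves the combining mark right where it was.
assert unicodedata.normalize("NFC", "e\u031e") == "e\u031e"
```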

BTW, did you know that the stress mark used in the phonetics above
is actually a (modifier) letter in Unicode, not punctuation?

    # uniprops -a 2c8
        \w \pL \p{L_} \p{Lm}
    All Any Alnum Alpha Alphabetic Assigned InSpacingModifierLetters Case_Ignorable CI Common Zyyy Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Modifier_Letter Print Spacing_Modifier_Letters Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
    Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=Spacing_Modifier_Letters Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=BB Line_Break=Break_Before LB=BB Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE Word_Break=LE _Case_Ignorable _X_Begin

That means those would all be matched by \w+, as unlike \p{alpha},
\p{word} includes not just \pL etc but also all the combining marks.
That's how you want it to work, although I think you have to use
regex not re in Python to get that.
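
You can see the difference with the stdlib alone: re's \w stops dead
at the first combining mark, so \w+ can't even match one decomposed
word.  (Matthew Barnett's third-party regex module follows UTS #18,
where \w takes in the marks, so the fullmatch below would succeed
there -- I'm only asserting the stdlib side here.)

```python
import re

word = "e\u0301le\u0300ve"      # élève in NFD

# The stdlib re module's \w does not treat combining marks as word
# characters, so \w+ matches only the first base letter:
assert re.match(r"\w+", word).group() == "e"
assert re.fullmatch(r"\w+", word) is None
```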

Iterating by grapheme is easy in a regex engine that supports \X.
Instead of using "." to match a code point, you use a \X to match
a grapheme.  So the swapping problem goes away, and many others.
To capture a pair of graphemes for swapping you'd use (\X)(\X), and
to grab the first 6 graphemes without breaking them up you'd use \X{6}.
That means to iterate by grapheme you just split up your string one
\X at a time.

Here's a real-world example:

In the vim editor, when you're editing UTF-8 as I am this mail message,
because it is all about user-perceived characters, they actually use "." to
match an entire grapheme.  This is different from the way perl and
everybody else uses "." for a code point, not a grapheme.  If I did s/^.//
or s/.$// in vim, in perl I would need s/^\X// or s/\X$//.  Similarly,
to swap "characters" with the "xp" command, it will grab the entire \X.
Put some of those phonetic transcriptions above into a vim buffer and play
with them to see what I mean.

Imagine using a format like "%-6.6s" on "contrôlée": that should produce
"contrô" not "contro".  That's because code points with the property
Bidi_Class=Non_Spacing_Mark (BC=NSM) do not advance the cursor, they just
stack up.

It gets even worse in that some code points advance the cursor by two
not by zero or one.  These include those with the East_Asian_Width
property value Full or Wide.  And they aren't always Asian characters,
either.  For example, these code points all have the EA=W property, so
take up two print columns:

     〃  U+3003 DITTO MARK
     〜  U+301C WAVE DASH
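
A rough column counter is easy to sketch in Python from the two
properties just mentioned: combining marks take zero columns, East
Asian Wide/Fullwidth characters take two, everything else one.  (A
real wcwidth() handles more cases, control characters among them, so
treat this as an approximation.)

```python
import unicodedata

def columns(s):
    """Approximate print width: 0 columns for combining marks,
    2 for East Asian Wide (W) or Fullwidth (F), 1 for the rest."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):
            pass                                    # marks just stack up
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                              # double-cell glyph
        else:
            width += 1
    return width
```

So columns("contro\x{302}le\x{301}e") is 9 even though the string is
11 code points, and the two marks above each count for 2.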

Perl's built-in string indexing, and hence its substrings, is strictly 
by code point and not by grapheme.  This is really frustrating at times,
because something like this screws up:

    printf "%-6.6", "contrôlée";
    printf "%-6.6", "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";

Logically, those should produce "contrô" and "a̠ˈne̞ɣ̞ð̞", but of course
when considering only code points, they won't.  Well, not unless the 
first is in NFC, but there's no hope for the second.

Perl does have a grapheme cluster string class which provides a way 
to figure out the columns and also allows for substring operation by
grapheme. But it is not at all integrated into anything, which makes 
it tedious to use.

    use Unicode::GCString;  # on CPAN only, not yet in core

    my $string   = "a̠ˈne̞ɣ̞ð̞o̞t̪a̠";
    my $gcstring = Unicode::GCString->new($string);
    my $colwidth = $gcstring->columns;
    if ($colwidth > 6) {
        print $gcstring->substr(0,6);
    } else {
        print $gcstring;
        print " " x (6 - $colwidth);
    }

Isn't that simply horrible?  You *will* get the right answer that way, but
what a pain!  Really, there needs to be a way for the built-in formatters
to understand graphemes.  But first, I think, you have to have the regex
engine understand them.  Matthew's regex does, because it supports \X.

There's a lot more to dealing with Unicode text than just extending the
character repertoire.  How much should be fundamental to the language and how
much should be relegated to modules isn't always clear.  I do know I've had
to rewrite a *lot* of standard Unix tools to deal with Unicode properly.
For the wc(1) rewrite I only needed to consider graphemes with \X and 
Unicode line break sequences with \R, but other tools need better smarts.
For example, just getting the fmt(1) rewrite to wrap lines in paragraphs 
correctly requires understanding not just graphemes but the Unicode 
Linebreak Algorithm, which in turn relies upon understanding the print
widths for grapheme cluster strings and East Asian wide or full characters.
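
For the \R part, Python's stdlib already gets you most of the way:
str.splitlines understands the full set of Unicode line-break
sequences (CR, LF, CR LF, NEL, FF, VT, LINE SEPARATOR, PARAGRAPH
SEPARATOR), much like \R in a regex, so a wc-style line count can
lean on it directly:

```python
# One logical line per Unicode line-break sequence, whatever it is:
text = "one\r\ntwo\u2028three\u0085four"
lines = text.splitlines()
assert lines == ["one", "two", "three", "four"]
```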

It's something you only want to do once and never think about again. :(



Python tracker <report at bugs.python.org>
