[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
report at bugs.python.org
Sat Aug 13 01:04:54 CEST 2011
Tom Christiansen <tchrist at perl.com> added the comment:
"Terry J. Reedy" <report at bugs.python.org> wrote
on Fri, 12 Aug 2011 22:21:59 -0000:
> Does the regex module handle these particular issues better?
No, it currently does not. One would have to ask Matthew directly, but I
believe it was because he was trying to stay compatible with re, sometimes
apparently even if that means being bug compatible. I have brought it
to his attention though, and at last report he was pondering the matter.
In contrast to how Python behaves on narrow builds, Java's Pattern class
is quite adamant about dealing with logical code points alone, even though
Java uses UTF-16 for its internal representation of strings. Besides
running afoul of tr18, it is senseless to do otherwise. A dot is one
Unicode code point, no matter whether you have 8-bit code units, 16-bit
code units, or 32-bit code units. Similarly, character classes and their
negations only match entire code points, never pieces of the same.
ICU's regexes work the same way the normal Java Pattern library does.
So too do Perl, Ruby, and Go. Python is really the odd man out here.
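
As a minimal sketch of the code-point behavior described above, the following assumes Python 3.3 or later, where PEP 393 removed the narrow/wide build distinction. The sample character U+1D49C is chosen only for illustration. A narrow build from the time of this report would count the same character as two units.

```python
import re

# U+1D49C MATHEMATICAL SCRIPT CAPITAL A lies outside the BMP, so UTF-16
# needs two 16-bit code units (a surrogate pair) to store it.
ch = "\U0001D49C"
utf16_units = len(ch.encode("utf-16-le")) // 2

# On Python 3.3+ a str is indexed by code point, whatever the storage.
dot_match = re.fullmatch(r".", ch)

# A negated class consumes the whole code point, never half of a pair.
negated = re.findall(r"[^a]", "a" + ch + "b")

print(utf16_units, len(ch), dot_match is not None, len(negated))
# On Python 3.3+ this prints: 2 1 True 2
```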
One interesting counterexample is the vim editor. It has dot match a
complete grapheme no matter how many code points that requires, because
we're dealing with user-visible characters now, not programmer-visible ones.
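
To illustrate the gap between a code point and a user-visible character, here is a rough sketch. The helper `rough_clusters` is hypothetical and only glues combining marks onto the preceding base character. It is not the full grapheme cluster algorithm and is not how vim implements its matching.

```python
import re
import unicodedata

def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    A crude approximation of grapheme clusters, for illustration only.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"

# A code-point-level dot matches only the bare "e", not the accented unit.
dot = re.match(r".", s).group()

print(len(s), len(rough_clusters(s)), dot == "e")
# prints: 3 2 True
```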
It is an unreasonable burden to make the programmer deal with the
fine-grained details of low-level serialization schemes instead of at
least(*) the code point level of operations, which is the minimum for
getting real work done. (*Note that tr18 admits that accessing text at the
code point level meets only programmer expectations, not those of the user,
and therefore to meet user expectations much more elaborate patterns must
necessarily be constructed than if logical groups of coarser granularity
than code points alone are supported.)
Python should not be subject to changing its behavior from one build to the
next. This astonishing narrow-vs-wide build behavior makes it virtually
impossible to write portable code to work on arbitrary Unicode text. You
cannot even know whether you need to match one dot or two to get a single
code point, and similarly for character indexing, etc. Even identifiers
come into play. Surrogates should be utterly nonexistent/invisible at
this, the normal level of operation.
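
The "one dot or two" ambiguity comes from counting UTF-16 code units instead of code points. The sketch below, with hypothetical helper names `utf16_units` and `code_points`, shows how a surrogate pair folds back into a single code point. It is an illustration of the arithmetic, not a description of CPython internals.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]

def code_points(units):
    """Combine surrogate pairs in a list of UTF-16 code units."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 decoding of a high/low surrogate pair.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP code point, or a lone surrogate kept as-is
            i += 1
    return out

units = utf16_units("a\U0001D49Cb")
print(len(units), len(code_points(units)))
# prints: 4 3
```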
An API that minimally but uniformly deals with logical code points and
nothing finer in granularity is the only way to go here. Please trust me
on this one. Graphemes (tr18 Level 2) and collation elements (Level 3)
will someday build on that, but one must first support code points
properly. That's why it's a Level 1 requirement.
Python tracker <report at bugs.python.org>