[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Tom Christiansen report at bugs.python.org
Sun Aug 14 18:54:49 CEST 2011

Tom Christiansen <tchrist at perl.com> added the comment:

Ezio Melotti <report at bugs.python.org> wrote
   on Sun, 14 Aug 2011 07:15:09 -0000:

>> Unicode says you can't put surrogates or noncharacters in a
>> UTF-anything stream.  It's a bug to do so and pretend it's a
>> UTF-whatever.

> The UTF-8 codec described by RFC 2279 didn't say so, so, since our
> codec was following RFC 2279, it was producing valid UTF-8.  With RFC
> 3629 a number of things changed in a non-backward compatible way.
> Therefore we couldn't just change the behavior of the UTF-8 codec nor
> rename it to something else in Python 2.  We had to wait till Python 3
> in order to fix it.

I'm a bit confused on this.  You no longer fix bugs in Python 2?

I've dug out the references that state that you are not allowed to do things the
way you are doing them.  This is from the published Unicode Standard version 6.0.0,
chapter 3, Conformance.  It is a very important chapter.


Python is in violation of that published Standard by interpreting noncharacter code
points as abstract characters and tolerating them in character encoding forms like
UTF-8 or UTF-16.  This explains that conformant processes are forbidden from doing this.

    Code Points Unassigned to Abstract Characters

     C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point
         as an abstract character.
       · The high-surrogate and low-surrogate code points are designated for surrogate
         code units in the UTF-16 character encoding form. They are unassigned to any
         abstract character.

==>  C2 A process shall not interpret a noncharacter code point as an abstract character.
       · The noncharacter code points may be used internally, such as for sentinel
         values or delimiters, but should not be exchanged publicly.

     C3 A process shall not interpret an unassigned code point as an abstract character.
       · This clause does not preclude the assignment of certain generic semantics to
         unassigned code points (for example, rendering with a glyph to indicate the
         position within a character block) that allow for graceful behavior in the
         presence of code points that are outside a supported subset.
       · Unassigned code points may have default property values. (See D26.)
       · Code points whose use has not yet been designated may be assigned to abstract
         characters in future versions of the standard. Because of this fact, due care in
         the handling of generic semantics for such code points is likely to provide better
         robustness for implementations that may encounter data based on future
         versions of the standard.
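
A minimal sketch of how clauses C1 and C2 surface in practice, assuming a Python 3 interpreter (the Python 2 codecs discussed in this thread behave differently); the variable names are illustrative only. It suggests that a lone surrogate is refused by the strict UTF-8 codec unless the `surrogatepass` error handler is requested explicitly, while a noncharacter such as U+FFFE is still encoded without complaint.

```python
# Assumes Python 3; variable names are hypothetical.
lone_surrogate = "\ud800"   # a high-surrogate code point (clause C1)
noncharacter = "\ufffe"     # a noncharacter code point (clause C2)

try:
    lone_surrogate.encode("utf-8")      # default errors="strict"
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

# Opting in explicitly yields bytes that are not well-formed UTF-8.
forced = lone_surrogate.encode("utf-8", "surrogatepass")

# The noncharacter passes through the strict codec unchanged.
nonchar_bytes = noncharacter.encode("utf-8")

print(surrogate_rejected, forced, nonchar_bytes)
```

Whether the second behavior is acceptable is exactly the point in dispute here; the sketch only shows what the codec does, not what it ought to do.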

Next we have exactly how something you call UTF-{8,16,32} must be formed.
*This* is the Standard against which these things are measured; it is not the RFC.

You are of course perfectly free to say you conform to this and that RFC, but you
must not say you conform to the Unicode Standard when you don't.  These are different
things.  I feel it does users a grave disservice to ignore the Unicode Standard in
this, and sheer casuistry to rely on an RFC definition while ignoring the Unicode
Standard whence it originated, because this borders on being intentionally misleading.

    Character Encoding Forms

     C8 When a process interprets a code unit sequence which purports to be in a Unicode
         character encoding form, it shall interpret that code unit sequence according to the
         corresponding code point sequence.
==>    · The specification of the code unit sequences for UTF-8 is given in D92.
       · The specification of the code unit sequences for UTF-16 is given in D91.
       · The specification of the code unit sequences for UTF-32 is given in D90.

     C9 When a process generates a code unit sequence which purports to be in a Unicode
         character encoding form, it shall not emit ill-formed code unit sequences.
       · The definition of each Unicode character encoding form specifies the ill-
         formed code unit sequences in the character encoding form. For example, the
         definition of UTF-8 (D92) specifies that code unit sequences such as <C0 AF>
         are ill-formed.

==> C10 When a process interprets a code unit sequence which purports to be in a Unicode
         character encoding form, it shall treat ill-formed code unit sequences as an error
         condition and shall not interpret such sequences as characters.
       · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be followed
         by a code unit of the form 10xxxxxx₂. A sequence such as 110xxxxx₂ 0xxxxxxx₂
         is ill-formed and must never be generated. When faced with this ill-formed
         code unit sequence while transforming or interpreting text, a conformant
         process must treat the first code unit 110xxxxx₂ as an illegally terminated code
         unit sequence--for example, by signaling an error, filtering the code unit out, or
         representing the code unit with a marker such as U+FFFD replacement
         character.
       · Conformant processes cannot interpret ill-formed code unit sequences.
         However, the conformance clauses do not prevent processes from operating on code
         unit sequences that do not purport to be in a Unicode character encoding form.
         For example, for performance reasons a low-level string operation may simply
         operate directly on code units, without interpreting them as characters. See,
         especially, the discussion under D89.
       · Utility programs are not prevented from operating on "mangled" text. For
         example, a UTF-8 file could have had CRLF sequences introduced at every 80
         bytes by a bad mailer program. This could result in some UTF-8 byte sequences
         being interrupted by CRLFs, producing illegal byte sequences. This mangled
         text is no longer UTF-8. It is permissible for a conformant program to repair
         such text, recognizing that the mangled text was originally well-formed UTF-8
         byte sequences. However, such repair of mangled data is a special case, and it
         must not be used in circumstances where it would cause security problems.
         There are important security issues associated with encoding conversion,
         especially with the conversion of malformed text. For more information, see
         Unicode Technical Report #36, "Unicode Security Considerations."
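
The decoding side of C9 and C10 can be sketched in the same way, again assuming Python 3. The helper name `decodes_strictly` is hypothetical. The two byte strings are the `<C0 AF>` sequence cited under C9 and a 110xxxxx lead byte followed by a 0xxxxxxx byte, as in the C10 example.

```python
# Assumes Python 3; decodes_strictly is a hypothetical helper.
def decodes_strictly(raw):
    """Return True if raw decodes as UTF-8 with the default strict handler."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong = b"\xc0\xaf"       # <C0 AF>, cited as ill-formed under C9
truncated = b"\xc3\x41"      # 110xxxxx lead byte followed by 0xxxxxxx

print(decodes_strictly(overlong), decodes_strictly(truncated))

# Representing the bad code unit with U+FFFD is one of the responses
# that C10 lists; the trailing ASCII byte is still decoded normally.
replaced = truncated.decode("utf-8", errors="replace")
print(ascii(replaced))
```

Signaling an error and substituting U+FFFD both appear among the permitted responses in the C10 bullet; silently interpreting the sequence as a character does not.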

Here is the part that explains why Python narrow builds are actually UTF-16 not UCS-2,
and why its documentation needs to be updated:

    D89 In a Unicode encoding form: A Unicode string is said to be in a particular Unicode
           encoding form if and only if it consists of a well-formed Unicode code unit sequence
           of that Unicode encoding form.
        · A Unicode string consisting of a well-formed UTF-8 code unit sequence is said
           to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a
           UTF-8 string for short.
        · A Unicode string consisting of a well-formed UTF-16 code unit sequence is said
           to be in UTF-16. Such a Unicode string is referred to as a valid UTF-16 string,
           or a UTF-16 string for short.
        · A Unicode string consisting of a well-formed UTF-32 code unit sequence is said
           to be in UTF-32. Such a Unicode string is referred to as a valid UTF-32 string,
           or a UTF-32 string for short.

==> Unicode strings need not contain well-formed code unit sequences under all conditions.
    This is equivalent to saying that a particular Unicode string need not be in a Unicode
    encoding form.

        · For example, it is perfectly reasonable to talk about an operation that takes the
           two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which
           contains an ill-formed UTF-16 code unit sequence, and concatenates them to
           form another Unicode string <004D D800 DF02 004D>, which contains a well-
           formed UTF-16 code unit sequence. The first two Unicode strings are not in
           UTF-16, but the resultant Unicode string is.
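
The concatenation example under D89 can be reproduced as a rough sketch, assuming Python 3 and treating each 16-bit code unit as a big-endian byte pair. The helper names are hypothetical and only the standard `struct` module and the built-in `utf-16-be` codec are used.

```python
import struct

def units_to_bytes(units):
    """Pack 16-bit code units as big-endian bytes (hypothetical helper)."""
    return struct.pack(">%dH" % len(units), *units)

def is_well_formed_utf16(units):
    """Return True if the code unit sequence decodes as strict UTF-16."""
    try:
        units_to_bytes(units).decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False

left = [0x004D, 0xD800]    # ends with an unpaired high surrogate
right = [0xDF02, 0x004D]   # starts with an unpaired low surrogate
joined = left + right      # the surrogates now form a pair

print(is_well_formed_utf16(left), is_well_formed_utf16(right),
      is_well_formed_utf16(joined))

text = units_to_bytes(joined).decode("utf-16-be")
print([hex(ord(ch)) for ch in text])
```

Neither half is in UTF-16 on its own, yet both are perfectly usable as buffers, and the joined sequence decodes to U+004D, U+10302, U+004D.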


     D14 Noncharacter: A code point that is permanently reserved for internal use and that
           should never be interchanged. Noncharacters consist of the values U+nFFFE and
           U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
         · For more information, see Section 16.7, Noncharacters.
         · These code points are permanently reserved as noncharacters.
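
The D14 definition translates directly into a small predicate. This is a minimal sketch; the function name `is_noncharacter` is an assumption and not part of any Python library.

```python
def is_noncharacter(cp):
    """Return True if code point cp is a noncharacter as defined in D14."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane n (0..0x10).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))   # 32 in the BMP block plus 2 in each of 17 planes
```

The count comes to 66: the 32 code points of U+FDD0..U+FDEF and the last two code points of each of the 17 planes.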

     D15 Reserved code point: Any code point of the Unicode Standard that is reserved for
           future assignment. Also known as an unassigned code point.
         · Surrogate code points and noncharacters are considered assigned code points,
           but not assigned characters.
         · For a summary classification of reserved and other types of code points, see
           Table 2-3.

    In general, a conforming process may indicate the presence of a code point whose use has
    not been designated (for example, by showing a missing glyph in rendering or by signaling
    an appropriate error in a streaming protocol), even though it is forbidden by the standard
    from interpreting that code point as an abstract character.

Here's how I read all that.

The noncharacters and the unpaired surrogates are illegal for interchange, and their
presence in a UTF means that that UTF is not conformant to the requirements for what
a UTF shall contain.  Nonetheless, internally it is necessary that all code points,
even noncharacter code points and surrogates, be representable, and doing so does not
mean that you are no longer in that encoding form.  However, you must not allow
such things into a UTF stream, because doing so means that that stream is no longer
a UTF stream.
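
Under that reading, the check belongs at the interchange boundary and not in memory. A rough sketch of such a gate follows, assuming Python 3; `encode_for_interchange` is a hypothetical helper, not an existing codec or CPython behavior.

```python
def encode_for_interchange(text, encoding="utf-8"):
    """Encode text for a UTF stream, refusing noncharacters.

    Hypothetical helper. Lone surrogates need no extra check here because
    Python 3's strict UTF codecs already raise UnicodeEncodeError for them.
    """
    for index, ch in enumerate(text):
        cp = ord(ch)
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            raise ValueError("noncharacter U+%04X at index %d" % (cp, index))
    return text.encode(encoding)

def is_interchangeable(text):
    """Return True if text passes the gate above (hypothetical helper)."""
    try:
        encode_for_interchange(text)
        return True
    except ValueError:   # UnicodeEncodeError is a subclass of ValueError
        return False

print(is_interchangeable("M\U00010302M"),
      is_interchangeable("M\ufdd0"),
      is_interchangeable("M\ud800"))
```

The in-memory string is never altered or rejected; only the attempt to emit it as a UTF stream is.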

That's why I say that you are out of conformance by having encoders and decoders of UTF
streams tolerate noncharacters.  You are not allowed to call something a UTF and do
non-UTF things with it, because this is in violation of conformance requirement C2.
Therefore you must either (1) change what you call the thing that does the
nonconforming thing, or (2) change it so that it no longer does the nonconforming
thing.  If you do neither, then Python no longer conforms to the formal requirements
for handling such things as these are defined by the Unicode Standard, and therefore
that version of Python is no longer conformant to the version of the Unicode Standard
that it purports conformance to.  And yes, that's a long way of saying it's lying.

It's also why having noncharacters including surrogates in memory does *not* suddenly
mean that they are not stored in a UTF, because you have to be able to do that to
build up buffers per the concatenation example under definition D89.
Therefore, Python uses UTF-16 internally and should not say it uses UCS-2, because
that is inaccurate and incorrect; in short, it's wrong.  That doesn't help anybody.

At least, that's how I read the Unicode Standard.  Perhaps a more careful reading
than mine would admit alternate interpretations.  If you have not reread its Chapter
3 of late in its entirety, you probably want to do so.  There is quite a bit of
material there that is fundamental to any process that claims to be conformant with
the Unicode Standard.

I hope that makes sense.  These things can be extremely difficult to read, for they
are subtle and quick to anger. :)



Python tracker <report at bugs.python.org>
