[docs] [issue12731] python lib re uses obsolete sense of \w in full violation of UTS#18 RL1.2a

Tom Christiansen report at bugs.python.org
Sat Aug 13 02:18:24 CEST 2011


Tom Christiansen <tchrist at perl.com> added the comment:

> Terry J. Reedy <tjreedy at udel.edu> added the comment:

> However desirable it would be, I do not believe there is any claim in the
> manual that the re module follows the evolving Unicode consortium r.e.
> standard.

My from-the-hip thought is that if re cannot be fixed to follow
the Unicode Standard, it should be deprecated in favor of code
that can, if such is available, because you cannot process Unicode
text with regular expressions otherwise.

> If I understand, you are saying that this statement in the doc, "Matches
> Unicode word characters;" is not now correct and should be revised. Was
> it once correct? Could we add "by an older definition of 'word' character"?

Yes, your hunch is exactly correct.  They once had a lesser definition than
they have now.  It is very, very old.  I had to track this down for Java
once.  There is some discussion of a "word_character class" at least
as far back as tr18v3 from back in 1998.

    http://www.unicode.org/reports/tr18/tr18-3.html

By the time tr18v5 rolled around just a year later in 1999, the overall
document had changed substantially, and you can clearly see its current
shape there.  Word characters are supposed to include all code points with
the Alphabetic property, for example.

    http://www.unicode.org/reports/tr18/tr18-5.html

However, the word "alphabetic" has *never* been synonymous in 
Unicode with 

    \p{gc=Lu}
    \p{gc=Ll}
    \p{gc=Lt}
    \p{gc=Lm}
    \p{gc=Lo}

as many people incorrectly assume, nor certainly to 

    \p{gc=Lu}
    \p{gc=Ll}
    \p{gc=Lt}

let alone to 

    \p{gc=Lu}
    \p{gc=Ll}

Rather, it has since its creation included code points that are not
letters, such as all GC=Nl and also certain GC=So code points.  And,
notoriously, U+0345. Indeed it is here I first noticed that Python had
already broken with the Standard, because U+0345 COMBINING GREEK
YPOGEGRAMMENI is GC=Mn, but Alphabetic=True, yet I have shown that
Python's title method is messing up there.
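For anyone who wants to poke at U+0345 directly, here is a minimal sketch,
assuming Python 3 and only the standard library.  Note that the stdlib
unicodedata module exposes General_Category but not the Alphabetic property,
so the Alphabetic=True part cannot be checked this way; the variable names
are just for illustration.

```python
import unicodedata

ch = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI

name = unicodedata.name(ch)
category = unicodedata.category(ch)   # General_Category, a nonspacing mark
uppercased = ch.upper()               # the uppercase mapping of U+0345

print(name)
print("General_Category:", category)
print("upper():", [unicodedata.name(c) for c in uppercased])
print("isalpha():", ch.isalpha())     # Python's own notion, based on L* only
```

The point of the last line is that str.isalpha() is driven by the letter
categories, which is not the same thing as the Unicode Alphabetic property.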

I wouldn't spend too much time on archaeological digs, though, because lots
of stuff has changed since the last millennium.  It was in tr18v7 from
2003-05 that we hit paydirt, because this is when Annex C, of RL1.2a
fame, first appeared:

    http://www.unicode.org/reports/tr18/tr18-7.html#Compatibility_Properties

Notice how it defines \w to be nothing more than \p{alpha}, \p{digit}, and
\p{gc=Pc}.  It does not yet contain the requirement that all Marks be
counted as part of the word, just the few that are alphas -- which
U+0345 counts as, since it has an uppercase mapping to a capital iota!

That particular change did not occur until tr18v8 in 2003-08, a scant
three months later.

    http://www.unicode.org/reports/tr18/tr18-8.html#Compatibility_Properties

Now at last we see word characters defined in the modern way that we 
have become used to.  They must match any of:

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}
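As a rough illustration of that four-part definition, here is a sketch
assuming Python 3 and only the standard library.  The helper name
is_word_char_approx is hypothetical, and because the stdlib has no
Alphabetic property, \p{alpha} is approximated as "any letter category
plus Nl", which misses the Other_Alphabetic code points.  It is not a
conforming implementation, only a way to compare against what re does.

```python
import re
import unicodedata

def is_word_char_approx(ch):
    """Approximate the tr18 \\w definition from General_Category alone.

    Assumption: Alphabetic is approximated as L* plus Nl, which leaves out
    the Other_Alphabetic code points that the real property includes.
    """
    gc = unicodedata.category(ch)
    return (
        gc.startswith("L") or gc == "Nl"  # rough stand-in for \p{alpha}
        or gc.startswith("M")             # \p{gc=Mark}
        or gc == "Nd"                     # \p{digit}
        or gc == "Pc"                     # \p{gc=Connector_Punctuation}
    )

# letter, two marks, digit, fraction, underscore, undertie, hyphen
samples = ["a", "\u0345", "\u0301", "7", "\u00bd", "_", "\u203f", "-"]
for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        "approx:", is_word_char_approx(ch),
        "re:", bool(re.fullmatch(r"\w", ch)),
    )
```

Wherever the two columns disagree, that is a place to look at more closely
against the definition above.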

BTW, Python is matching  all of 

    \p{GC=N}

meaning

    \p{GC=Nd}
    \p{GC=Nl}
    \p{GC=No}

instead of the required 

    \p{GC=Nd}

which is a synonym for \p{digit}.

I don't know how that happened, because \w has never included
all number code points in Unicode, only the decimal number ones.
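To see how re treats one code point from each of the three number
categories, something like the following sketch can be run.  It assumes
Python 3; the three sample characters are my own choice, and the result
for each depends on the Python version in use, so none is hard-coded here.

```python
import re
import unicodedata

# One sample code point per number category (chosen for illustration).
numbers = ["7", "\u2167", "\u00bd"]  # DIGIT SEVEN, ROMAN NUMERAL EIGHT, ONE HALF

report = {}
for ch in numbers:
    gc = unicodedata.category(ch)
    matched = bool(re.fullmatch(r"\w", ch))
    report[ch] = (gc, matched)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: gc={gc}, \\w matches: {matched}")
```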

That all goes to show why, when citing conformance to some aspect of 
The Unicode Standard, one must be exceedingly careful just how one 
does so!
The Unicode Consortium recognizes this is an issue, and I am pretty
sure I can hear it in your own subtext as well.  

Kindly bear with and forgive me for momentarily sounding like a standards
lawyer.  I do this to show not just why it is important to get
references to the Unicode Standard correct, but indeed, how to do so.

After I have given the formal requirements, I will then produce
illustrations of various purported claims, some of which meet the
citation requirements, and others which do not.

=======================================================================

To begin with, there is an entire technical report on conformance.
It includes:

    http://unicode.org/reports/tr33/

    The Unicode Standard [Unicode] is a very large and complex standard.
    Because of this complexity, and because of the nature and role of the
    standard, it is often rather difficult to determine, in any particular
    case, just exactly what conformance to the Unicode Standard means.

...

    Conformance claims must be specific to versions of the Unicode
    Standard, but the level of specificity needed for a claim may vary
    according to the nature of the particular conformance claim. Some
    standards developed by the Unicode Consortium require separate
    conformance to a specific version (or later), of the Unicode Standard.
    This version is sometimes called the  base version. In such cases, the
    version of the standard and the version of the Unicode Standard to
    which the conformance claim refers must be compatible.

However, you don't need to read tr33, really, because *the* most important
bits about conformance are to be found on pp. 57-58 of Chapter 3 of
the published Unicode Standard:

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

    References to the Unicode Standard

    The documents associated with the major, minor, and update versions are called the major
    reference, minor reference, and update reference, respectively. For example, consider
    Unicode Version 3.1.1. The major reference for that version is The Unicode Standard, Version
    3.0 (ISBN 0-201-61633-5). The minor reference is Unicode Standard Annex #27, "The Unicode
    Standard, Version 3.1." The update reference is Unicode Version 3.1.1. The exact list of
    contributory files, Unicode Standard Annexes, and Unicode Character Database files can
    be found at Enumerated Version 3.1.1.

    The reference for this version, Version 6.0.0, of the Unicode Standard, is

         The Unicode Consortium. The Unicode Standard, Version 6.0.0, defined
         by: The Unicode Standard, Version 6.0 (Mountain View, CA: The Unicode
         Consortium, 2011. ISBN 978-1-936213-01-6)

    References to an update (or minor version prior to Version 5.2.0) include a reference to
    both the major version and the documents modifying it. For the standard citation format
    for other versions of the Unicode Standard, see "Versions" in Section B.6, Other Unicode
    Online Resources.

    Precision in Version Citation

    Because Unicode has an open repertoire with relatively frequent updates, it is important
    not to over-specify the version number. Wherever the precise behavior of all Unicode
    characters needs to be cited, the full three-field version number should be used, as in
    the first example below. However, trailing zeros are often omitted, as in the second
    example. In such a case, writing 3.1 is in all respects equivalent to writing 3.1.0.

       1. The Unicode Standard, Version 3.1.1
       2. The Unicode Standard, Version 3.1
       3. The Unicode Standard, Version 3.0 or later
       4. The Unicode Standard

    Where some basic level of content is all that is important, phrasing such as in the third
    example can be used. Where the important information is simply the overall architecture
    and semantics of the Unicode Standard, the version can be omitted entirely, as in example 4.

    References to Unicode Character Properties

    Properties and property values have defined names and abbreviations, such as

          Property:       General_Category (gc)
          Property Value: Uppercase_Letter (Lu)

    To reference a given property and property value, these aliases are used, as in this example:

          The property value Uppercase_Letter from the General_Category
          property, as specified in Version 6.0.0 of the Unicode Standard.

    Then cite that version of the standard, using the standard citation format that is provided
    for each version of the Unicode Standard.

    When referencing multi-word properties or property values, it is permissible to omit the
    underscores in these aliases or to replace them by spaces.

    When referencing a Unicode character property, it is customary to prepend the word
    "Unicode" to the name of the property, unless it is clear from context that the Unicode
    Standard is the source of the specification.

    References to Unicode Algorithms

    A reference to a Unicode algorithm must specify the name of the algorithm or its
    abbreviation, followed by the version of the Unicode Standard, as in this example:

      The Unicode Bidirectional Algorithm, as specified in Version
      6.0.0 of the Unicode Standard.

      See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm,"
      (http://www.unicode.org/reports/tr9/tr9-23.html)

=======================================================================

Now for some concrete citation examples, both correct and dubious.

In the JDK7 documentation for the Character class we find:

    Character information is based on the Unicode Standard, version 6.0.0.

That one is a perfectly good conformance citation, even if there seems
to be a bit of wiggle in "is based on", but no matter.  It is short and
does everything it needs to.

However, in the JDK7 documentation for the Pattern class we
somewhat problematically find:

     Unicode support 

     This class is in conformance with Level 1 of Unicode Technical
     Standard #18: Unicode Regular Expression, plus RL2.1 Canonical
     Equivalents.

And similarly, in the JDK7 documentation for the Normalizer class we find:

    This class provides the method normalize which transforms Unicode
    text into an equivalent composed or decomposed form, allowing for
    easier sorting and searching of text. The normalize method supports
    the standard normalization forms described in  Unicode Standard
    Annex #15 — Unicode Normalization Forms.

The problem with those second two Java refs is that they to my reading
appear to be in technical violation, for they give neither a full
version number nor a date of publication.  

You *have* to give one or the other, or both.  

Java got themselves into a heap of trouble (so to speak) over
this once before because it turned out that the version of the
document they were actually in conformance with was quite
literally from the previous millennium!!

That's why you need to give versions and publication dates.
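On the Python side, the Unicode Character Database version is at least
easy to state precisely, because the standard library reports it.  The
following is a minimal sketch assuming Python 3; the helper name
cite_unicode_version is hypothetical, and the wording of the string merely
follows the "The Unicode Standard, Version x.y.z" pattern quoted above.

```python
import unicodedata

def cite_unicode_version():
    """Build a version citation from the UCD version this Python ships with.

    unicodedata.unidata_version is a string such as "6.0.0"; the phrasing
    around it is an assumption modeled on the citation examples above.
    """
    return f"The Unicode Standard, Version {unicodedata.unidata_version}"

print(cite_unicode_version())
```

This covers only the main Unicode release.  It says nothing about which
revision of a technical report such as tr18 is being followed, and that
still has to be stated by hand.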

Here are some other citations.

First, from the perldelta manpage that the Perl 5.14 release ships with:

       Perl comes with the Unicode 6.0 data base updated with Corrigendum
       #8 <http://www.unicode.org/versions/corrigendum8.html>, with one
       exception noted below.  See <http://unicode.org/versions/Unicode6.0.0/> 
       for details on the new release.  Perl does not support any Unicode 
       provisional properties, including the new ones for this release.

That is quite complete, as it even specifies which corrigenda we
follow and explains the matter of properties.

Or this from the perlunicode manpage of that same release:

   Unicode Regular Expression Support Level
       The following list of Unicode supported features for
       regular expressions describes all features currently
       directly supported by core Perl.  The references to "Level
       N" and the section numbers refer to the Unicode Technical
       Standard #18, "Unicode Regular Expressions", version 13,
       from August 2008.

See all that?  Notice how it gives the name of the document, its revision
number, and its publication date.  You don't have to do all that for the
main Unicode release, but you really ought to when referring to individual
technical reports BECAUSE THESE GET UPDATED ASYNCHRONOUSLY.

I would suggest you pick a version of tr18 that you conform to,
and state which of its requirements you do and do not meet.

However, I cannot find any version of tr18 from the present millennium
for which Python comes even close to meeting more than one or two
requirements.  Given that, it may be better to no longer make any claims
regarding Unicode at all.  That seems like back-pedaling to me, not
future-thinking.

Matthew's regex module, however, does *almost* everything right that re
does wrong.  It may be that as with Java's io vs nio classes (and now
heaven forbid nio2!), you actually can't fix the old module and must
create a wholly new namespace.  I cannot answer that.

For RL1.2 proper, the first properties requirement, Java was only missing a
few, so they went and added the missing properties.  I strongly urge you to
do the same, because you cannot handle Unicode without properties.  RL1.2
requires only 11 of them, so it isn't too hard.  Matthew supports many,
many more.

However, because the \w&c issues are bigger, Java addressed the tr18 RL1.2a
issues differently, this time by creating a new compilation flag called
UNICODE_CHARACTER_CLASSES (with a corresponding embedded "(?U)" regex flag).

Truth be told, even Perl has secret pattern compilation flags to govern
this sort of thing (ascii, locale, unicode), but we (well, I) hope you
never have to use or even notice them.  

That too might be a route forward for Python, although I am not quite sure
how much flexibility and control of your lexical scope you have.  However,
the "from __future__" imports suggest you may have enough to do something
slick so that only people who ask for it get it, and also importantly that
they get it all over the place, so they don't have to add an extra flag or
u'...' or whatever every single time.
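For reference, Python 3's re already has per-pattern flags of roughly this
kind, re.ASCII and re.UNICODE, with inline forms (?a) and (?u).  The
sketch below assumes Python 3, where str patterns use the Unicode sense of
\w by default.  It shows only the existing switch between re's ASCII and
Unicode behavior, not a tr18-conformant mode, which re does not offer.

```python
import re

text = "na\u00efve caf\u00e9"  # precomposed i-diaeresis and e-acute

# Default for str patterns in Python 3: re's Unicode sense of \w.
unicode_words = re.findall(r"\w+", text)

# Restrict \w to [a-zA-Z0-9_], via the flag or its inline form.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
inline_ascii_words = re.findall(r"(?a)\w+", text)

print(unicode_words)
print(ascii_words)
print(inline_ascii_words)
```

A flag in the style of Java's UNICODE_CHARACTER_CLASSES would presumably
sit alongside these, but that is speculation on the design, not something
that exists in re.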

This isn't something I've looked much into, however.

Hope this clarifies things.

--tom

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12731>
_______________________________________

