[issue12736] Request for python casemapping functions to use full not simple casemaps per Unicode's recommendation

Tom Christiansen report at bugs.python.org
Sat Aug 27 16:48:39 CEST 2011


Tom Christiansen <tchrist at perl.com> added the comment:

Guido van Rossum <report at bugs.python.org> wrote
   on Fri, 26 Aug 2011 21:11:24 -0000: 

> Would this also affect .islower() and friends?

SHORT VERSION:  (7 lines)

    I don't believe so, but the relationship between lower() and islower()
    is not as clear to me as I would have thought, and more importantly,
    the code and the documentation for Python's islower() etc currently seem
    to disagree.  For future releases, I recommend fixing the code, but if
    compatibility is an issue, then perhaps for previous releases still in
    maintenance mode fixing only the documentation would possibly be good
    enough--your call.

=======================================================================

MEDIUM VERSION: (87 lines)

I was initially confused by Python's islower() family because of the way
they are defined to operate on full strings.  They don't check that
everything is lowercase even though they say they do.

 <  http://docs.python.org/py3k/library/stdtypes.html#sequence-types-str-bytes-bytearray-list-tuple-range

    str.lower()

        Return a copy of the string with all the cased characters [4]
        converted to lowercase.

    str.islower()

        Return true if all cased characters [4] in the string are lowercase 
        and there is at least one cased character, false otherwise.

    [4] (1, 2, 3, 4) Cased characters are those with general category
        property being one of “Lu” (Letter, uppercase), “Ll” (Letter,
        lowercase), or “Lt” (Letter, titlecase).

This is strange in several ways.  Of lesser importance is that
strings can be considered lowercase even if they don't match

    ^\p{lowercase}+$

Another is that the result of calling str.lower() may not be .islower().
I'm not sure what these are particularly for, since I myself would just use
a regex to get finer-grained control.  (I suppose this approach never gained
any traction in the Python community because re doesn't give access to the
Unicode properties it needs.)
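
One case where the output of lower() fails islower() follows directly from
the documented definition quoted above: a string with no cased characters at
all.  A minimal Python sketch (the sample strings are illustrative and are not
taken from the original report):

```python
# Per the documented definition, islower() requires at least one cased
# character, so lowercasing a string that has none gives a result that
# is not islower().
samples = ["1234", "", "ABC"]

for s in samples:
    lowered = s.lower()
    print(repr(lowered), lowered.islower())

no_cased = "1234".lower().islower()
with_cased = "ABC".lower().islower()
```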

However, the worst of this is that the documentation defines both cased
characters and lowercase characters *differently* from how Unicode
defines those very same terms.  This was quite confusing.

Unicode distinguishes Cased code points from Cased_*Letter* code points.
Python is using the Cased_Letter property but calling it Cased.  Cased is
a proper superset of Cased_Letter.  From the DerivedCoreProperties file in
the Unicode Character Database:

    # Derived Property:   Cased (Cased)
    #  As defined by Unicode Standard Definition D120
    #  C has the Lowercase or Uppercase property or has a General_Category value of Titlecase_Letter.

In the same way, the Lowercase and Uppercase properties are not the same as
the Lowercase_*Letter* and Uppercase_*Letter* properties.  Rather, the former
are respectively proper supersets of the latter.  

    # Derived Property: Lowercase
    #  Generated from: Ll + Other_Lowercase

    [...]

    # Derived Property: Uppercase
    #  Generated from: Lu + Other_Uppercase

In all these, you almost always want the superset versions not the
restricted subset versions you are using.  If it were in the regex engine,
the user could select either.
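
To make the subset relationship concrete, here is a small Python sketch.  It
assumes only the standard unicodedata module; the helper name
is_cased_letter is made up for this illustration.  It tests for the
Cased_Letter categories (Lu, Ll, Lt) and shows that some code points this
message lists as Lowercase or Uppercase fall outside them:

```python
import unicodedata

CASED_LETTER_CATEGORIES = {"Lu", "Ll", "Lt"}

def is_cased_letter(ch):
    """True if ch has general category Lu, Ll, or Lt (Cased_Letter)."""
    return unicodedata.category(ch) in CASED_LETTER_CATEGORIES

# 'x' is an ordinary Ll letter; the others are code points that this
# message lists as Lowercase or Uppercase despite not being Ll or Lu.
for cp in (0x0078, 0x2170, 0x24D0, 0x2160, 0x24B6):
    ch = chr(cp)
    print(f"U+{cp:04X} GC={unicodedata.category(ch)} "
          f"cased_letter={is_cased_letter(ch)}")
```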

Java used to miss all these, too.  But in 1.7, they updated their character
methods to use the properties that they'd all along said they were using:

  < http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLowerCase(char)

    public static boolean isLowerCase(char ch)
    Determines if the specified character is a lowercase character. 

     A character is lowercase if its general category type, provided by
     Character.getType(ch), is LOWERCASE_LETTER, or it has contributory
->   property Other_Lowercase as defined by the Unicode Standard.

    Note: This method cannot handle supplementary characters.  To
          support all Unicode characters, including supplementary
          characters, use the isLowerCase(int) method.

(And yes, that's where Java uses "character" to mean "code unit" 
 not "code point", alas.  No wonder people get confused.)

I'm pretty sure that Python needs to either update its documentation to
match its code, update its code to match its documentation, or both.  Java
chose to update the code to match the documentation, and this is the course
I would recommend if at all possible.  If you say you are checking for
cased code points, then you should use the Unicode definition of cased code
points not your own, and if you say you are checking for lowercase code
points, then you should use the Unicode definition not your own.  Both of
these require access to contributory properties from the UCD and not 
just general categories alone.
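
As a rough sketch of what "use the Unicode definition" could look like in
Python, the following applies the documented islower() rule (all cased
characters are lowercase, and there is at least one) on top of the derived
Lowercase, Uppercase, and Cased properties.  The function names are
hypothetical, and the two sample sets are hand-picked from the listings in
this message; they are NOT the complete Other_Lowercase and Other_Uppercase
sets, which a real implementation would have to load from the UCD.

```python
import unicodedata

# Illustrative samples only, taken from code points listed in this message.
OTHER_LOWERCASE_SAMPLE = frozenset(
    {0x0345, 0x2170, 0x2171, 0x2172, 0x24D0, 0x24D1, 0x24D2, 0x1D43, 0x2090})
OTHER_UPPERCASE_SAMPLE = frozenset(
    {0x2160, 0x2161, 0x2162, 0x24B6, 0x24B7, 0x24B8})

def is_lowercase(ch):
    # Derived Property Lowercase: Ll + Other_Lowercase
    return unicodedata.category(ch) == "Ll" or ord(ch) in OTHER_LOWERCASE_SAMPLE

def is_uppercase(ch):
    # Derived Property Uppercase: Lu + Other_Uppercase
    return unicodedata.category(ch) == "Lu" or ord(ch) in OTHER_UPPERCASE_SAMPLE

def is_cased(ch):
    # Derived Property Cased: Lowercase, Uppercase, or Titlecase_Letter
    return is_lowercase(ch) or is_uppercase(ch) or unicodedata.category(ch) == "Lt"

def islower_by_property(s):
    """All cased characters are Lowercase, and there is at least one."""
    cased = [ch for ch in s if is_cased(ch)]
    return bool(cased) and all(is_lowercase(ch) for ch in cased)
```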

--tom

=======================================================================

LONG VERSION: (222 lines)

Essential tools I use for inspecting Unicode code points and their 
properties include

    http://training.perl.com/scripts/unichars
    http://training.perl.com/scripts/uniprops
    http://training.perl.com/scripts/uninames

And over the course of the day, these get used a fair bit, too:

    http://training.perl.com/scripts/uniquote
    http://training.perl.com/scripts/ucsort
    http://training.perl.com/scripts/unifmt

Here for example are (some of) the *non*-Letter code points that
are nonetheless considered lowercase or uppercase because
they have the Other_{Lower,Upper}case properties:

    % unichars -gs '\PL' '[\p{upper}\p{lower}]'
     ○ͅ  U+00345 GC=Mn SC=Inherited    COMBINING GREEK YPOGEGRAMMENI
     Ⅰ  U+02160 GC=Nl SC=Latin        ROMAN NUMERAL ONE
     Ⅱ  U+02161 GC=Nl SC=Latin        ROMAN NUMERAL TWO
     Ⅲ  U+02162 GC=Nl SC=Latin        ROMAN NUMERAL THREE
     [...]
     ⅰ  U+02170 GC=Nl SC=Latin        SMALL ROMAN NUMERAL ONE
     ⅱ  U+02171 GC=Nl SC=Latin        SMALL ROMAN NUMERAL TWO
     ⅲ  U+02172 GC=Nl SC=Latin        SMALL ROMAN NUMERAL THREE
     [...]
     Ⓐ  U+024B6 GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER A
     Ⓑ  U+024B7 GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER B
     Ⓒ  U+024B8 GC=So SC=Common       CIRCLED LATIN CAPITAL LETTER C
     [...]
     ⓐ  U+024D0 GC=So SC=Common       CIRCLED LATIN SMALL LETTER A
     ⓑ  U+024D1 GC=So SC=Common       CIRCLED LATIN SMALL LETTER B
     ⓒ  U+024D2 GC=So SC=Common       CIRCLED LATIN SMALL LETTER C
     [...]
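
For readers without the Perl tools, a loose Python approximation of one line
of that listing can be built from the standard unicodedata module.  The
helper name describe is invented here, and the script (SC=) column is left
out because unicodedata does not expose the Script property:

```python
import unicodedata

def describe(cp):
    """Format a code point roughly like unichars does, minus the script."""
    ch = chr(cp)
    return f"U+{cp:05X} GC={unicodedata.category(ch)} {unicodedata.name(ch)}"

# A few of the code points from the listing above.
for cp in (0x0345, 0x2160, 0x2170, 0x24B6, 0x24D0):
    print(describe(cp))
```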

And here are (some of) the letters that are cased but which are
not Lu, Lt, or Ll (they're all Lm, in fact):

    % unichars -gs '\p{Lm}' '\p{cased}'  | ucsort
     ᴭ  U+1D2D GC=Lm SC=Latin        MODIFIER LETTER CAPITAL AE
     ᴬ  U+1D2C GC=Lm SC=Latin        MODIFIER LETTER CAPITAL A
     ᵃ  U+1D43 GC=Lm SC=Latin        MODIFIER LETTER SMALL A
     ₐ  U+2090 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER A
     ᵅ  U+1D45 GC=Lm SC=Latin        MODIFIER LETTER SMALL ALPHA
     ᴮ  U+1D2E GC=Lm SC=Latin        MODIFIER LETTER CAPITAL B
     ᵇ  U+1D47 GC=Lm SC=Latin        MODIFIER LETTER SMALL B
     [...]
     ʷ  U+02B7 GC=Lm SC=Latin        MODIFIER LETTER SMALL W
     ᵂ  U+1D42 GC=Lm SC=Latin        MODIFIER LETTER CAPITAL W
     ˣ  U+02E3 GC=Lm SC=Latin        MODIFIER LETTER SMALL X
     ₓ  U+2093 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER X
     ʸ  U+02B8 GC=Lm SC=Latin        MODIFIER LETTER SMALL Y
     ᶻ  U+1DBB GC=Lm SC=Latin        MODIFIER LETTER SMALL Z
     ᵝ  U+1D5D GC=Lm SC=Greek        MODIFIER LETTER SMALL BETA
     ᵞ  U+1D5E GC=Lm SC=Greek        MODIFIER LETTER SMALL GREEK GAMMA
     ᵟ  U+1D5F GC=Lm SC=Greek        MODIFIER LETTER SMALL DELTA
     ᶿ  U+1DBF GC=Lm SC=Greek        MODIFIER LETTER SMALL THETA
     ͺ  U+037A GC=Lm SC=Greek        GREEK YPOGEGRAMMENI
     ᵠ  U+1D60 GC=Lm SC=Greek        MODIFIER LETTER SMALL GREEK PHI
     ᵡ  U+1D61 GC=Lm SC=Greek        MODIFIER LETTER SMALL CHI
     ᵸ  U+1D78 GC=Lm SC=Cyrillic     MODIFIER LETTER CYRILLIC EN

Perversely, here are some of the modifier letters which are *not* cased:

    % unichars -gs '\p{Lm}' '\P{CASED}'  | ucsort
     ₕ  U+2095 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER H
     ʻ  U+02BB GC=Lm SC=Common       MODIFIER LETTER TURNED COMMA
     ʽ  U+02BD GC=Lm SC=Common       MODIFIER LETTER REVERSED COMMA
     ⁱ  U+2071 GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER I
     ₖ  U+2096 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER K
     ₗ  U+2097 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER L
     ₘ  U+2098 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER M
     ⁿ  U+207F GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER N
     ₙ  U+2099 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER N
     ₚ  U+209A GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER P
     ₛ  U+209B GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER S
     ₜ  U+209C GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER T
     ʹ  U+02B9 GC=Lm SC=Common       MODIFIER LETTER PRIME
     ʺ  U+02BA GC=Lm SC=Common       MODIFIER LETTER DOUBLE PRIME
     ˆ  U+02C6 GC=Lm SC=Common       MODIFIER LETTER CIRCUMFLEX ACCENT
     ˇ  U+02C7 GC=Lm SC=Common       CARON
     ˈ  U+02C8 GC=Lm SC=Common       MODIFIER LETTER VERTICAL LINE
     ˉ  U+02C9 GC=Lm SC=Common       MODIFIER LETTER MACRON
     ˊ  U+02CA GC=Lm SC=Common       MODIFIER LETTER ACUTE ACCENT
     ˋ  U+02CB GC=Lm SC=Common       MODIFIER LETTER GRAVE ACCENT
     ˌ  U+02CC GC=Lm SC=Common       MODIFIER LETTER LOW VERTICAL LINE

(Interesting how the commas sort as breath marks next to H.)

I cannot for the life of me figure out why Unicode deems these lowercase:

     ᵃ  U+1D43 GC=Lm SC=Latin        MODIFIER LETTER SMALL A
     ₐ  U+2090 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER A
     ᵅ  U+1D45 GC=Lm SC=Latin        MODIFIER LETTER SMALL ALPHA

yet these *not* to be cased:

     ⁱ  U+2071 GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER I
     ₘ  U+2098 GC=Lm SC=Latin        LATIN SUBSCRIPT SMALL LETTER M
     ⁿ  U+207F GC=Lm SC=Latin        SUPERSCRIPT LATIN SMALL LETTER N

All I know is what the tables tell me.

Here's a fair assortment of cased and noncased, case-changing and
non-casing code points.  The variation in binary properties is pretty wide.

    $ uniprops x 00aa 1d4e 2071 2172 df 262 1d401 1d42d 2117 24c5

    U+0078 ‹x› \N{LATIN SMALL LETTER X}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic ASCII Assigned Basic_Latin Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower Lowercase PerlWord POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print POSIX_Word Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+00AA ‹ª› \N{FEMININE ORDINAL INDICATOR}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+1D4E <ᵎ> \N{MODIFIER LETTER SMALL TURNED I}
        \w \pL \p{L_} \p{Lm}
        All Any Alnum Alpha Alphabetic Assigned InPhoneticExtensions Case_Ignorable CI Cased Dia Diacritic L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Lower Lowercase Phonetic_Extensions Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+2071 <ⁱ> \N{SUPERSCRIPT LATIN SMALL LETTER I}
        \w \pL \p{L_} \p{Lm}
        All Any Alnum Alpha Alphabetic Assigned InSuperscriptsAndSubscripts Case_Ignorable CI Changes_When_NFKC_Casefolded CWKCF L Lm Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Modifier_Letter Print SD Soft_Dotted Superscripts_And_Subscripts Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word

    U+2172 <ⅲ> \N{SMALL ROMAN NUMERAL THREE}
        \w \pN \p{Nl}
        All Any Alnum Alpha Alphabetic Assigned InNumberForms Cased Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Nl N Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Latin Latn Letter_Number Lower Lowercase Number Number_Forms Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+00DF <ß> \N{LATIN SMALL LETTER SHARP S}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+0262 <ɢ> \N{LATIN LETTER SMALL CAPITAL G}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+1D401 <𝐁> \N{MATHEMATICAL BOLD CAPITAL B}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
        All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Math Mathematical_Alphanumeric_Symbols Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word

    U+1D42D <𝐭> \N{MATHEMATICAL BOLD SMALL T}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InMathematicalAlphanumericSymbols Cased Cased_Letter LC Changes_When_NFKC_Casefolded CWKCF Common Zyyy Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower Lowercase Math Mathematical_Alphanumeric_Symbols Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word

    U+2117 ‹℗› \N{SOUND RECORDING COPYRIGHT}
        \pS \p{So}
        All Any Assigned InLetterlikeSymbols Common Zyyy So S Gr_Base Grapheme_Base Graph GrBase Letterlike_Symbols Other_Symbol Print Symbol X_POSIX_Graph X_POSIX_Print

    U+24C5 ‹Ⓟ› \N{CIRCLED LATIN CAPITAL LETTER P}
        \w \pS \p{So}
        All Any Alnum Alpha Alphabetic Assigned InEnclosedAlphanumerics Cased Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Common Zyyy Enclosed_Alphanumerics So S Gr_Base Grapheme_Base Graph GrBase Other_Symbol Print Symbol Upper Uppercase Word X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word

Unicode also has a Case_Ignorable (CI) character property, which I haven't 
thought much about but which might be useful.  

    http://www.unicode.org/reports/tr44/#Case_Ignorable

        Characters which are ignored for casing purposes. For more information,
        see D121 in Section 3.13, Default Case Algorithms in [Unicode].

        Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet

I'm not sure if you should think about these when doing your isupper()
test; maybe you should.  That way you wouldn't fail just because you had
a code point that was technically lowercase, like if someone used
"LEONARD MᶜCOY".  That funny ᶜ wouldn't count as a spoiler then, so that
"Leonard MᶜCoy".upper().isupper() could be true, as the ᶜ wouldn't
change but wouldn't count, either.  I haven't thought about this enough
though.  I'm not used to full string-based isupper() functions, so my
instincts may be wrong here.
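
A minimal sketch of that idea in Python, under stated assumptions: the helper
names are hypothetical, and Case_Ignorable is approximated by the general
categories in the "Generated from" line only.  The two Word_Break components
are omitted because the unicodedata module does not expose Word_Break.

```python
import unicodedata

# Approximation of Case_Ignorable: Mn + Me + Cf + Lm + Sk.
# Word_Break=MidLetter and Word_Break=MidNumLet are deliberately left out.
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}

def strip_case_ignorable(s):
    """Drop characters whose general category is treated as case-ignorable."""
    return "".join(ch for ch in s
                   if unicodedata.category(ch) not in CASE_IGNORABLE_CATEGORIES)

def isupper_ignoring_ci(s):
    """Apply str.isupper() after removing case-ignorable characters."""
    return strip_case_ignorable(s).isupper()

name = "Leonard M\u1d9cCoy"   # \u1d9c is the modifier-letter small c
print(isupper_ignoring_ci(name.upper()))
```

Whether filtering before the test is the right design, as opposed to teaching
isupper() itself about Case_Ignorable, is left open here just as it is in the
paragraph above.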

The only code point that is both CWCM and also CI is the notorious

     ○ͅ  U+00345 GC=Mn SC=Inherited    COMBINING GREEK YPOGEGRAMMENI

Subscripts, superscripts, modifier letters, small capitals, and mathematical
letters *tend* to be cased code points that do not change when casemapped
or casefolded, although there are exceptions.

    % uninames small capital '\b\R\b'
     ʀ  0280        LATIN LETTER SMALL CAPITAL R
            * voiced uvular trill
            * Germanic, Old Norse
            * uppercase is 01A6
     ʁ  0281        LATIN LETTER SMALL CAPITAL INVERTED R
            * voiced uvular fricative or approximant
            x (modifier letter small capital inverted r - 02B6)
     ʶ  02B6        MODIFIER LETTER SMALL CAPITAL INVERTED R
            * preceding four used for r-coloring or r-offglides
            x (latin letter small capital inverted r - 0281)
            # <super> 0281
     ᴙ  1D19        LATIN LETTER SMALL CAPITAL REVERSED R
     ᴚ  1D1A        LATIN LETTER SMALL CAPITAL TURNED R
      ᷢ  1DE2       COMBINING LATIN LETTER SMALL CAPITAL R

    % uniprops 280 1a6
    U+0280 <ʀ> \N{LATIN LETTER SMALL CAPITAL R}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InIPA_Extensions Cased Cased_Letter LC Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS IPA_Extensions Letter L_ Latin Latn Lowercase_Letter Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
    U+01A6 <Ʀ> \N{LATIN LETTER YR}
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
        All Any Alnum Alpha Alphabetic Assigned InLatinExtendedB Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin Latn Latin_Extended_B Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word

That's right: the uppercase of LATIN LETTER SMALL CAPITAL R is LATIN LETTER
YR, and I don't know why.  No other small capital -- which are all considered
lowercase -- changes when casemapped.  Only this one alone.
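
The mapping between these two code points can be checked from Python with
nothing more than the built-in string methods.  This is only a quick sanity
check of that one pair, not a survey of the other small capitals:

```python
small_capital_r = "\u0280"   # LATIN LETTER SMALL CAPITAL R
letter_yr = "\u01a6"         # LATIN LETTER YR

# The pair maps to each other under upper() and lower().
print(small_capital_r.upper() == letter_yr)
print(letter_yr.lower() == small_capital_r)
```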

Note that code points like U+00DF LATIN SMALL LETTER SHARP S
have these binary properties true because the normal/default sense of these
terms in Unicode is the full/string sense not the simple/character sense:

        Changes_When_Casefolded (CWCF) 
        Changes_When_Casemapped (CWCM)
        Changes_When_Titlecased (CWT) 
        Changes_When_Uppercased (CWU)

Those are true because the full uppercase map of "ß" is "SS" 
and the full casefold of "ß"  is "ss".
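
A short Python illustration of the full (string-level) mappings follows.  It
assumes an interpreter whose str methods implement full case mapping and
provide str.casefold(), which as far as I know means CPython 3.3 or later;
an interpreter using simple case maps would return "ß" unchanged from
upper().

```python
sharp_s = "\u00df"            # LATIN SMALL LETTER SHARP S

upper = sharp_s.upper()       # full uppercase map: one code point becomes two
folded = sharp_s.casefold()   # full casefold

print(repr(upper), len(upper))
print(repr(folded), len(folded))
```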

--tom

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12736>
_______________________________________

