[New-bugs-announce] [issue12737] string.title() is overzealous by upcasing combining marks inappropriately
report at bugs.python.org
Fri Aug 12 00:37:34 CEST 2011
New submission from Tom Christiansen <tchrist at perl.com>:
Python's string.title() function claims it titlecases the first letter in each word and lowercases the rest. However, this is not true. It is not using either of the two word detection algorithms that Unicode provides. One allows you to use a legacy \w+, where \w means any Alphabetic, Mark, Decimal Number, or Connector Punctuation (see UTS#18 Annex C: Compatibility Properties), and the other uses the more sophisticated word-break algorithm (UAX#29) driven by the Word_Break property, with values such as Word_Break=MidNumLet.
Python is using neither of these, so it gets the wrong answer.
titlecase of déme un café should be Déme Un Café not DéMe Un Café
titlecase of i̇stanbul should be İstanbul not İStanbul
titlecase of ᾲ στο διάολο should be Ὰͅ Στο Διάολο not ᾺΙ Στο ΔιάΟλο
Because those are in NFD form, you get different answers than if they are in NFC. That is not right: you should get the same answer in either normalization form. The bug is that title() isn't using the right definition of \w, so a combining mark is treated as ending the word and the letter after it gets upcased. This is likely related to issue 12731.
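To illustrate the legacy \w approach described above, here is a minimal sketch, not the fix proposed for CPython. The names is_word_char and title_by_word are hypothetical, and the general-category test only approximates the Alphabetic property from UTS#18 (it ignores, for example, letter numbers). Casing is done one character at a time, so context-sensitive rules such as final sigma are not handled.

```python
import unicodedata


def is_word_char(ch):
    """Rough stand-in for the UTS#18 compatibility \\w (an assumption):
    Letter or Mark categories, Decimal Number, Connector Punctuation."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat in ("Nd", "Pc")


def title_by_word(text):
    """Titlecase the first character of each \\w+ run, lowercase the rest.

    Combining marks count as word characters, so they do not start
    a new word the way they do in the failing cases above.
    """
    out = []
    in_word = False
    for ch in text:
        if is_word_char(ch):
            # First character of a run is titlecased; the rest are lowercased.
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            # Anything else (spaces, punctuation) ends the current word.
            out.append(ch)
            in_word = False
    return "".join(out)


# NFD input: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
nfd = "de\u0301me un cafe\u0301"
print(title_by_word(nfd))
print(str.title(nfd))  # built-in result, for comparison; may vary by version
```

With this definition of a word, NFC and NFD spellings of the same text give canonically equivalent results, which is the property the report asks for.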
In the enclosed tester file, which fails 4 out of its 6 tests, there is also a bug shown with this failed result:
titlecase of 𐐼𐐯𐑅𐐨𐑉𐐯𐐻 should be 𐐔𐐯𐑅𐐨𐑉𐐯𐐻 not 𐐼𐐯𐑅𐐨𐑉𐐯𐐻
That one is related to issue 12730.
See the attached tester, which was run under Python 3.2. As far as I can tell, these bugs exist in all Python versions.
title: string.title() is overzealous by upcasing combining marks inappropriately
versions: Python 3.2
Added file: http://bugs.python.org/file22884/titletest.python
Python tracker <report at bugs.python.org>