[issue12737] str.title() is overzealous by upcasing combining marks inappropriately
Tom Christiansen
report at bugs.python.org
Sat Oct 1 13:07:49 CEST 2011
Tom Christiansen <tchrist at perl.com> added the comment:
Martin v. Löwis <report at bugs.python.org> wrote
on Sat, 01 Oct 2011 10:59:48 -0000:
>> * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.
> Where did you get that definition from? UTS#18 defines
> "<word_character>", which is Alphabetic + U+200C + U+200D
> (i.e. not including marks, but including those
From UTS#18 RL1.2A in Annex C, where a \p{word} or \w character
is defined to be
\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
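
As a rough illustration of that four-part definition, here is a minimal Python sketch. The name is_word_char is hypothetical, and the unicodedata module does not expose the Alphabetic property, so \p{alpha} is only approximated here by the general categories L* and Nl. That approximation misses the Other_Alphabetic characters, so treat it as a sketch and not as a faithful tr18 implementation.

```python
import unicodedata


def is_word_char(ch):
    """Approximate the tr18 \\p{word} test for a single code point.

    Assumption: \\p{alpha} is approximated by general categories L* and Nl,
    because the standard library has no Alphabetic property lookup.
    """
    cat = unicodedata.category(ch)
    return (
        cat.startswith("L") or cat == "Nl"  # rough stand-in for \p{alpha}
        or cat.startswith("M")              # \p{gc=Mark}: Mn, Mc, Me
        or cat == "Nd"                      # \p{digit}
        or cat == "Pc"                      # \p{gc=Connector_Punctuation}
    )


if __name__ == "__main__":
    for sample in ["a", "\u0301", "5", "_", " ", "-"]:
        print(repr(sample), unicodedata.category(sample), is_word_char(sample))
```

Dropping the Nd and Pc branches gives the narrower Alphabetic + Mn+Mc+Me set discussed below.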
>> I think what you are looking for here are Word characters without
>> Nd + Pc, so just Alphabetic + Mn+Mc+Me.
>>
>> Is that right?
>
> With your definition of "Word character" above, yes, that's right.
It's not mine. It's tr18's.
> Marks won't start a word, though.
That's the smarter boundary thing they talk about.
I'm not myself familiar with \pM
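
To make the point about marks concrete, here is a minimal sketch of a titlecasing loop in which a combining mark continues a word already in progress but never starts one. The name title_with_marks is hypothetical, letters are approximated by the L* general categories, and this is not how str.title() is actually implemented.

```python
import unicodedata


def title_with_marks(text):
    """Titlecase text, treating combining marks as word continuations.

    Assumption: "letter" here means general category L*, and anything that
    is neither a letter nor a mark ends the current word.
    """
    out = []
    in_word = False
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):
            # First letter of a word is titlecased, the rest are lowercased.
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        elif cat.startswith("M"):
            # A mark is copied through unchanged and leaves the word state
            # alone: it extends a word in progress but cannot start one.
            out.append(ch)
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


if __name__ == "__main__":
    # "de" + COMBINING ACUTE + "ja" + COMBINING GRAVE + " vu"
    print(title_with_marks("de\u0301ja\u0300 vu"))
```

With this rule the letter following a combining mark stays lowercase, because the mark did not end the word.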
> As for terminology: I think the documentation should continue to
> speak about "words" and "letters", and then define what is meant
> in this context. It's not that the Unicode consortium invented
> the term "letter", so we should use it more liberally than just
> referring to the L* categories.
I really don't think it wise to have private definitions of these.
If Letter doesn't mean L?, things get too weird. That's why
there are separate definitions of alphabetic, word, etc.
--tom
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12737>
_______________________________________