[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

Sun Oct 2 20:41:12 CEST 2011

Tom Christiansen <tchrist at perl.com> added the comment:

Ezio Melotti <report at bugs.python.org> wrote
   on Sun, 02 Oct 2011 06:46:26 -0000: 

> Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely
> because that's a Unicode 1 name, and nowadays these codepoints are simply
> marked as '<control>'.

Yes, but there are a lot of them, 65 of them in fact.  I do not care to 
see people being forced to use literal control characters or inscrutable
magic numbers.  It really bothers me that you have all these defined code 
points with properties and all that have no name.   People do use these.
Some of them a lot.  I don't mind \n and such -- and in fact, prefer them 
even -- but I feel I should not have to scratch my head over characters \033,
\0177, and brethren.  The C0 and C1 standards are not just inventions, so we
use them.  Far better that one should write \N{ESCAPE} for \033 or \N{DELETE} 
for \0177, don't you think?  
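
A rough Python sketch of what named controls buy you over magic numbers.
The CONTROL_NAMES table and the helper names are made up for illustration;
nothing here is an existing unicodedata feature.  The last helper counts the
code points whose General_Category is Cc, which is where the figure of 65
comes from.

```python
import unicodedata

# Hypothetical table: a few control names, chosen for illustration only.
CONTROL_NAMES = {
    "LINE FEED": "\n",
    "ESCAPE": "\033",
    "DELETE": "\177",
}

def control(name):
    """Return the control character registered under `name`."""
    return CONTROL_NAMES[name]

def has_formal_name(ch):
    """True if the Unicode database gives `ch` a formal character name."""
    return unicodedata.name(ch, None) is not None

def count_controls():
    """Count code points whose General_Category is Cc (Control)."""
    return sum(1 for cp in range(0x110000)
               if unicodedata.category(chr(cp)) == "Cc")
```

With this, control("ESCAPE") reads better than "\033", and
has_formal_name("\033") is false because the database only says <control>.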

>> If so, then I don't understand that.  Nobody in their right
>> mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

> They probably don't, but they just write \n anyway.  I don't think we need
> to support any of these aliases, especially if they are not defined in the
> Unicode standard.

If you look at Names.txt, there are significant "aliases" there for 
the C0/C1 stuff.  My bottom line is that I don't like to be forced
to use magic numbers.  I prefer to name my abstractions.  It is more
readable and more maintainable that way.   

There are still "holes" of course.  Code point 128 has no name even in C1.
But something is better than nothing.  Plus at least in Perl we *can* give
things names if we want, per the APPLE LOGO example for U+F8FF.  So nothing
needs to remain nameless.  Why, you can even name your Kanji if you want, 
using whatever Romanization you prefer.  I think the private-use case
example is really motivating, but I have no idea how to do this for Python
because there is no lexical scope.  I suppose you could attach it to the
module, but that still doesn't really work because of how things get evaluated.
With a Perl compile-time use, we can change the compiler's ideas about
things, like adding function prototypes and even extending the base types:

    % perl -Mbigrat -le 'print 1/2 + 2/3 * 4/5'
    31/30

    % perl -Mbignum -le 'print 21->is_odd'
    1
    % perl -Mbignum -le 'print 18->is_odd'
    0

    % perl -Mbignum -le 'print substr(2**5000, -3)'
    376
    % perl -Mbignum -le 'print substr(2**5000-1, -3)'
    375

    % perl -Mbignum -le 'print length(2**5000)'
    1506
    % perl -Mbignum -le 'print length(10**5000)'
    5001

    % perl -Mbignum -le 'print ref 10**5000'
    Math::BigInt
    % perl -Mbigrat -le 'print ref 1/3'
    Math::BigRat

I recognize that redefining what sort of object the compiler treats some 
of its constants as is never going to happen in Python, but we actually
did manage that with charnames without having to subclass our strings:
the hook for \N{...} doesn't require object games like the ones above.

But it still has to happen at compile time, of course, so I don't know
what you could do in Python.  Is there any way to change how the compiler
behaves even vaguely along these lines?

The run-time lookups of Python's unicodedata.lookup (like Perl's
charnames::vianame) and unicodedata.name (like Perl's charnames::viacode
on the ord) could be managed with a hook, but the compile-time lookups
of \N{...} I don't see any way around.  But I don't know anything about
Python's internals, so don't even know what is or is not possible.
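
Here is a minimal sketch of what such a run-time hook might look like,
assuming a plain module-level alias table.  The names register_alias,
lookup, and name are hypothetical wrappers, not part of unicodedata, and
this does nothing for \N{...} in string literals, which the compiler
resolves before any of this code runs.

```python
import unicodedata

# Hypothetical user alias table: upper-cased alias -> character.
_aliases = {}

def register_alias(alias, char):
    """Register a user-defined name for a single character."""
    _aliases[alias.upper()] = char

def lookup(name):
    """Like unicodedata.lookup, but consult user aliases first."""
    try:
        return _aliases[name.upper()]
    except KeyError:
        return unicodedata.lookup(name)

def name(char, default=None):
    """Like unicodedata.name, but fall back to a user alias if one exists."""
    formal = unicodedata.name(char, None)
    if formal is not None:
        return formal
    for alias, aliased_char in _aliases.items():
        if aliased_char == char:
            return alias
    return default
```

After register_alias("APPLE LOGO", "\uf8ff"), lookup("APPLE LOGO") returns
the private-use character and name("\uf8ff") returns "APPLE LOGO", while
ordinary names still go through unicodedata.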

I do note that if you could extend \N{...} the way we do with charname
aliases for private-use characters, the user could load something that 
did the C0 and C1 controls if they wanted to.  I just don't know how to 
do that early enough that the Python compiler would see it.  Does your
import happen at run-time or at compile-time?  This would be some sort of
compile-time binding of constants.

>> Python doesn't require it. :)/2

> I actually find those *less* readable.  If there's something fancy in the
> regex, a comment *before* it is welcomed, but having to read a regex divided
> on several lines and remove meaningless whitespace and redundant comments
> just makes the parsing more difficult for me.

Really?  White space makes things harder to read?  I thought Pythonistas
believed the opposite of that.  Whitespace is very useful for cognitive
chunking: you see how things logically group together.

Inomorewantaregexwithoutwhitespacethananyothercodeortext. :)

I do grant you that chatty comments may be a separate matter.
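
For comparison, a small sketch of the same pattern written both ways in
Python, using re.VERBOSE for the spaced-out form.  The pattern itself is
only an illustration, and [^\W\d_] is a rough stand-in for \pL, not an
exact equivalent.

```python
import re

# Compact form: everything run together.
COMPACT = re.compile(r"([:;]\s+)([^\W\d_][^\W\d_']*)")

# Same pattern under re.VERBOSE: whitespace outside character classes is
# ignored, so the two groups can be set apart visually.
SPACED = re.compile(r"""
    ( [:;] \s+ )               # a colon or semicolon, then whitespace
    ( [^\W\d_] [^\W\d_']* )    # a letter, then letters or apostrophes
""", re.VERBOSE)
```

Both objects match the same strings; only the source layout differs.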

White space in patterns is also good when you have successive patterns
across multiple lines that have parts that are the same and parts that
are different, as in most of these, which is from a function to render
an English headline/book/movie/etc title into its proper casing:

    # put into lowercase if on our stop list, else titlecase
    s/  ( \pL [\pL']* )  /$stoplist{$1} ? lc($1) : ucfirst(lc($1))/xge;

    # capitalize a title's last word and its first word
    s/^ ( \pL [\pL']* )  /\u\L$1/x;  
    s/  ( \pL [\pL']* ) $/\u\L$1/x;  

    # treat parenthesized portion as a complete title
    s/ \( ( \pL [\pL']* )    /(\u\L$1/x;
    s/    ( \pL [\pL']* ) \) /\u\L$1)/x;

    # capitalize first word following colon or semi-colon
    s/ ( [:;] \s+ ) ( \pL [\pL']* ) /$1\u\L$2/x;

Now, that isn't good code for all *kinds* of reasons, but white space
is not one of them.  Perhaps what it is best at demonstrating is why
Python goes about this the right way and that Perl does not.  Oh drat,
I'm about to attach this to the wrong bug.  But it was the dumb code
above that made me think about the following.

By virtue of having a "titlecase each word's first letter and lowercase the
rest" function in Python, you can put the logic in just one place, and
therefore if a bug is found, you can fix all code all at once.

But because Perl has always made it easy to grab "words" (actually,
traditional programming language identifiers) and diddle their case, 
people write this all the time:

    s/(\w+)/\u\L$1/g;

and that has all kinds of problems.  If you prefer the
functional approach, that is really

    s/(\w+)/ucfirst(lc($1))/ge;

but that is still wrong.

 1. Too much code duplication.  Yes, it's nice to see \pL[\pL']* 
    stand out on each line, but shouldn't that be in a variable, like

        $word = qr/\pL[\pL']*/;

 2. What is a "word"?  That code above is better than \w because it
    avoids numbers and underscores; however, it still uses letters
    only, not letters and marks, let alone number letters like Roman
    numerals.

 3. I see the apostrophe there, which is a good start, but what if 
    it is a RIGHT SINGLE QUOTATION MARK, as in "Henry’s"?  And 
    what about hyphens?  Those should not trigger capitalization
    in normal titles.

 4. It turns out that all code that does a titlecase on the first 
    character of a string it has already converted to lowercase has
    irreversibly lost information.  Unicode casing is not reversible.
    Using \w for convenience, these can do different things:

        s/(\w+)/\u\L$1/g;
        s/(\w)(\w*)/\u$1\L$2/g;

    or in the functional approach, 

        s/(\w+)/ucfirst(lc($1))/ge;
        s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

    Now, it is true that only these code points do the wrong thing
    with the naïve approach under Unicode 6.0:

     % unichars -gas 'ucfirst ne ucfirst lc'
      İ  U+00130 GC=Lu SC=Latin        LATIN CAPITAL LETTER I WITH DOT ABOVE
      ϴ  U+003F4 GC=Lu SC=Greek        GREEK CAPITAL THETA SYMBOL
      ẞ  U+01E9E GC=Lu SC=Latin        LATIN CAPITAL LETTER SHARP S
      Ω  U+02126 GC=Lu SC=Greek        OHM SIGN
      K  U+0212A GC=Lu SC=Latin        KELVIN SIGN
      Å  U+0212B GC=Lu SC=Latin        ANGSTROM SIGN

    But it is still the wrong thing, and we never know what might happen
    in the future.
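
The same loss can be seen from Python.  This is only a sketch: it uses
str.title() on a single character as a stand-in for Perl's ucfirst, and the
function names are made up.  It checks the six code points listed above and
makes no claim about which others may be affected in later Unicode versions.

```python
# The six code points from the unichars listing above.
SUSPECTS = ["\u0130", "\u03f4", "\u1e9e", "\u2126", "\u212a", "\u212b"]

def naive(ch):
    """Lowercase first, then titlecase: the ucfirst(lc($1)) approach."""
    return ch.lower().title()

def careful(ch):
    """Titlecase the original character without lowercasing it first."""
    return ch.title()

def loses_information(ch):
    """True if lowercasing first changes the titlecased result."""
    return naive(ch) != careful(ch)
```

For example, KELVIN SIGN lowercases to a plain "k", and titlecasing that
yields LATIN CAPITAL LETTER K, not the KELVIN SIGN you started with.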

I think Python is being smarter than Perl in simply providing people with a
titlecase-each-word('s-first-letter-and-lowercase-the-rest)-in-the-whole-string
function, because this means people won't be tempted to write

    s/(\w+)/ucfirst(lc($1))/ge;

all the time.  However, as I have written elsewhere, I question a lot of
its underlying assumptions.  It's clear that a "word" must in general
include not just Letters but also Marks, or else you get different
results in NFD and NFC, and the Unicode Standard is very against that.

However, the problem is that what a word is cannot be considered
independent of language.  Words in English can contain apostrophes
(whether written as an APOSTROPHE or as RIGHT SINGLE QUOTATION MARK) 
and hyphens (written as HYPHEN-MINUS, HYPHEN, and rarely even EN DASH).

Each of these is a single word:

    ’tisn’t
    anti‐intellectual
    earth–moon

The capitalization there should be 

    ’Tisn’t
    Anti‐intellectual
    Earth–Moon

Notice how you can't do the same with the first apostrophe+t as with the
second on "’Tisn’t". That is all challenging to code correctly (did you
notice the EN DASH?), especially when you find something like
red‐violet–colored.  You probably want that to be Red‐violet–colored,
because it is not an equal compound like earth–moon or yin–yang, which
in correct orthography take an EN DASH not a HYPHEN, just as occurs
when you hyphenate an already hyphenated word like red‐violet against
colored, as in a red‐violet–colored flower.  English titling rules 
only capitalize the first word in hyphenated words, which is why it's
Anti‐intellectual not Anti-Intellectual.  
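
To make the rule concrete, here is a heuristic sketch for a single compound
word.  The rule it encodes (capitalize both sides of an EN DASH only when
neither side is itself hyphenated) is my reading of the examples above, not
an established algorithm, and the function names are hypothetical.

```python
HYPHENS = "-\u2010"   # HYPHEN-MINUS and HYPHEN
EN_DASH = "\u2013"

def cap(part):
    """Titlecase the first letter of `part` and lowercase what follows,
    skipping any leading non-letters such as an apostrophe."""
    for i, ch in enumerate(part):
        if ch.isalpha():
            return part[:i] + ch.title() + part[i + 1:].lower()
    return part

def title_compound(word):
    """Capitalize one word that may contain hyphens or an EN DASH."""
    has_hyphen = any(h in word for h in HYPHENS)
    if EN_DASH in word and not has_hyphen:
        # Equal compound such as earth-moon with an EN DASH: cap each side.
        return EN_DASH.join(cap(p) for p in word.split(EN_DASH))
    # Hyphenated word, or EN DASH joining an already hyphenated word:
    # cap only the first element.
    return cap(word)
```

This gets the examples above right, but it is easy to construct words that
would fool it.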

And of course, you can't actually create something in true English
titlecase without having a stop list of articles and (short)
prepositions, and paying attention to whether it is the first or last word
in the title, and whether it follows a colon or semicolon.  Consider that
phrasal verbs are construed to take adverbs not prepositions, and so
"Bringing In the Sheaves" would be the correct capitalization of that song,
since "to bring in" is a phrasal verb, but "A Ringing in My Ears" would be
right for that.  It is remarkably complicated.  

With English titlecasing, you have to respect what your publishing house
considers a "short" preposition.  A common cut-off is that short preps
have 4 or fewer characters, but I have seen longer cutoffs.  Here is one
rather exhaustive list of English prepositions sorted by length:

 2: as  at  by  in  of  on  to  up  vs

 3: but  for  off  out  per  pro  qua  via

 4: amid atop down from into like near next onto over
    pace past plus sans save than till upon with

<cutoff point for O'Reilly Media>

 5: about above after among below circa given minus
    round since thru times under until worth

 6: across amidst around before behind beside beyond
    during except inside toward unlike versus within

 7: against barring beneath besides between betwixt
    despite failing outside through thruout towards without

10: throughout underneath

The thing is that prepositions become adverbs in phrasal verbs, like "to
go out" or "to come in", and all adverbs are capitalized.  So a complete
solution requires actual parsing of English!!!!  Just say no -- or stronger.

Merely getting something like this right:

    the lord of the rings: the fellowship of the ring  # Unicode lowercase
    THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING  # Unicode uppercase
    The Lord of the Rings: The Fellowship of the Ring  # English titlecase

is going to take a bit of work.  So is 

    the sad tale of king henry ⅷ   and caterina de aragón  # Unicode lowercase
    THE SAD TALE OF KING HENRY Ⅷ   AND CATERINA DE ARAGÓN  # Unicode uppercase
    The Sad Tale of King Henry Ⅷ   and Caterina de Aragón  # English titlecase

(and that must give the same answer in NFC vs NFD, of course.)
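
A minimal sketch of the stop-list approach, assuming whitespace-separated
words and a 4-character cutoff for prepositions.  The STOP set is a
hypothetical house list built from the table above plus a few articles and
conjunctions; "de" is included only so the particle in the second example
stays lowercase.  It knows nothing about phrasal verbs.

```python
# Hypothetical stop list: articles, conjunctions, prepositions of 2-4 letters.
STOP = set("""
    a an the and but or nor de
    as at by in of on to up vs
    for off out per pro qua via
    amid atop down from into like near next onto over
    pace past plus sans save than till upon with
""".split())

def english_title(text):
    """Titlecase an English title using a stop list.

    The first word, the last word, and any word following a colon or
    semicolon are always capitalized.  Runs of whitespace are collapsed.
    """
    words = text.split()
    last = len(words) - 1
    out = []
    force = True  # the first word is always capitalized
    for i, w in enumerate(words):
        bare = w.rstrip(":;,.!?")
        if force or i == last or bare.lower() not in STOP:
            # Titlecase the original first character; lowercase only the rest.
            out.append(w[0].title() + w[1:].lower())
        else:
            out.append(w.lower())
        force = w.endswith((":", ";"))
    return " ".join(out)
```

It handles the two examples above, but it renders "bringing in the sheaves"
as "Bringing in the Sheaves", which is exactly the phrasal-verb failure
described earlier.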

Plus what to do with something like num2ascii is ill-defined in English,
because having digits in the middle of a word is a very new phenomenon.
Yes, Y2K gets caps, but that is for another reason.  There is no agreement
on what one should do with num2ascii or people42see.  A function name
shouldn't be capitalized at all of course.

And that is just English.  Other languages have completely different rules.
For example, per Wikipedia's entry on the colon:

    In Finnish and Swedish, the colon can appear inside words in a
    manner similar to the English apostrophe, between a word (or
    abbreviation, especially an acronym) and its grammatical (mostly
    genitive) suffixes. In Swedish, it also occurs in names, for example
    Antonia Ax:son Johnson (Ax:son for Axelson). In Finnish it is used
    in loanwords and abbreviations; e.g., USA:han for the illative case
    of "USA". For loanwords ending orthographically in a consonant but
    phonetically in a vowel, the apostrophe is used instead: e.g. show'n
    for the genitive case of the English loan "show" or Versailles'n for
    the French place name Versailles.

Isn't that tricky!  I guess that you would have to treat punctuation
that has a word character immediately following it (and immediately 
preceding it) as being part of the word, and that it doesn't signal
that a change in case is merited.

I'm really not sure. It is not obvious what the right thing to do here is.
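
One way to try out that guess in Python.  This is a sketch only: the
pattern treats any single non-word, non-space character flanked by letters
as part of the word, [^\W\d_] is a rough stand-in for "letter", and the
function names are made up.

```python
import re

# A "word": runs of letters, optionally joined by one non-word, non-space
# character that has letters immediately before and after it.
WORD = re.compile(r"[^\W\d_]+(?:[^\w\s][^\W\d_]+)*")

def _cap(match):
    w = match.group(0)
    return w[0].title() + w[1:].lower()

def title_keep_inner_punct(text):
    """Titlecase each word, leaving interior punctuation case-neutral."""
    return WORD.sub(_cap, text)
```

It keeps Ax:son and show'n intact, but it also turns earth–moon into
Earth–moon, so it is no answer for equal compounds, and it does not see
combining marks in NFD text as letters.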

I do believe that Python's titlecase function can and should be fixed to
work correctly with Unicode.  There really is no excuse for turning Aragón
into AragóN, for example, or not doing the right thing with ⅷ   and Ⅷ  .
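
A short demonstration of the complaint, hedged in that it reflects the
CPython 3 behavior I am aware of: str.title() restarts capitalization after
any character that is not cased, and a combining mark is not cased.  The
helper name is made up for the example.

```python
import unicodedata

NFC_WORD = unicodedata.normalize("NFC", "arag\u00f3n")
NFD_WORD = unicodedata.normalize("NFD", "arag\u00f3n")

def same_after_title(a, b):
    """Compare the title() results of two strings, ignoring normalization."""
    return (unicodedata.normalize("NFC", a.title())
            == unicodedata.normalize("NFC", b.title()))
```

Here NFC_WORD.title() comes out as expected, while NFD_WORD.title()
capitalizes the final letter because the COMBINING ACUTE ACCENT before it
ends the "word".  The two forms are canonically equivalent, so
same_after_title(NFC_WORD, NFD_WORD) ought to be true, but it is not.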

I fear the only thing you can do with the confusion of Unicode titlecase
and English titlecase is to explain that properly rendering English titles
and headlines is a much more complicated job which you will not even
attempt.  (And shouldn't. English titlecase is clearly too specialized for a
general function.)

However, I'm still bothered by things with apostrophes.

    can't 
    isn't 
    wouldn't've
    Bill's
    'tisn't

since I can't countenance the obviously wrong:

    Can'T 
    Isn'T 
    Wouldn'T'Ve
    Bill'S
    'Tisn'T

with the last the hardest to get right.  I do have code that correctly
handles English words and code that correctly handles English titles,
but it is much trickier than the titlecase() function.

And Swedes might be upset seeing Antonia Ax:Son Johnson instead 
of Antonia Ax:son Johnson.

Maybe we should just go back to the Pythonic equivalent of 

    s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

where \w is specifically per tr18's Annex C, and give up on punctuation
altogether, with a footnoted caveat or something.  I wouldn't complain
about that.  The rest is just too, too hard.  Wouldn't you agree?
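
A sketch of that Pythonic equivalent.  Python's re module has no \p{...}
classes and its \w does not include combining marks, so this version groups
characters by General_Category instead.  The category test is only my
approximation of the Annex C definition, and both function names are
hypothetical.

```python
import unicodedata
from itertools import groupby

def is_word_char(ch):
    """Approximate tr18 Annex C \\w using General_Category:
    letters, marks, decimal and letter numbers, connector punctuation,
    plus the two join controls."""
    cat = unicodedata.category(ch)
    return (cat[0] in "LM" or cat in ("Nd", "Nl", "Pc")
            or ch in "\u200c\u200d")

def titlecase_words(text):
    """Titlecase the first character of each word and lowercase the rest,
    without lowercasing the first character beforehand."""
    out = []
    for is_word, run in groupby(text, key=is_word_char):
        run = "".join(run)
        if is_word:
            run = run[0].title() + run[1:].lower()
        out.append(run)
    return "".join(out)
```

Because marks count as word characters, NFC and NFD input give equivalent
results.  Punctuation is given up on entirely, so "can't" still becomes
"Can'T", which is the footnoted caveat.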

Thank you very much for all your hard work -- and patience with me.

--tom

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________