Grapheme clusters, a.k.a. real characters

Ben Finney ben+python at benfinney.id.au
Sat Jul 15 22:33:10 EDT 2017


MRAB <python at mrabarnett.plus.com> writes:

> You need to be careful about the terminology.

Definitely agreed.
>
> Is linefeed a character? You might call it a "control character", but
> it's not really a _character_, it's a control/format _code_.

And yet the ASCII and Unicode standards say code point 0x0A (U+000A LINE
FEED) is a character, by definition.

Rather than saying “no, it's not a character”, I think a more accurate
statement would be: a linefeed *is* a character in ASCII, but that
doesn't mean every other standard must agree.

Indeed it may be better to say: a line feed is a character and is also a
control code.
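
As a quick check of that claim, here is a minimal sketch assuming Python 3
and its standard `unicodedata` module (the variable names are just for
illustration). It shows that the line feed is a single code point whose
Unicode general category is “Cc” (Other, control):

```python
import unicodedata

lf = "\n"  # U+000A, the line feed

# A str of length 1: Python treats it as one character (one code point).
assert len(lf) == 1
assert ord(lf) == 0x0A

# General category "Cc" is "Other, control": a character that is also
# a control code.
lf_category = unicodedata.category(lf)
print(f"U+{ord(lf):04X} category={lf_category}")
```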

> Is an acute accent a character?

Yes, according to Unicode. ‘´’ (U+00B4 ACUTE ACCENT) is a character.

> No, it's a diacritic mark that's added to a character.

Lose the “no”, and I agree.

The acute accent is a character and *also* is a diacritic mark that is
added to a character. Unicode categorises U+00B4 as a character in the
categories “symbol” and “modifier”.

Note that those are not exclusive. It's entirely reasonable for a
concept to fit in multiple categories simultaneously.
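
To see what Unicode itself reports, the following sketch (again assuming
Python 3 and the standard `unicodedata` module; `describe` is a
hypothetical helper, not part of any library) looks up both the spacing
acute accent, U+00B4, and the combining form, U+0301. Both are characters
with names and categories, and the combining one can also be attached to
a base letter:

```python
import unicodedata

def describe(ch):
    """Return (code point label, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Spacing accent: general category "Sk" (Symbol, modifier).
spacing = describe("\u00b4")

# Combining accent: general category "Mn" (Mark, nonspacing).
combining = describe("\u0301")

print(spacing)
print(combining)

# The combining accent is a character in its own right, and it also
# composes with a base letter under NFC normalisation.
composed = unicodedata.normalize("NFC", "e\u0301")
print(len("e\u0301"), len(composed))
```

The two code points in "e\u0301" and the single code point in the
composed form render as the same grapheme, which is the distinction this
thread keeps running into.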

What is being revealed in this discussion is the folly of insisting on
exclusive categories for everything, and on terms having exactly one
meaning.

You are correct that we need to be clear which definition is being used.
But we cannot thereby say that other, different, definitions are
*necessarily* wrong. That is an extra claim that would need to be
demonstrated, and the mere fact of the difference is not sufficient.

-- 
 \          “It's dangerous to be right when the government is wrong.” |
  `\                                   —Francois Marie Arouet Voltaire |
_o__)                                                                  |
Ben Finney



