[Tutor] close, but no cigar

Tue Jul 23 21:23:56 CEST 2013

On 24/07/13 03:01, Marc Tompkins wrote:
> On Tue, Jul 23, 2013 at 7:46 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>
>> This is not quite as silly as saying that an English E, a German E and a
>> French E should be considered three distinct characters, but (in my
>> opinion) not far off it.
>>
>
> I half-agree, half-disagree.  It's true that the letter "E" is used
> more-or-less the same in English, French, and German; after all, they all
> use what's called the "Latin" alphabet, albeit with local variations.  On
> the other hand, the Cyrillic alphabet contains several letters that are
> visually identical to their Latin equivalents, but used quite differently -
> so it's quite appropriate that they're considered different letters, and
> even a different alphabet.

Correct. Even if they were the same, Unicode would still treat them as different if legacy encoding systems did. For example, \N{DIGIT FOUR} and \N{FULLWIDTH DIGIT FOUR} have distinct code points, even though they represent exactly the same digit, because some legacy East Asian encodings had separate characters for "full-width" and "half-width" forms.
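
If you want to see this for yourself, here is a small sketch for Python 3 using only the standard unicodedata module. The names "half" and "full" are just labels I made up for the example:

```python
import unicodedata

half = "\N{DIGIT FOUR}"            # the ordinary ASCII digit
full = "\N{FULLWIDTH DIGIT FOUR}"  # the legacy East Asian "full-width" form

# Two different code points, with two different names.
print(hex(ord(half)), unicodedata.name(half))   # 0x34 DIGIT FOUR
print(hex(ord(full)), unicodedata.name(full))   # 0xff14 FULLWIDTH DIGIT FOUR

# As strings they compare unequal...
print(half == full)                                  # False

# ...but compatibility normalization (NFKC) folds the full-width
# form back onto the ordinary digit.
print(unicodedata.normalize("NFKC", full) == half)   # True

# Both have the same numeric value.
print(unicodedata.digit(half), unicodedata.digit(full))   # 4 4
```

So the two are kept apart for the sake of round-tripping with the old encodings, and normalization is there for when you would rather treat them as the same thing.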

But I confess I have misled you. I wrote about the CJK controversy from memory, and I'm afraid I got it completely backwards: the problem is that the glyphs (images of the characters) are different, but not the meaning. Mea culpa.

For example, in English, we can draw the dollar sign $ in two distinct ways, with one vertical line or with two. Unicode treats them as the same character (as do English speakers). "Han Unification" refers to Unicode's choice to do the same for many Han (Chinese, Japanese, Korean) ideographs with different appearance but the same meaning. For various reasons, some technical, some social, this choice proved to be unpopular, particularly in Japan.

This issue is nothing new -- Unicode supports about 71,000 distinct East Asian ideographs, which is *far* more than the old legacy encodings were capable of representing, so if there is a Han character that you would like to write which Unicode doesn't support, chances are that neither does any other encoding system.
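
To make the unification point concrete, here is a rough Python 3 sketch. I have picked U+76F4 as the sample because it is often cited as an ideograph that is conventionally drawn differently in Chinese and Japanese typography; that choice, and the variable name "ch", are mine and purely for illustration:

```python
import unicodedata

ch = "\u76f4"   # a unified Han ideograph, chosen only as an example

# One code point and one (algorithmically generated) name, with
# nothing in the string itself to say "Chinese" or "Japanese".
print(len(ch))                # 1
print(hex(ord(ch)))           # 0x76f4
print(unicodedata.name(ch))   # CJK UNIFIED IDEOGRAPH-76F4

# The same is true of the dollar sign: one character, however
# many vertical strokes the font decides to draw.
print(unicodedata.name("$"))  # DOLLAR SIGN
```

Which shape you actually see on screen is decided by the font (and whatever language hints the application passes along), not by the text itself.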

More here:

https://en.wikipedia.org/wiki/Han_unification
http://www.unicode.org/faq/han_cjk.html
http://slashdot.org/story/01/06/06/0132203/why-unicode-will-work-on-the-internet

-- 
Steven