I misspoke. I meant to ask: "How do you normalize away surrogate pairs in UTF-16?" It was a rhetorical question. The point was just that decomposed characters can be handled by implicit or explicit normalization. Surrogate pairs can only be similarly normalized away if your model allows you to represent their normalized forms. A character model built on 16-bit UTF-16 code units would not.
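(A minimal Python sketch of the contrast, not part of the original exchange: NFC can collapse a decomposed character into a single code point, but a code point outside the BMP has no BMP equivalent, so its UTF-16 encoding needs a surrogate pair no matter which normal form you pick. The choice of U+10400 here is just an illustrative example.)

```python
import unicodedata

# Decomposed "é": 'e' followed by U+0301 COMBINING ACUTE ACCENT.
# NFC normalization collapses it to the single code point U+00E9.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# U+10400 DESERET CAPITAL LETTER LONG I lies outside the BMP.
# Normalization leaves it unchanged (it has no canonical equivalent),
# and its UTF-16 encoding still requires two 16-bit code units.
astral = "\U00010400"
print(unicodedata.normalize("NFC", astral) == astral)   # True
print(len(astral.encode("utf-16-le")) // 2)             # 2
```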
On 9/26/06, "Martin v. Löwis" <firstname.lastname@example.org> wrote:
> Paul Prescod schrieb:
> > There is at least one big difference between surrogate pairs and
> > decomposed characters. The user can typically normalize away
> > decompositions. How do you normalize away decompositions in a language
> > that only supports 16-bit representations?
>
> I don't see the problem: You use UTF-16; all normal forms (NFC, NFD,
> NFKC, NFKD) can be represented in UTF-16 just fine.
>
> It is somewhat tricky to implement a normalization algorithm in
> UTF-16, since you must combine surrogate pairs first in order to
> find out what the canonical decomposition of the code point is;
> but it's just more code, and no problem in principle.
>
> Regards,
> Martin
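(A small sketch of the step Martin describes, not from the thread itself: a normalizer working on 16-bit units must first combine each surrogate pair into a scalar value before it can look up that code point's decomposition. The function name is mine; the arithmetic is the standard UTF-16 pairing formula.)

```python
def combine_surrogates(hi, lo):
    """Combine a UTF-16 surrogate pair into a single code point.

    A UTF-16-based normalizer has to do this before it can consult
    the Unicode tables for the code point's canonical decomposition.
    """
    assert 0xD800 <= hi <= 0xDBFF, "expected a high surrogate"
    assert 0xDC00 <= lo <= 0xDFFF, "expected a low surrogate"
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

# U+10400 is encoded in UTF-16 as the pair D801 DC00.
print(hex(combine_surrogates(0xD801, 0xDC00)))  # 0x10400
```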