<br><br><div><span class="gmail_quote">On 9/25/06, <b class="gmail_sendername">Jim Jewett</b> &lt;<a href="mailto:firstname.lastname@example.org">email@example.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
As David Hopwood pointed out, to be fully correct, you already have to<br>create a custom function even with bmp characters, because of<br>decomposed characters. (Example: Representing a c-cedilla as a c and<br>a combining cedilla, rather than as a single code point.) Separating
<br>those two would be wrong. Counting them as two characters for slicing<br>purposes would usually be wrong.</blockquote><div><br></div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Even 32-bit representations are permitted to use surrogate pairs; it<br>just doesn't often make sense.</blockquote><div><br> There is at least one big difference between surrogate pairs and decomposed characters. The user can typically normalize away decompositions. How do you normalize away surrogate pairs in a language that only supports 16-bit representations?
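<br><br> To make the difference concrete, here is a minimal Python sketch. It assumes the standard <code>unicodedata</code> module; the helper name <code>utf16_units</code> and the choice of U+1D11E as the sample non-BMP character are mine, purely for illustration. NFC folds the c plus combining cedilla into one code point, but no normalization form reduces a non-BMP character to a single 16-bit code unit.<br>

```python
import unicodedata


def utf16_units(s):
    """Number of 16-bit code units needed to hold s in UTF-16 (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


# A decomposed c-cedilla: 'c' followed by U+0327 COMBINING CEDILLA.
decomposed = "c\u0327"

# NFC composes the pair into the single code point U+00E7.
composed = unicodedata.normalize("NFC", decomposed)
print(utf16_units(decomposed), utf16_units(composed))

# A character outside the BMP (U+1D11E MUSICAL SYMBOL G CLEF, chosen arbitrarily).
astral = "\U0001D11E"

# Every normalization form leaves it needing a surrogate pair in UTF-16.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, utf16_units(unicodedata.normalize(form, astral)))
```

The decomposition goes from two code units to one after normalization, while the non-BMP character stays at two code units under every form.<br>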
<br></div><br> Paul Prescod<br><br></div>