Encoding of surrogate code points to UTF-8

Wed Oct 9 02:20:05 EDT 2013

On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:

> On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
>> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>>
>>> In any case, "\ud800\udc01" isn't a valid unicode string.
>>
>> I don't think this is correct. Can you show me where the standard says
>> that Unicode strings[1] may not contain surrogates? I think that is a
> 
> see below.
> 
>> critical point, and the FAQ conflates *encoded strings* (i.e. bytes
>> using one of the UTFs) with *Unicode strings*.
>>
>> The string you give above is a Unicode string containing two code
>> points, the surrogates U+D800 U+DC01, which as far as I am concerned is
>> a legal string (subject to somebody pointing me to a definitive source
>> that proves it is not). However, it *may or may not* be encodable to
>> bytes using UTF-8, -16 or -32.
> 
> From chapter two of the standard.
> 
> "Plain text is a pure sequence of character codes; plain Unicode-encoded
> text is therefore a sequence of Unicode character codes."

Also, Unicode has many valid code points that are not characters: the 66
explicitly defined non-characters, plus the 2048 surrogates. So defining
Unicode strings in terms of characters is less than helpful, since it
excludes a whole bunch of strings on the grounds that they aren't "text"
because they include non-characters.

Also, "character" in the context of Unicode is ambiguous, due to 
normalization and decomposition: a single character can have up to four 
distinct forms.

http://www.macchiato.com/unicode/nfc-faq
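
To make the "four distinct forms" concrete, here is a minimal sketch using
the standard library's unicodedata.normalize. The sample string is my own
choice (it is the usual long-s example from the normalization annex), not
something from the FAQ above, and the variable names are just for
illustration.

```python
import unicodedata

# U+1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE
# followed by U+0323 COMBINING DOT BELOW.
s = "\u1e9b\u0323"

# Normalize the same string under each of the four normalization forms.
forms = {
    name: unicodedata.normalize(name, s)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

# Show each form as a sequence of code points, since the glyphs all
# look much the same on screen.
for name, value in forms.items():
    print(name, " ".join("U+%04X" % ord(c) for c in value))
```

All four results are different sequences of code points, even though a
reader would call every one of them "the same character".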

*Code points* are rigorously defined, not characters, which is why I have 
tried very hard to only refer to code points and bytes, not characters.

> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> encoding forms can be used to represent the full range of encoded
> characters in the Unicode Standard; ... Each of the three Unicode
> encoding forms can be efficiently transformed into either of the other
> two without any loss of data."

This merely says "encodings encode characters". We know that encodings 
can also encode non-characters, at least *some* non-characters. The 
question is, can they encode surrogates?
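
That *some* non-characters are encodable is easy to check. A minimal
sketch, assuming a recent CPython 3 and its built-in utf-8 codec; the
particular code points are just ones I picked from the set of 66.

```python
# U+FDD0, U+FFFE and U+FFFF are noncharacters in the BMP;
# U+10FFFF is the last noncharacter in the code space.
noncharacters = ["\ufdd0", "\ufffe", "\uffff", "\U0010ffff"]

for cp in noncharacters:
    data = cp.encode("utf-8")          # strict error handler, no exception
    assert data.decode("utf-8") == cp  # and the bytes round-trip
    print("U+%04X -> %s" % (ord(cp), data.hex()))
```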

> "Surrogates Area. The Surrogates Area contains only surrogate code
> points and no encoded characters. See Section 16.6, Surrogates Area, for
> more detail."
> 
> Before utf-16, the surrogates area was, I believe, part of the Private
> Use Area (which now starts where surrogates end). I think it would have
> been better if they were no longer called code points, but simply utf-16
> code units.

Private Use is irrelevant, since strings certainly can contain Private 
Use code-points, and UTF encodings can encode them.
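
A quick check of that claim, again only a sketch assuming a recent
CPython 3. The three code points are arbitrary picks from the BMP Private
Use Area and from plane 15.

```python
# Two BMP Private Use code points and one from Supplementary PUA-A.
pua = "\ue000\uf8ff\U000f0000"

# Encode with each UTF and confirm the bytes decode back unchanged.
round_trips = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = pua.encode(codec)
    round_trips[codec] = (data.decode(codec) == pua)

print(round_trips)
```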

>> Just as there are byte sequences that cannot be generated by the UTFs,
>> possibly there are code point sequences that cannot be converted to
>> bytes using the UTFs.
> 
> True, but not to the point. You switched from sequences of characters
> (unicode text), which is what both I and Neil are talking about, to
> sequences of codepoints which is a larger set when you include the
> non-character surrogate 'code points' that are not allowed in unicode
> text.

I never mentioned sequences of characters. I've always talked about code 
points.

> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404
> 
> "The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."

Ah! Now we're getting somewhere! I think you've hit the nail on the head: 
the three UTF forms explicitly exclude the surrogates. So I think we now 
have an answer:

Surrogate code points can exist in Unicode strings, but cannot be encoded 
to bytes using the standard UTF-8, UTF-16 and UTF-32 encodings.
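
For anyone who wants to see this at the interpreter, here is a minimal
sketch. It assumes a recent CPython 3; older 3.x releases were more
lenient in the utf-16 and utf-32 codecs, so results there may differ.
The helper try_encode is hypothetical, written only for this example.

```python
# Neil's string: two surrogate code points in a perfectly ordinary str.
s = "\ud800\udc01"
print(len(s), [hex(ord(c)) for c in s])


def try_encode(text, codec):
    """Return the encoded bytes, or the error reason if encoding fails."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return exc.reason


# With the default (strict) error handler, none of the UTFs accept it.
results = {codec: try_encode(s, codec)
           for codec in ("utf-8", "utf-16", "utf-32")}

for codec, outcome in results.items():
    print(codec, "->", outcome)
```

The string exists and has length two; it is only the step to bytes that
is refused.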

There may be other encodings, or error handlers, which are capable of 
handling surrogates, but they aren't UTF-8. So I think this answers my 
question. (I reserve the right to change my mind after reading more of 
the standard.)
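
One such error handler is Python's own "surrogatepass". A minimal sketch,
assuming CPython 3; the output bytes are not valid UTF-8 as far as the
standard is concerned, which is rather the point.

```python
s = "\ud800\udc01"

# surrogatepass writes each surrogate using the ordinary three-byte
# UTF-8 bit pattern instead of raising UnicodeEncodeError.
raw = s.encode("utf-8", errors="surrogatepass")
print(raw)

# A strict UTF-8 decoder refuses those bytes...
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# ...but the same handler on the decoding side gets the string back.
round_trip = raw.decode("utf-8", errors="surrogatepass")
print(strict_decode_ok, round_trip == s)
```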

Thank you to everyone who replied.

-- 
Steven