Encoding of surrogate code points to UTF-8
Steven D'Aprano
steve at pearwood.info
Wed Oct 9 02:20:05 EDT 2013
On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:
> On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
>> On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
>>
>>> In any case, "\ud800\udc01" isn't a valid unicode string.
>>
>> I don't think this is correct. Can you show me where the standard says
>> that Unicode strings[1] may not contain surrogates? I think that is a
>
> see below.
>
>> critical point, and the FAQ conflates *encoded strings* (i.e. bytes
>> using one of the UTFs) with *Unicode strings*.
>>
>> The string you give above is a Unicode string containing two code
>> points, the surrogates U+D800 U+DC01, which as far as I am concerned is
>> a legal string (subject to somebody pointing me to a definitive source
>> that proves it is not). However, it *may or may not* be encodable to
>> bytes using UTF-8, -16 or -32.
>
> From chapter two of the standard.
>
> "Plain text is a pure sequence of character codes; plain Unicode-encoded
> text is therefore a sequence of Unicode character codes."
Also, Unicode has many valid code points that are not characters: the 66
explicitly defined noncharacters, plus the 2048 surrogates. So defining
Unicode strings in terms of characters is less than helpful, since it
excludes every string that isn't "text" because it contains one of those
code points.
Also, "character" in the context of Unicode is ambiguous, due to
normalization and decomposition: a single character can be represented
by up to four distinct code point sequences, one per normalization form
(NFC, NFD, NFKC, NFKD).
http://www.macchiato.com/unicode/nfc-faq
*Code points* are rigorously defined, not characters, which is why I have
tried very hard to only refer to code points and bytes, not characters.
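
To make the ambiguity concrete, here is a small Python 3 sketch using the
standard unicodedata module. The sample string (U+1E9B U+0323) is an
illustrative choice, not something taken from this thread; it should come
out as a different code point sequence under each of the four
normalization forms.

```python
import unicodedata

# Illustrative sample (an assumption, not from the thread):
# LATIN SMALL LETTER LONG S WITH DOT ABOVE followed by COMBINING DOT BELOW.
sample = "\u1e9b\u0323"

# Normalize the same "character" under each of the four forms.
forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

# Show how many code points each form uses, and which ones.
for name, text in forms.items():
    code_points = " ".join("U+%04X" % ord(c) for c in text)
    print(name, len(text), code_points)
```

The number of code points is well defined in every case; the number of
"characters" depends on which form you are looking at.
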
> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> encoding forms can be used to represent the full range of encoded
> characters in the Unicode Standard; ... Each of the three Unicode
> encoding forms can be efficiently transformed into either of the other
> two without any loss of data."
This merely says "encodings encode characters". We know that encodings
can also encode non-characters, at least *some* non-characters. The
question is, can they encode surrogates?
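
As a quick check of the "some non-characters" claim, the sketch below
round-trips three of the 66 noncharacters through Python 3's UTF codecs.
The helper name round_trips and the choice of code points are purely
illustrative, and the result reflects how CPython's codecs behave rather
than being a statement about the standard itself.

```python
# Illustrative picks from the 66 noncharacters (an assumption of this sketch).
noncharacters = ["\ufdd0", "\ufffe", "\U0010ffff"]

def round_trips(text, encoding):
    """Return True if text survives encoding followed by decoding."""
    return text.encode(encoding).decode(encoding) == text

# The -le variants are used so that no byte order mark is involved.
results = {
    (hex(ord(c)), enc): round_trips(c, enc)
    for c in noncharacters
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
print(results)
```
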
> "Surrogates Area. The Surrogates Area contains only surrogate code
> points and no encoded characters. See Section 16.6, Surrogates Area, for
> more detail."
>
> Before utf-16, the surrogates area was, I believe, part of the Private
> Use Area (which now starts where surrogates end). I think it would have
> been better if they were no longer called code points, but simply utf-16
> code units.
Private Use is irrelevant, since strings certainly can contain Private
Use code points, and UTF encodings can encode them.
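
For instance, a minimal Python 3 sketch (the two Private Use code points
are arbitrary examples, one from the BMP and one from plane 16):

```python
# Arbitrary Private Use code points chosen for illustration.
private_use = "\ue000\U0010fffd"

# Encode with each UTF; the -le variants avoid a byte order mark.
encoded = {
    enc: private_use.encode(enc)
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

# Every encoding should decode back to the original string.
decoded_ok = all(
    data.decode(enc) == private_use for enc, data in encoded.items()
)
print(encoded["utf-8"], decoded_ok)
```
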
>> Just as there are byte sequences that cannot be generated by the UTFs,
>> possibly there are code point sequences that cannot be converted to
>> bytes using the UTFs.
>
> True, but not to the point. You switched from sequences of characters
> (unicode text), which is what both I and Neil are talking about, to
> sequences of codepoints which is a larger set when you include the
> non-character surrogate 'code points' that are not allowed in unicode
> text.
I never mentioned sequences of characters. I've always talked about code
points.
> http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf#G7404
>
> "The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences."
Ah! Now we're getting somewhere! I think you've hit the nail on the head:
the three UTF forms explicitly exclude the surrogates. So I think we now
have an answer:
Surrogate code points can exist in Unicode strings, but cannot be encoded
to bytes using the standard UTF-8, UTF-16 and UTF-32 encodings.
There may be other encodings, or error handlers, which are capable of
handling surrogates, but they aren't UTF-8. So I think this answers my
question. (I reserve the right to change my mind after reading more of
the standard.)
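
A sketch of how this looks from Python 3, assuming a reasonably recent
CPython (older releases may treat UTF-16 and UTF-32 differently). The
helper try_encode is made up for the example; "surrogatepass" is one of
the standard error handlers, and its output is deliberately not valid
UTF-8.

```python
# Two surrogate code points held in an ordinary str.
s = "\ud800\udc01"

def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or None if the codec refuses the text."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None

# Strict encoding with each UTF; the -le variants avoid a byte order mark.
strict = {
    enc: try_encode(s, enc)
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

# The surrogatepass error handler writes the surrogates out anyway.
passed = try_encode(s, "utf-8", "surrogatepass")

print(len(s), strict, passed)
```

The string itself is perfectly happy to exist with length 2; it is only
the strict codecs that refuse it.
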
Thank you to everyone who replied.
--
Steven