Encoding of surrogate code points to UTF-8
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Tue Oct 8 18:30:41 EDT 2013
On Tue, 08 Oct 2013 15:14:33 +0000, Neil Cerutti wrote:
> In any case, "\ud800\udc01" isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says
that Unicode strings[1] may not contain surrogates? I think that is a
critical point, and the FAQ conflates *encoded strings* (i.e. bytes using
one of the UTFs) with *Unicode strings*.
The string you give above is a Unicode string containing two code
points, the surrogates U+D800 U+DC01, which as far as I am concerned is a
legal string (subject to somebody pointing me to a definitive source that
proves it is not). However, it *may or may not* be encodable to bytes
using UTF-8, -16 or -32.
Just as there are byte sequences that cannot be generated by the UTFs,
possibly there are code point sequences that cannot be converted to bytes
using the UTFs.
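For what it's worth, here is a rough sketch of that distinction. It assumes CPython 3.4 or later, where as far as I know the utf-16 and utf-32 encoders reject surrogates just as utf-8 does; other versions or implementations may behave differently. The names `s`, `failures` and `decoded_ok` are mine, purely for illustration.

```python
# Assumption: CPython 3.4+, strict error handling (the default).
s = "\ud800\udc01"      # a str holding two code points, both surrogates
print(len(s))           # the string itself is constructed without complaint

# Try each UTF in turn and record which ones refuse to encode the string.
failures = {}
for codec in ("utf-8", "utf-16", "utf-32"):
    try:
        s.encode(codec)
    except UnicodeEncodeError as exc:
        failures[codec] = exc.reason
print(failures)

# The mirror-image case: a byte sequence that no UTF-8 encoder produces.
try:
    b"\xff".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

This only shows what one implementation does. It does not settle what the standard says about surrogates in Unicode strings.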
> In a perfect
> world it would automatically get converted to '\u00010001' without
> intervention.
I certainly hope not, because Unicode string != UTF-16. This is
equivalent to saying:
When encoding the sequence of code points '\ud800\udc01' to UTF-8 bytes,
you should get the same result as if you treated the sequence of code
points as if it were bytes, decoded it using UTF-16, and then encoded
using UTF-8.
That would be a horrible, horrible design, since it privileges UTF-16 in
a completely inappropriate way. I *really* hope I am wrong, but I fear
that is my interpretation of the FAQ.
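To make the objection concrete, here is a sketch of the two behaviours side by side. It assumes CPython 3.4 or later, where the 'surrogatepass' error handler works with the utf-16 codecs as well as utf-8. The variable names are hypothetical, and I am not claiming that any encoder actually does the second thing by default.

```python
# Assumption: CPython 3.4+ ('surrogatepass' supported by the utf-16 codecs).
s = "\ud800\udc01"

# (a) Encode each code point on its own terms, letting surrogates through.
as_code_points = s.encode("utf-8", "surrogatepass")

# (b) The design I am objecting to: pretend the two code points are UTF-16
#     code units, combine them into one supplementary code point, and only
#     then encode to UTF-8.
combined = s.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
via_utf16 = combined.encode("utf-8")

print(as_code_points)   # six bytes, three per surrogate
print(via_utf16)        # four bytes, the UTF-8 form of U+10001
print(combined == "\U00010001")
```

Note that the single code point is spelled '\U00010001' with a capital U and eight hex digits. With a lowercase u, Python reads it as U+0001 followed by the four characters "0001".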
[1] Sequences of Unicode code points.
--
Steven