[Python-ideas] Add "has_surrogates" flags to string object

Wed Oct 9 04:09:11 CEST 2013

Sorry. I don't think what I said contributed to the conversation very well.
Let me try again.

On Tue, Oct 8, 2013 at 5:55 PM, Steven D'Aprano <steve at pearwood.info> wrote:

> On Tue, Oct 08, 2013 at 01:37:54PM -0700, Bruce Leban wrote:
>
> The question isn't what UTF-8 should do with supplementary characters
> (those outside the BMP). That is well-defined, and Python 3.3 gets it
> right. The question is what it should do with pairs of surrogates.
> Ill-formed surrogates are rightly illegal when encoding to UTF-8:
>
> The only thing that I'm not sure about is how to deal with *valid*
> pairs of surrogates:
>
> '\ud800\udc01'.encode('utf-8') should do what?
>
I don't think that's valid. While it is a sequence of Unicode *codepoints*
(the Python definition of a unicode string), it is not a sequence of Unicode
*characters*. Arguably, Python should insist that a Unicode string be a
sequence of Unicode characters and reject '\ud800\udc01' at compile time,
just as it rejects '\U01010101', since none of these are valid Unicode
characters. However, I concede that is unlikely to happen.
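
To make the difference concrete, here is a minimal sketch of how a
current CPython 3 treats the two literals. The helper name
literal_compiles is made up for illustration, and compile() is used only
so that the SyntaxError can be caught at run time.

```python
def literal_compiles(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# Raw strings keep the backslash escapes intact until compile() sees them.
surrogate_literal = r"'\ud800\udc01'"    # two surrogate code points
out_of_range_literal = r"'\U01010101'"   # beyond U+10FFFF

surrogates_accepted = literal_compiles(surrogate_literal)
out_of_range_accepted = literal_compiles(out_of_range_literal)

# The surrogate literal yields a str of two separate code points,
# not one supplementary character.
s = eval(surrogate_literal)
code_points = [hex(ord(c)) for c in s]

print(surrogates_accepted, out_of_range_accepted, code_points)
```

So the surrogate pair is accepted as a string of length 2, while the
out-of-range escape is rejected before the code ever runs.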

Here's how I read the FAQ. Most of this FAQ is written in terms of
converting one representation to another. Python strings are not one of
those representations.

A *Unicode transformation format* (UTF) is an algorithmic mapping from
every Unicode code point (except surrogate code points) to a unique byte
sequence.
http://www.unicode.org/faq/utf_bom.html#gen2

To convert UTF-X to UTF-Y, you convert the UTF-X to a sequence of
characters and then convert that to UTF-Y. Note that this excludes
surrogate code points -- they are not representable in the sequence of code
points that a UTF defines.
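
As a rough illustration of that conversion, assuming Python's built-in
codecs stand in for UTF-X and UTF-Y: the UTF-16 code units D800 DC01
decode to the single scalar value U+10001, which then encodes to four
UTF-8 bytes. A str that holds the two surrogates as separate code points
has no UTF-8 form at all.

```python
# UTF-16-LE bytes for the code units D800 DC01 (a well-formed pair).
utf16_bytes = b"\x00\xd8\x01\xdc"

# Step 1: UTF-16 -> sequence of characters (one scalar value here).
text = utf16_bytes.decode("utf-16-le")

# Step 2: sequence of characters -> UTF-8.
utf8_bytes = text.encode("utf-8")

# The same arithmetic by hand: combine high and low surrogate.
high, low = 0xD800, 0xDC01
scalar = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# A str containing the surrogates themselves is not encodable.
try:
    "\ud800\udc01".encode("utf-8")
    surrogates_encodable = True
except UnicodeEncodeError:
    surrogates_encodable = False

print(len(text), hex(ord(text)), utf8_bytes, hex(scalar), surrogates_encodable)
```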

The definition of UTF-32 says:

Any Unicode character can be represented as a single 32-bit unit in UTF-32.
This single 4-byte code unit corresponds to the Unicode scalar value, which is
the abstract number associated with a Unicode character.
http://www.unicode.org/faq/utf_bom.html#utf32-1

Thus a surrogate codepoint is NOT allowed in UTF-32 as it is not a
character and if it is encountered it should be treated as an error.
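
For what it's worth, here is a small sketch of that rule as enforced by
the UTF-32 codecs in CPython 3.4 and later (3.3, the version discussed in
this thread, was more lenient here). The helper names try_encode and
try_decode are made up for illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec refuses the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


def try_decode(data, encoding):
    """Return the decoded str, or None if the codec refuses the bytes."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# A supplementary character is one 32-bit unit holding its scalar value.
supplementary = try_encode("\U00010001", "utf-32-le")

# A lone surrogate code point is not a scalar value, so it is an error
# in both directions.
lone_surrogate = try_encode("\ud800", "utf-32-le")
surrogate_unit = try_decode(b"\x00\xd8\x00\x00", "utf-32-le")

print(supplementary, lone_surrogate, surrogate_unit)
```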

--- Bruce