Add "has_surrogates" flags to string object

Here is an idea about adding a mark to the PyUnicode object which allows a fast answer to the question of whether a string contains surrogate codes. This mark has one of three possible states:

* String doesn't contain surrogates.
* String contains surrogates.
* It is still unknown.

We can combine this with the "is_ascii" flag in a 2-bit value:

* String is ASCII-only (and doesn't contain surrogates).
* String is not ASCII-only and doesn't contain surrogates.
* String is not ASCII-only and contains surrogates.
* String is not ASCII-only and it is still unknown whether it contains surrogates.

By default a string is created in the "unknown" state (if it is UCS2 or UCS4). After the first request it can be switched to "has surrogates" or "hasn't surrogates". The state of the result of concatenation or slicing can be determined from the states of the input strings. This will allow faster UTF-16 and UTF-32 encoding (and perhaps even slightly faster UTF-8 encoding), and faster conversion to wchar_t*, when a string has no surrogates (which is true in most cases).
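The proposed states and their propagation through concatenation can be sketched at the Python level. This is only an illustration of the semantics, not the C implementation; all names here (scan_state, concat_state, the state constants) are hypothetical:

```python
# Hypothetical sketch of the proposed 2-bit state and its propagation.
UNKNOWN, NO_SURROGATES, HAS_SURROGATES, ASCII_ONLY = "unknown", "no", "yes", "ascii"

def scan_state(s):
    """Compute the state by scanning; the proposal does this lazily,
    on the first request, and caches the answer in the flag."""
    if all(ord(c) < 0x80 for c in s):
        return ASCII_ONLY
    if any(0xD800 <= ord(c) <= 0xDFFF for c in s):
        return HAS_SURROGATES
    return NO_SURROGATES

def concat_state(a_state, b_state):
    """State of a + b derived from the input states, without rescanning."""
    if HAS_SURROGATES in (a_state, b_state):
        return HAS_SURROGATES
    if UNKNOWN in (a_state, b_state):
        return UNKNOWN
    if a_state == b_state == ASCII_ONLY:
        return ASCII_ONLY
    return NO_SURROGATES
```

Note that concatenation never has to fall back to "unknown" unless an input was already unknown, which is what makes caching the answer attractive.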

On 2013-10-08, at 13:43 , Serhiy Storchaka wrote:
I don't know the details of the flexible string representation, but I believed the names fit what was actually in memory. UCS2 does not have surrogate pairs, thus surrogate codes make no sense in UCS2, they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not codepoints, they have no reason to appear in either UCS2 or UCS4 outside of encoding errors.
UCS2 string without surrogate codes can be encoded in UTF-16 by memcpy().
Surrogate codes prevent that (modulo objections above) for slicing (not that it's a big issue I think, a guard can just check whether it's slicing within a surrogate pair, that only requires checking the first and last 2 bytes of the range) but not for concatenation right?
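The memcpy() fast path mentioned above can be sketched at the Python level with the array module. The real win would only materialize in C; `encode_utf16_fast` is a hypothetical helper that only works for surrogate-free BMP strings:

```python
import array
import sys

def encode_utf16_fast(s):
    """Fast path for BMP, surrogate-free strings: each code point is one
    16-bit code unit, so the buffer can be copied wholesale (memcpy in C).
    Raises OverflowError if any code point is above U+FFFF."""
    units = array.array('H', map(ord, s))
    if sys.byteorder == 'big':
        units.byteswap()  # emit little-endian, matching 'utf-16-le'
    return units.tobytes()

s = 'h\xe9llo \u0416'
assert encode_utf16_fast(s) == s.encode('utf-16-le')
```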

On Tue, Oct 08, 2013 at 01:58:20PM +0200, Masklinn wrote:
[...]
I welcome correction, but I think you're mistaken. Python 3.3 strings don't have surrogate *pairs*, but they can contain surrogate *code points*. Unicode states:

"Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range."

http://www.unicode.org/charts/PDF/UDC00.pdf
http://www.unicode.org/charts/PDF/UD800.pdf

So technically surrogates are "non-characters". That doesn't mean they are forbidden though; you can certainly create them, and encode them to UTF-16 and -32:

py> surr = '\udc80'
py> import unicodedata as ud
py> ud.category(surr)
'Cs'
py> surr.encode('utf-16')
b'\xff\xfe\x80\xdc'
py> surr.encode('utf-32')
b'\xff\xfe\x00\x00\x80\xdc\x00\x00'

However, you cannot encode single surrogates to UTF-8:

py> surr.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

as per the standard: http://www.unicode.org/faq/utf_bom.html#utf8-5

I *think* you are supposed to be able to encode surrogate *pairs* to UTF-8, if I'm reading the FAQ correctly, but it seems Python 3.3 doesn't support that.

In any case, it is certainly legal to have Unicode strings containing non-characters, including surrogates, and you can encode them to UTF-16 and -32. However, it looks like surrogates won't round trip in UTF-16, but they will in UTF-32:

py> surr.encode('utf-16').decode('utf-16') == surr
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 2-3: unexpected end of data
py> surr.encode('utf-32').decode('utf-32') == surr
True

So... I'm not sure why this will be useful. Presumably Unicode strings containing surrogate code points will be rare, and you can't encode them to UTF-8 at all, and you can't round trip them from UTF-16.

-- Steven
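For what it's worth, the stdlib does offer an escape hatch here: the 'surrogatepass' error handler lets lone surrogates round-trip through UTF-8, bypassing the strict codec's check. A small sketch (modern CPython behavior):

```python
# Lone surrogates are rejected by the strict UTF-8 codec, but the
# 'surrogatepass' error handler encodes/decodes them anyway.
surr = '\udc80'

encoded = surr.encode('utf-8', 'surrogatepass')
print(encoded)  # the 3-byte sequence a naive UTF-8 encoder would produce

decoded = encoded.decode('utf-8', 'surrogatepass')
print(decoded == surr)  # True: round-trips with the handler

try:
    surr.encode('utf-8')  # strict mode still refuses
except UnicodeEncodeError as e:
    print('strict rejects:', e.reason)
```

The resulting bytes are of course not valid UTF-8 in the conformance sense, which is the whole point of the thread.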

On 2013-10-08, at 15:02 , Steven D'Aprano wrote: [snipped early part as any response would be superseded by or redundant with the stuff below]
I'm reading the opposite, from http://www.unicode.org/faq/utf_bom.html#utf8-4:
Pairs of 3-byte sequences would be encoding each surrogate directly to UTF-8, whereas a single 4-byte sequence would be decoding the surrogate pair to a codepoint and encoding that codepoint to UTF-8. My reading of the FAQ makes the second interpretation the only valid one. So you can't encode surrogates (either lone or paired) to UTF-8, you can encode the codepoint encoded by a surrogate pair.
The UTF-32 section has a similar note to the UTF-8 one: http://www.unicode.org/faq/utf_bom.html#utf32-7
and the UTF-16 section points out: http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: Are there any 16-bit values that are invalid?
As far as I can read the FAQ, it is always invalid to encode a surrogate; surrogates are not to be considered codepoints (they're not just noncharacters[0]; noncharacters are codepoints), and a lone surrogate in a UTF-16 stream means the stream is corrupted, which should result in an error during transcoding to anything (unless some recovery mode is used to replace corrupted characters by some mark during decoding, I guess).
So... I'm not sure why this will be useful. Presumably Unicode strings containing surrogate code points will be rare
And they're a sign of a corrupted stream.

The FAQ reads a bit strangely, I think because it's written from the viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and UTF-32 are transcodings from that. Which does not apply to CPython and the FSR.

Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) must not contain surrogate codes: a surrogate pair should have been decoded from UTF-16 to a codepoint (then identity-encoded to UCS4), and a single surrogate should have been caught by the UTF-16 decoder and should have triggered the error handler at that point. A surrogate code in a CPython string means the string is corrupted[1]. Surrogates *may* appear in binary data, while building a UTF-16 bytestream by hand.

[0] since "noncharacter" has a well-defined meaning in Unicode, and only applies to 66 codepoints, a much smaller range than surrogates: http://www.unicode.org/faq/private_use.html#noncharacters
[1] note that this hinges on my understanding of "UCS2" in the FSR being actual UCS2; if it's UCS2-with-surrogates with a heuristic for switching between UCS2 and UCS4 depending on the number of surrogate pairs in the string, it does not apply

On Tue, Oct 08, 2013 at 03:48:18PM +0200, Masklinn wrote:
On 2013-10-08, at 15:02 , Steven D'Aprano wrote:
It's not that clear to me. I fear the Unicode FAQs don't distinguish between Unicode strings and bytes well enough for my liking :( But for the record, my interpretation is that if you have a pair of code points consisting of the same values as a valid surrogate pair, you should be able to encode to UTF-8. To give a concrete example:

Given:

c = '\N{LINEAR B SYLLABLE B038 E}'  # \U00010001
c.encode('utf-8')
=> b'\xf0\x90\x80\x81'

and:

c.encode('utf-16BE')  # encodes as a surrogate pair
=> b'\xd8\x00\xdc\x01'

then those same surrogates, taken as codepoints, should be encodable as UTF-8:

'\ud800\udc01'.encode('utf-8')
=> b'\xf0\x90\x80\x81'

I'd actually be disappointed if that were the case; I think that would be a poor design. But if that's what the Unicode standard demands, Python ought to support it. But hopefully somebody will explain to me why my interpretation is wrong :-)

[...]
Hmmm... well, that might explain it. If it's written by Java programmers for Java programmers, they may very well decide that having spent 20 years trying to convince people that string != ASCII, they're now going to convince them that string == UTF-16 instead :/
I think that interpretation is a bit strong. I think it would be fair to say that CPython strings may contain surrogates, but you can't encode them to bytes using the UTFs. Nor are there any byte sequences that can be decoded to surrogates using the UTFs. This essentially means that you can only get surrogates in a string using (e.g.) chr() or \u escapes, and you can't then encode them to bytes using UTF encodings.
Surrogates *may* appear in binary data, while building a UTF-16 bytestream by hand.
But there you're talking about bytes, not character strings. Byte strings can contain any bytes you like :-)

-- Steven

On 2013-10-08, at 16:20 , Steven D'Aprano wrote
That would be really weird, it'd mean an *encoder* has to translate a surrogate pair into the actual codepoint in some sort of weird UTF-specific normalization pass.
To be fair, it's not just java programmers, IIRC ICU uses UTF-16 as the internal encoding.
Yes, that's basically what I mean: I think surrogates only make sense in a bytestream, not in a unicode stream. Although I did not remember/was not aware of PEP 383 (thank you Stephen), which makes the Unicode spec irrelevant to what a Python string may contain. On 2013-10-08, at 16:31 , Stephen J. Turnbull wrote:
noncharacters are a very different case for what it's worth, their own FAQ clearly notes that they are valid full-fledged codepoints and must be encoded and preserved by UTFs: http://www.unicode.org/faq/private_use.html#nonchar7

On Tue, Oct 8, 2013 at 7:20 AM, Steven D'Aprano <steve@pearwood.info> wrote:
The FAQ is explicit that this is wrong:

"The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four byte sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4

It goes on to say that there is a widespread practice of doing it anyway in older software. Therefore, it might be acceptable to accept these mis-encoded characters when *decoding*, but they should never be generated when *encoding*. I'd prefer not to have that on by default given the history of overlong UTF-8 bugs (e.g., see http://blogs.msdn.com/b/michael_howard/archive/2008/08/22/overlong-utf-8-esc...). Essentially, if different decoders follow different rules, then you can sometimes sneak stuff through the permissive decoders.

Notwithstanding that, there is a different Unicode encoding, CESU-8, which does the opposite: it always encodes those characters requiring surrogate pairs as 6 bytes, consisting of two UTF-8-style encodings of the individual surrogate codepoints. Python doesn't support this, and the request to support it was rejected: http://bugs.python.org/issue12742

--- Bruce
I'm hiring: http://www.cadencemd.com/info/jobs
Latest blog post: Alice's Puzzle Page http://www.vroospeak.com
Learn how hackers think: http://j.mp/gruyere-security

Bruce Leban wrote:
Python's internal string representation is not UTF-16, though, so this doesn't apply directly. Seems to me it hinges on whether a pair of surrogate code points appearing in a Python string are meant to represent a single character or not. I would say not, because otherwise they would have been stored as a single code unit. -- Greg

On Tue, Oct 08, 2013 at 01:37:54PM -0700, Bruce Leban wrote:
And if you count the number of bytes, you will see four of them:

'\ud800\udc01'.encode('utf-8')
=> b'\xf0' b'\x90' b'\x80' b'\x81'

I stress that Python 3.3 doesn't actually do this, but my reading of the FAQ suggests that it should.

The question isn't what UTF-8 should do with supplementary characters (those outside the BMP). That is well-defined, and Python 3.3 gets it right. The question is what it should do with pairs of surrogates. Ill-formed surrogates are rightly illegal when encoding to UTF-8:

# a lone surrogate is illegal
'\ud800'.encode('utf-8') must be treated as an error

# two high surrogates, or two low surrogates
'\udc01\udc01'.encode('utf-8') must be treated as an error
'\ud800\ud800'.encode('utf-8') must be treated as an error

# if they're in the wrong order
'\udc01\ud800'.encode('utf-8') must be treated as an error

The only thing I'm not sure about is how to deal with *valid* pairs of surrogates:

'\ud800\udc01'.encode('utf-8') should do what?

I personally would hope that this too should raise, which is Python's current behaviour, but my reading of the FAQs is that it should be treated as if there were an implicit UTF-16 conversion. (I hope I'm wrong!) That is:

1) treat the sequence of code points as if it were a sequence of two 16-bit values b'\xd8\x00' b'\xdc\x01'
2) implicitly decode it using UTF-16 to get U+10001
3) encode U+10001 using UTF-8 to get b'\xf0\x90\x80\x81'

That would be (in my opinion) *horrible*, but that's my reading of the Unicode FAQ. The question asks:

"How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8?"

and the answer seems to be:

"The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four byte sequence."

which doesn't actually answer the question (the question is about SURROGATE PAIRS, the answer is about SUPPLEMENTARY CHARACTERS) but suggests the above horrible interpretation.
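The implicit conversion described in steps 1-3 above can actually be spelled out explicitly in modern Python (3.4+ for 'surrogatepass' with the UTF-16 codecs), which also shows that Python keeps the two strings distinct rather than performing the conversion silently. A sketch:

```python
pair = '\ud800\udc01'

# Step 1: reinterpret the surrogate code points as raw UTF-16 code units.
as_utf16 = pair.encode('utf-16-be', 'surrogatepass')
print(as_utf16)            # b'\xd8\x00\xdc\x01'

# Step 2: decode those code units as UTF-16, joining the pair.
joined = as_utf16.decode('utf-16-be')

# Step 3: the joined character encodes to the single four-byte sequence.
print(joined.encode('utf-8'))                    # b'\xf0\x90\x80\x81'
print(joined == '\N{LINEAR B SYLLABLE B038 E}')  # True
print(joined == pair)                            # False: distinct strings
```

In other words, the "horrible" behavior is available on request, but the codecs never do it implicitly.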
What I'm hoping for is a definite source that explains what the UTF-8 encoder is supposed to do with a Unicode string containing surrogates. (And presumably the other UTF encoders as well, although I haven't tried thinking about them yet.)
They are talking about the practice of generating six bytes, two three-byte sequences. You should notice that I'm not generating six bytes anywhere. -- Steven

Sorry. I don't think what I said contributed to the conversation very well. Let me try again. On Tue, Oct 8, 2013 at 5:55 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Here's how I read the FAQ. Most of this FAQ is written in terms of converting one representation to another. Python strings are not one of those representations.

"A *Unicode transformation format* (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence." http://www.unicode.org/faq/utf_bom.html#gen2

To convert UTF-X to UTF-Y, you convert the UTF-X to a sequence of characters and then convert that to UTF-Y. Note that this excludes surrogate code points -- they are not representable in the sequence of code points that a UTF defines. The definition of UTF-32 says:

"Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4 code unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character." http://www.unicode.org/faq/utf_bom.html#utf32-1

Thus a surrogate codepoint is NOT allowed in UTF-32, as it is not a character, and if it is encountered it should be treated as an error.

--- Bruce

On 10/8/2013 8:55 PM, Steven D'Aprano wrote:
And I already explained on python-list why that reading is wrong; transcoding a utf-16 string (a sequence of 2-byte words, subject to validity rules) is different from encoding unicode text (a character sequence, and surrogates are not characters). A utf-16 to utf-8 transcoder should (must) do the above, but in 3.3+ the utf-8 codec is no longer the utf-16 transcoder that it effectively was for narrow builds.

Each UTF form defines a one-to-one mapping between unicode texts and valid code unit sequences. (Unicode Standard, Chapter 3, definition D79.) Having both '\U00010001' and '\ud800\udc01' map to b'\xf0\x90\x80\x81' would violate that important property. '\ud800\udc01' represents a character in utf-16 but not in Python's flexible string representation. The latter uses one code unit (of variable size per string) per character, instead of a variable number of code units (of one size for all strings) per character.

Because machines have no conceptual, visual, or aural memory, but only byte memory, they must re-encode abstract characters to bytes to remember them. In pre-3.3 narrow builds, where utf-16 was used internally, decoding and encoding amounted to transcoding byte encodings into the utf-16 encoding, and vice versa. So utf-8 b'\xf0\x90\x80\x81' and utf-16 '\ud800\udc01' were mapped into each other. Whether the mapping was done directly or indirectly, via the character codepoint value, did not matter to the user.

In any case, the FSR no longer uses multiple-code-unit encodings internally, and '\ud800\udc01', even though allowed for practical reasons, does not represent and is not the same as '\U00010001'.

The proposed 'has_surrogates' flag amounts to a 'not strictly valid' flag. Only the FSR implementors can decide if it is worth the trouble.

-- Terry Jan Reedy

Steven D'Aprano writes:
According to PEP 383, which provides a special mechanism for roundtripping input that claims to be in a particular encoding but does not conform to that encoding: when encoding to UTF-8, if the errors= parameter *is* surrogateescape *and* the value is in the first row of the low surrogate range, it is masked by 0xff and emitted as a single byte. In all other cases of surrogates, it should raise an error. A conforming Unicode codec must not emit UTF-8 which would decode to a surrogate. These cases can occur in valid Python programs because chr() is unconstrained (for example).

On input, Unicode conformance means that when using the surrogateescape handler, an alleged UTF-8 stream containing a 6-byte sequence that would algorithmically decode to a surrogate pair should be represented internally as a sequence of 6 surrogates from the first row of the low surrogate range. If the surrogateescape handler is not in use, it should raise an error.

Sorry about not testing actual behavior, gotta run to a meeting. I forget what PEP 383 says about other Unicode codecs.
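The surrogateescape mechanics described above can be checked interactively. A minimal sketch of the PEP 383 round trip, where each undecodable byte >= 0x80 is smuggled as a low surrogate U+DC80..U+DCFF:

```python
# PEP 383: undecodable bytes map to the first row of the low surrogate range.
raw = b'abc\xff\xfe'

text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))  # the two bad bytes become \udcff and \udcfe

# Encoding with the same handler masks each surrogate by 0xff,
# restoring the original bytes exactly:
print(text.encode('utf-8', 'surrogateescape') == raw)  # True

# Without the handler, the smuggled surrogates refuse to encode:
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    print('strict encode raises')
```

This is exactly why a "has_surrogates" flag is not purely hypothetical bookkeeping: any string decoded with surrogateescape from malformed input will carry such code points.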

Masklinn writes:
No, it's written from the viewpoint that it says *nothing* about internal encodings, only about the encodings used in interchange of textual data, and about certain aspects of the processes that may receive and generate such data (eg, when data matches a Unicode regular expression, or how bidirectional text should appear visually).
Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) must not contain surrogate codes:
No, it says no such thing. All the Unicode Standard (and the FAQ) says is that if Python generates output that purports to be text encoded in Unicode, it may not contain surrogate codes except where those codes are used according to UTF-16 to encode characters in planes 2 to 17, and if it receives data alleged to be Unicode in some transformation format, it must raise an error if it receives surrogates other than a correctly formed surrogate pair in text known to be encoded as UTF-16.

In fact (as I wrote before without proper citation), the internal encoding of Python has been extended by PEP 383 to use a subset of the surrogate space to represent undecodable bytes in an octet stream, when the error handler is set to "surrogateescape". Furthermore, there is nothing to stop a Python unicode from containing any code unit (including both surrogates and other non-characters like 0xFFFF). Checking of the rules you cite is done by codecs, at encoding and decoding time.

On 08.10.13 16:02, Steven D'Aprano wrote:
So... I'm not sure why this will be useful.
This is a bug. http://bugs.python.org/issue12892

Masklinn writes:
True, but Python doesn't actually use UCS2 or UCS4 internally. It uses UCS2 or UCS4 plus a row of codes from the surrogate area to represent undecodable bytes. This feature is optional (enabled by using the appropriate error= setting in the codec), but I don't suppose it's going to go away.

On Tue, 08 Oct 2013 14:17:59 +0300, Serhiy Storchaka <storchaka@gmail.com> wrote:
Not true for slicing (you can take a surrogate-free slice of a string that contains surrogates). Other than that, this sounds reasonable to me, provided that the patch isn't too complex and the perf improvements are worth it.

Regards

Antoine.
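Antoine's caveat about slicing can be shown directly: a slice of a surrogate-carrying string may itself be surrogate-free, so only the "no surrogates" state could safely propagate to slices. A quick illustration (has_surrogates is a hypothetical stand-in for the proposed cached flag):

```python
def has_surrogates(s):
    # Direct scan; the proposed flag would cache this answer per string.
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

s = 'abc' + '\ud800' + 'def'
print(has_surrogates(s))       # True
print(has_surrogates(s[:3]))   # False: the slice avoided the surrogate
# So a slice of a "has surrogates" string must start in the "unknown"
# state, while a slice of a "no surrogates" string stays "no surrogates".
```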

On 08.10.2013 13:17, Serhiy Storchaka wrote:
I guess you could use one bit from the kind structure for that:

    /* Character size:

       - PyUnicode_WCHAR_KIND (0):

         * character type = wchar_t (16 or 32 bits, depending on the platform)

       - PyUnicode_1BYTE_KIND (1):

         * character type = Py_UCS1 (8 bits, unsigned)
         * all characters are in the range U+0000-U+00FF (latin1)
         * if ascii is set, all characters are in the range U+0000-U+007F
           (ASCII), otherwise at least one character is in the range
           U+0080-U+00FF

       - PyUnicode_2BYTE_KIND (2):

         * character type = Py_UCS2 (16 bits, unsigned)
         * all characters are in the range U+0000-U+FFFF (BMP)
         * at least one character is in the range U+0100-U+FFFF

       - PyUnicode_4BYTE_KIND (4):

         * character type = Py_UCS4 (32 bits, unsigned)
         * all characters are in the range U+0000-U+10FFFF
         * at least one character is in the range U+10000-U+10FFFF
     */
    unsigned int kind:3;

For some reason, it allocates 3 bits, but only 2 bits are used. Then again, the state struct is an unsigned int, so there's still plenty of room for extra flags.

-- Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Oct 08 2013)
2013-10-14: PyCon DE 2013, Cologne, Germany ... 6 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 08.10.2013 18:16, Serhiy Storchaka wrote:
Ok, then just add the flag to the end of the list... we'd still have at least 7 bits left on most platforms, IIRC.

PS: I guess this use of kind should be documented clearly somewhere. The unicodeobject.h file only hints at this, and for PyUnicode_WCHAR_KIND this interpretation cannot be used.

-- Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Oct 08 2013)

I like the idea. I prefer to add another flag (1 bit), instead of having a complex flag with 4 different values.

Your idea looks specific to PEP 393, so I prefer to keep the flag private. Otherwise it would be hard for other implementations of Python to implement the function getting the flag value.

Victor

2013/10/8 Serhiy Storchaka <storchaka@gmail.com>:

On 08.10.13 15:23, Victor Stinner wrote:
I like the idea. I prefer to add another flag (1 bit), instead of having a complex flag with 4 different values.
We need at least a 3-state value: yes, no, maybe. But combined with the is_ascii flag, we need only one additional bit. I don't think it should be more complex than that.
Yes, of course.

2013/10/8 Serhiy Storchaka <storchaka@gmail.com>:
Knowing if a string contains any surrogate character would also speed up the marshal and pickle modules: http://bugs.python.org/issue19219#msg199465

Victor

On 2013-10-08, at 13:43 , Serhiy Storchaka wrote:
I don't know the details of the flexible string representation, but I believed the names fit what was actually in memory. UCS2 does not have surrogate pairs, thus surrogate codes make no sense in UCS2, they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not codepoints, they have no reason to appear in either UCS2 or UCS4 outside of encoding errors.
UCS2 string without surrogate codes can be encoded in UTF-16 by memcpy().
Surrogate codes prevent that (modulo objections above) for slicing (not that it's a big issue I think, a guard can just check whether it's slicing within a surrogate pair, that only requires checking the first and last 2 bytes of the range) but not for concatenation right?

On Tue, Oct 08, 2013 at 01:58:20PM +0200, Masklinn wrote:
[...]
I welcome correction, but I think you're mistaken. Python 3.3 strings don't have surrogate *pairs*, but they can contain surrogate *code points*. Unicode states: "Isolated surrogate code points have no interpretation; consequently, no character code charts or names lists are provided for this range." http://www.unicode.org/charts/PDF/UDC00.pdf http://www.unicode.org/charts/PDF/UD800.pdf So technically surrogates are "non-characters". That doesn't mean they are forbidden though; you can certainly create them, and encode them to UTF-16 and -32: py> surr = '\udc80' py> import unicodedata as ud py> ud.category(surr) 'Cs' py> surr.encode('utf-16') b'\xff\xfe\x80\xdc' py> surr.encode('utf-32') b'\xff\xfe\x00\x00\x80\xdc\x00\x00' However, you cannot encode single surrogates to UTF-8: py> surr.encode('utf-8') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed as per the standard: http://www.unicode.org/faq/utf_bom.html#utf8-5 I *think* you are supposed to be able to encode surrogate *pairs* to UTF-8, if I'm reading the FAQ correctly, but it seems Python 3.3 doesn't support that. In any case, it is certainly legal to have Unicode strings containing non-characters, including surrogates, and you can encode them to UTF-16 and -32. However, it looks like surrogates won't round trip in UTF-16, but they will in UTF-32: py> surr.encode('utf-16').decode('utf-16') == surr Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'utf16' codec can't decode bytes in position 2-3: unexpected end of data py> surr.encode('utf-32').decode('utf-32') == surr True So... I'm not sure why this will be useful. Presumably Unicode strings containing surrogate code points will be rare, and you can't encode them to UTF-8 at all, and you can't round trip them from UTF-16. -- Steven

On 2013-10-08, at 15:02 , Steven D'Aprano wrote: [snipped early part as any response would be superseded by or redundant with the stuff below]
I'm reading the opposite, from http://www.unicode.org/faq/utf_bom.html#utf8-4:
Pairs of 3-byte sequences would be encoding each surrogate directly to UTF-8, whereas a single 4-byte sequence would be decoding the surrogate pair to a codepoint and encoding that codepoint to UTF-8. My reading of the FAQ makes the second interpretation the only valid one. So you can't encode surrogates (either lone or paired) to UTF-8, you can encode the codepoint encoded by a surrogate pair.
The UTF-32 section has similar note to UTF-8: http://www.unicode.org/faq/utf_bom.html#utf32-7
and the UTF-16 section points out: http://www.unicode.org/faq/utf_bom.html#utf16-7
Q: Are there any 16-bit values that are invalid?
As far as I can read the FAQ, it is always invalid to encode a surrogate, surrogates are not to be considered codepoints (they're not just noncharacters[0], noncharacters are codepoints), and a lone surrogate in a UTF-16 stream means the stream is corrupted, which should result in an error during transcoding to anything (unless some recovery mode is used to replace corrupted characters by some mark during decoding I guess).
So... I'm not sure why this will be useful. Presumably Unicode strings containing surrogate code points will be rare
And they're a sign of corrupted stream. The FAQ reads a bit strangely, I think because it's written from the viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and UTF-32 are transcoding from that. Which does not apply to CPython and the FSR. Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) must not contain surrogate codes: a surrogate pair should have been decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a single surrogate should have been caught by the UTF-16 decoder and should have triggered the error handler at that point. A surrogate code in a CPython string means the string is corrupted[1]. Surrogates *may* appear in binary data, while building a UTF-16 bytestream by hand. [0] since "noncharacter" has a well-defined meaning in unicode, and only applies to 66 codepoints, a much smaller range than surrogates: http://www.unicode.org/faq/private_use.html#noncharacters [1] note that this hinges on my understanding of "UCS2" in FSR being actual UCS2, if it's UCS2-with-surrogates with a heuristic for switching between UCS2 and UCS4 depending on the number of surrogate pairs in the string it does not apply

On Tue, Oct 08, 2013 at 03:48:18PM +0200, Masklinn wrote:
On 2013-10-08, at 15:02 , Steven D'Aprano wrote:
It's not that clear to me. I fear the Unicode FAQs don't distinguish between Unicode strings and bytes well enough for my liking :( But for the record, my interpretion is that if you have a pair of code points constisting of the same values as a valid surrogate pair, you should be able to encode to UTF-8. To give a concrete example: Given: c = '\N{LINEAR B SYLLABLE B038 E}' # \U00010001 c.encode('utf-8') => b'\xf0\x90\x80\x81' and: c.encode('utf-16BE') # encodes as a surrogate pair => b'\xd8\x00\xdc\x01' then those same surrogates, taken as codepoints, should be encodable as UTF-8: '\ud800\udc01'.encode('utf-8') => b'\xf0\x90\x80\x81' I'd actually be disappointed if that were the case; I think that would be a poor design. But if that's what the Unicode standard demands, Python ought to support it. But hopefully somebody will explain to me why my interpretation is wrong :-) [...]
Hmmm... well, that might explain it. If it's written by Java programmers for Java programmers, they may very well decide that having spent 20 years trying to convince people that string != ASCII, they're now going to convince them that string == UTF-16 instead :/
I think that interpretation is a bit strong. I think it would be fair to say that CPython strings may contain surrogates, but you can't encode them to bytes using the UTFs. Nor are there any byte sequences that can be decoded to surrogates using the UTFs. This essentially means that you can only get surrogates in a string using (e.g.) chr() or \u escapes, and you can't then encode them to bytes using UTF encodings.
Surrogates *may* appear in binary data, while building a UTF-16 bytestream by hand.
But there you're talking about bytes, not byte strings. Byte strings can contain any bytes you like :-) -- Steven

On 2013-10-08, at 16:20 , Steven D'Aprano wrote
That would be really weird, it'd mean an *encoder* has to translate a surrogate pair into the actual codepoint in some sort of weird UTF-specific normalization pass.
To be fair, it's not just java programmers, IIRC ICU uses UTF-16 as the internal encoding.
Yes, that's basically what I mean: I think surrogates only make sense in a bytestream, not in a unicode stream. Although I did not remember/was not aware of PEP 383 (thank you Stephen) which makes the Unicode spec irrelevant to what Python string contains. On 2013-10-08, at 16:31 , Stephen J. Turnbull wrote:
noncharacters are a very different case for what it's worth, their own FAQ clearly notes that they are valid full-fledged codepoints and must be encoded and preserved by UTFs: http://www.unicode.org/faq/private_use.html#nonchar7

On Tue, Oct 8, 2013 at 7:20 AM, Steven D'Aprano <steve@pearwood.info> wrote:
The FAQ is explicit that this is wrong: "The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four byte sequence." http://www.unicode.org/faq/utf_bom.html#utf8-4 It goes on to say that there is a widespread practice of doing it anyway in older software. Therefore, it might be acceptable to accept these mis-encoded characters when *decoding* but they should never be generated when *encoding*. I'd prefer not to have that on by default given the history of overlong UTF-8 bugs (e.g., see http://blogs.msdn.com/b/michael_howard/archive/2008/08/22/overlong-utf-8-esc...). Essentially if different decoders follow different rules, then you can sometimes sneak stuff through the permissive decoders. Notwithstanding that, there is a different unicode encoding CESU-8 which does the opposite: it always encodes those characters requiring surrogate pairs as 6 bytes consisting of two UTF-8-style encodings of the individual surrogate codepoints. Python doesn't support this and the request to support it was rejected: http://bugs.python.org/issue12742 --- Bruce I'm hiring: http://www.cadencemd.com/info/jobs Latest blog post: Alice's Puzzle Page http://www.vroospeak.com Learn how hackers think: http://j.mp/gruyere-security

Bruce Leban wrote:
Python's internal string representation is not UTF-16, though, so this doesn't apply directly. It seems to me it hinges on whether a pair of surrogate code points appearing in a Python string is meant to represent a single character or not. I would say not, because otherwise they would have been stored as a single code unit. -- Greg

On Tue, Oct 08, 2013 at 01:37:54PM -0700, Bruce Leban wrote:
And if you count the number of bytes, you will see four of them:

'\ud800\udc01'.encode('utf-8') => b'\xf0' b'\x90' b'\x80' b'\x81'

I stress that Python 3.3 doesn't actually do this, but my reading of the FAQ suggests that it should. The question isn't what UTF-8 should do with supplementary characters (those outside the BMP). That is well-defined, and Python 3.3 gets it right. The question is what it should do with pairs of surrogates.

Ill-formed surrogates are rightly illegal when encoding to UTF-8:

# a lone surrogate is illegal
'\ud800'.encode('utf-8')  must be treated as an error

# two high surrogates, or two low surrogates
'\udc01\udc01'.encode('utf-8')  must be treated as an error
'\ud800\ud800'.encode('utf-8')  must be treated as an error

# if they're in the wrong order
'\udc01\ud800'.encode('utf-8')  must be treated as an error

The only thing that I'm not sure about is how to deal with *valid* pairs of surrogates:

'\ud800\udc01'.encode('utf-8')  should do what?

I personally would hope that this too should raise, which is Python's current behaviour, but my reading of the FAQs is that it should be treated as if there were an implicit UTF-16 conversion. (I hope I'm wrong!) That is:

1) treat the sequence of code points as if it were a sequence of two 16-bit values: b'\xd8\x00' b'\xdc\x01'
2) implicitly decode it using UTF-16 to get U+10001
3) encode U+10001 using UTF-8 to get b'\xf0\x90\x80\x81'

That would be (in my opinion) *horrible*, but that's my reading of the Unicode FAQ. The question asks: "How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8?" and the answer seems to be: "The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four byte sequence." which doesn't actually answer the question (the question is about SURROGATE PAIRS, the answer is about SUPPLEMENTARY CHARACTERS) but suggests the above horrible interpretation.
What I'm hoping for is a definite source that explains what the UTF-8 encoder is supposed to do with a Unicode string containing surrogates. (And presumably the other UTF encoders as well, although I haven't tried thinking about them yet.)
They are talking about the practice of generating six bytes, two three-byte sequences. You should notice that I'm not generating six bytes anywhere. -- Steven
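For concreteness, the three-step "implicit UTF-16" interpretation Steven describes can be written out as code (a sketch of the interpretation under discussion, not of anything Python actually does; encode_pair_as_if_utf16 is a hypothetical helper):

```python
def encode_pair_as_if_utf16(hi, lo):
    # Hypothetical helper: combine a well-formed surrogate pair into a
    # scalar value (the UTF-16 decoding step), then UTF-8-encode the
    # resulting supplementary character.
    assert 0xD800 <= ord(hi) <= 0xDBFF, "expected a high surrogate"
    assert 0xDC00 <= ord(lo) <= 0xDFFF, "expected a low surrogate"
    scalar = 0x10000 + ((ord(hi) - 0xD800) << 10) + (ord(lo) - 0xDC00)
    return chr(scalar).encode('utf-8')

# <D800 DC01> would come out as the UTF-8 encoding of U+10001,
# which is exactly the four bytes counted above:
assert encode_pair_as_if_utf16('\ud800', '\udc01') == b'\xf0\x90\x80\x81'
```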

Sorry. I don't think what I said contributed to the conversation very well. Let me try again. On Tue, Oct 8, 2013 at 5:55 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Here's how I read the FAQ. Most of this FAQ is written in terms of converting one representation to another. Python strings are not one of those representations.

"A *Unicode transformation format* (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence." http://www.unicode.org/faq/utf_bom.html#gen2

To convert UTF-X to UTF-Y, you convert the UTF-X to a sequence of characters and then convert that to UTF-Y. Note that this excludes surrogate code points -- they are not representable in the sequence of code points that a UTF defines.

The definition of UTF-32 says: "Any Unicode character can be represented as a single 32-bit unit in UTF-32. This single 4-byte unit corresponds to the Unicode scalar value, which is the abstract number associated with a Unicode character." http://www.unicode.org/faq/utf_bom.html#utf32-1

Thus a surrogate code point is NOT allowed in UTF-32, as it is not a character, and if it is encountered it should be treated as an error.

--- Bruce

On 10/8/2013 8:55 PM, Steven D'Aprano wrote:
And I already explained on python-list why that reading is wrong: transcoding a UTF-16 string (a sequence of 2-byte words, subject to validity rules) is different from encoding Unicode text (a character sequence, and surrogates are not characters). A UTF-16 to UTF-8 transcoder should (must) do the above, but in 3.3+, the utf-8 codec is no longer the UTF-16 transcoder that it effectively was for narrow builds.

Each UTF form defines a one-to-one mapping between Unicode texts and valid code unit sequences. (Unicode Standard, Chapter 3, definition D79.) Having both '\U00010001' and '\ud800\udc01' map to b'\xf0\x90\x80\x81' would violate that important property. '\ud800\udc01' represents a character in UTF-16 but not in Python's flexible string representation. The latter uses one code unit (of variable size per string) per character, instead of a variable number of code units (of one size for all strings) per character.

Because machines have no conceptual, visual, or aural memory, but only byte memory, they must re-encode abstract characters to bytes to remember them. In pre-3.3 narrow builds, where UTF-16 was used internally, decoding and encoding amounted to transcoding byte encodings into the UTF-16 encoding, and vice versa. So utf-8 b'\xf0\x90\x80\x81' and utf-16 '\ud800\udc01' were mapped into each other. Whether the mapping was done directly or indirectly, via the character codepoint value, did not matter to the user.

In any case, the FSR no longer uses multiple-code-unit encodings internally, and '\ud800\udc01', even though allowed for practical reasons, does not represent and is not the same as '\U00010001'.

The proposed 'has_surrogates' flag amounts to a 'not strictly valid' flag. Only the FSR implementors can decide if it is worth the trouble.

-- Terry Jan Reedy
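The point that '\ud800\udc01' and '\U00010001' are distinct strings under the FSR is easy to check from the REPL:

```python
pair = '\ud800\udc01'   # two code points, stored as two code units
char = '\U00010001'     # one code point, stored as one code unit
assert len(pair) == 2 and len(char) == 1
assert pair != char

# Only the real character survives a UTF-8 round trip:
assert char.encode('utf-8').decode('utf-8') == char
```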

Steven D'Aprano writes:
According to PEP 383, which provides a special mechanism for round-tripping input that claims to be in a particular encoding but does not conform to it: when encoding to UTF-8, if the errors= parameter *is* surrogateescape *and* the value is in the first row of the low surrogate range, it is masked by 0xff and emitted as a single byte. In all other cases of surrogates, it should raise an error. A conforming Unicode codec must not emit UTF-8 which would decode to a surrogate. These cases can occur in valid Python programs because chr() is unconstrained (for example).

On input, Unicode conformance means that, when using the surrogateescape handler, an alleged UTF-8 stream containing a 6-byte sequence that would algorithmically decode to a surrogate pair should be represented internally as a sequence of 6 surrogates from the first row of the low surrogate range. If the surrogateescape handler is not in use, it should raise an error.

Sorry about not testing actual behavior; gotta run to a meeting. I forget what PEP 383 says about other Unicode codecs.
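The PEP 383 mechanism can be seen in a short round trip (surrogateescape maps each undecodable byte to a low surrogate in the U+DC80-U+DCFF row, and the low 8 bits of that surrogate are the original byte):

```python
raw = b'abc\xff'                           # not valid UTF-8
text = raw.decode('utf-8', 'surrogateescape')
assert text == 'abc\udcff'                 # byte 0xFF -> U+DCFF
assert ord(text[-1]) & 0xff == 0xff        # the original byte, masked out
assert text.encode('utf-8', 'surrogateescape') == raw   # round trip

# Without the handler, encoding the smuggled surrogate is an error:
try:
    text.encode('utf-8')
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```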

Masklinn writes:
No, it's written from the viewpoint that it says *nothing* about internal encodings, only about the encodings used in interchange of textual data, and about certain aspects of the processes that may receive and generate such data (eg, when data matches a Unicode regular expression, or how bidirectional text should appear visually).
Parsing the FAQ with that viewpoint, I believe a CPython string (unicode) must not contain surrogate codes:
No, it says no such thing. All the Unicode Standard (and the FAQ) says is that if Python generates output that purports to be text encoded in Unicode, it may not contain surrogate codes except where those codes are used according to UTF-16 to encode characters in planes 2 to 17, and if it receives data alleged to be Unicode in some transformation format, it must raise an error if it receives surrogates other than a correctly formed surrogate pair in text known to be encoded as UTF-16.

In fact (as I wrote before without proper citation), the internal encoding of Python has been extended by PEP 383 to use a subset of the surrogate space to represent undecodable bytes in an octet stream, when the error handler is set to "surrogateescape". Furthermore, there is nothing to stop a Python unicode from containing any code unit (including both surrogates and other non-characters like 0xFFFF). Checking of the rules you cite is done by codecs, at encoding and decoding time.

08.10.13 16:02, Steven D'Aprano написав(ла):
So... I'm not sure why this will be useful.
This is a bug. http://bugs.python.org/issue12892

Masklinn writes:
True, but Python doesn't actually use UCS2 or UCS4 internally. It uses UCS2 or UCS4 plus a row of codes from the surrogate area to represent undecodable bytes. This feature is optional (enabled by using the appropriate error= setting in the codec), but I don't suppose it's going to go away.

Le Tue, 08 Oct 2013 14:17:59 +0300, Serhiy Storchaka <storchaka@gmail.com> a écrit :
Not true for slicing (you can take a non-surrogates slice of a surrogates string). Other than that, this sounds reasonable to me, provided that the patch isn't too complex and the perf improvements are worth it. Regards Antoine.
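The state-propagation rules under discussion might look like the following sketch (hypothetical names; the real flag would live in bits of the C-level string object, not in Python code):

```python
# Tri-state flag for the proposed "has_surrogates" mark.
UNKNOWN, NO_SURROGATES, HAS_SURROGATES = range(3)

def concat_state(a, b):
    # Concatenation: surrogates in either operand survive in the result,
    # and certainty of absence requires certainty on both sides.
    if HAS_SURROGATES in (a, b):
        return HAS_SURROGATES
    if a == b == NO_SURROGATES:
        return NO_SURROGATES
    return UNKNOWN

def slice_state(parent):
    # Slicing: a slice of a surrogate-free string is surrogate-free, but
    # a slice of a surrogate-bearing string may have dropped them all,
    # so (per Antoine's point) the result must fall back to UNKNOWN.
    return NO_SURROGATES if parent == NO_SURROGATES else UNKNOWN

assert concat_state(NO_SURROGATES, HAS_SURROGATES) == HAS_SURROGATES
assert slice_state(HAS_SURROGATES) == UNKNOWN
```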

On 08.10.2013 13:17, Serhiy Storchaka wrote:
I guess you could use one bit from the kind structure for that:

    /* Character size:

       - PyUnicode_WCHAR_KIND (0):
         * character type = wchar_t (16 or 32 bits, depending on the platform)

       - PyUnicode_1BYTE_KIND (1):
         * character type = Py_UCS1 (8 bits, unsigned)
         * all characters are in the range U+0000-U+00FF (latin1)
         * if ascii is set, all characters are in the range U+0000-U+007F
           (ASCII), otherwise at least one character is in the range
           U+0080-U+00FF

       - PyUnicode_2BYTE_KIND (2):
         * character type = Py_UCS2 (16 bits, unsigned)
         * all characters are in the range U+0000-U+FFFF (BMP)
         * at least one character is in the range U+0100-U+FFFF

       - PyUnicode_4BYTE_KIND (4):
         * character type = Py_UCS4 (32 bits, unsigned)
         * all characters are in the range U+0000-U+10FFFF
         * at least one character is in the range U+10000-U+10FFFF
     */
    unsigned int kind:3;

For some reason, it allocates 3 bits, but only 2 bits are used. Then again, the state struct is an unsigned int, so there's still plenty of room for extra flags.

-- Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Oct 08 2013)
2013-10-14: PyCon DE 2013, Cologne, Germany ... 6 days to go ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On 08.10.2013 18:16, Serhiy Storchaka wrote:
Ok, then just add the flag to the end of the list... we'd still have at least 7 bits left on most platforms, IIRC.

PS: I guess this use of kind should be documented clearly somewhere. The unicodeobject.h file only hints at this, and for PyUnicode_WCHAR_KIND this interpretation cannot be used.

-- Marc-Andre Lemburg
eGenix.com Professional Python Services directly from the Source (#1, Oct 08 2013)

I like the idea. I prefer to add another flag (1 bit) instead of having a complex flag with 4 different values. Your idea looks specific to PEP 393, so I prefer to keep the flag private. Otherwise it would be hard for other implementations of Python to implement the function getting the flag value. Victor 2013/10/8 Serhiy Storchaka <storchaka@gmail.com>:

08.10.13 15:23, Victor Stinner написав(ла):
I like the idea. I prefer to add another flag (1 bit), instead of having a complex with 4 different values.
We need at least a 3-state value: yes, no, maybe. But combined with the is_ascii flag, we need only one additional bit. I don't think it should be more complex than that.
Yes, of course.
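The four combined states could be enumerated like this (a sketch with hypothetical names; the real implementation would pack them into two bits of the PyUnicode state struct, and would avoid the eager scan shown here by filling the flag in lazily):

```python
from enum import IntEnum

class StrState(IntEnum):
    # Hypothetical 2-bit encoding of the four states in the proposal.
    ASCII = 0               # ASCII-only: surrogates impossible
    NOT_ASCII_CLEAN = 1     # non-ASCII, known surrogate-free
    NOT_ASCII_SURR = 2      # non-ASCII, contains surrogates
    NOT_ASCII_UNKNOWN = 3   # non-ASCII, not yet scanned

def classify(s):
    # The full scan that the lazy flag is meant to avoid repeating.
    if all(ord(c) < 0x80 for c in s):
        return StrState.ASCII
    if any(0xD800 <= ord(c) <= 0xDFFF for c in s):
        return StrState.NOT_ASCII_SURR
    return StrState.NOT_ASCII_CLEAN
```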

2013/10/8 Serhiy Storchaka <storchaka@gmail.com>:
Knowing if a string contains any surrogate character would also speedup marshal and pickle modules: http://bugs.python.org/issue19219#msg199465 Victor
participants (12)
-
Antoine Pitrou
-
Bruce Leban
-
Greg Ewing
-
M.-A. Lemburg
-
Masklinn
-
random832@fastmail.us
-
Serhiy Storchaka
-
Stephen J. Turnbull
-
Stephen J. Turnbull
-
Steven D'Aprano
-
Terry Reedy
-
Victor Stinner