[Python-ideas] Add "has_surrogates" flags to string object

Tue Oct 8 15:02:08 CEST 2013

On Tue, Oct 08, 2013 at 01:58:20PM +0200, Masklinn wrote:
> 
> On 2013-10-08, at 13:43 , Serhiy Storchaka wrote:
> 
> > 08.10.13 14:38, Masklinn wrote:
> >> On 2013-10-08, at 13:17 , Serhiy Storchaka wrote:
> >> 
> >>> Here is an idea about adding a mark to the PyUnicode object which
> >>> allows a fast answer to the question of whether a string contains
> >>> a surrogate code. This mark has one of three possible states:
[...]
> >> Isn't that redundant with the kind under shortest form representation?
> > 
> > No, it isn't redundant. '\udc80' is a UCS2 string with a surrogate code, and '\udc80\U00010000' is a UCS4 string with a surrogate code.
> 
> I don't know the details of the flexible string representation, but I
> believed the names fit what was actually in memory. UCS2 does not
> have surrogate pairs, thus surrogate codes make no sense in UCS2,
> they're a UTF-16 concept. Likewise for UCS4. Surrogate codes are not
> codepoints, they have no reason to appear in either UCS2 or UCS4
> outside of encoding errors.

I welcome correction, but I think you're mistaken. Python 3.3 strings 
don't have surrogate *pairs*, but they can contain surrogate *code 
points*. Unicode states:

"Isolated surrogate code points have no interpretation; consequently, no 
character code charts or names lists are provided for this range."

http://www.unicode.org/charts/PDF/UDC00.pdf
http://www.unicode.org/charts/PDF/UD800.pdf

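So technically surrogates are not characters (strictly speaking, Unicode 
reserves the term "noncharacter" for a different set of code points). 
That doesn't mean they are forbidden though; you can certainly create 
them, and in Python 3.3 encode them to UTF-16 and -32: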

py> surr = '\udc80'
py> import unicodedata as ud
py> ud.category(surr)
'Cs'
py> surr.encode('utf-16')
b'\xff\xfe\x80\xdc'
py> surr.encode('utf-32')
b'\xff\xfe\x00\x00\x80\xdc\x00\x00'

However, you cannot encode single surrogates to UTF-8:

py> surr.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in 
position 0: surrogates not allowed

as per the standard:

http://www.unicode.org/faq/utf_bom.html#utf8-5
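
For what it's worth, here is a minimal sketch of the strict failure next 
to the 'surrogatepass' error handler, which lets lone surrogates through 
the UTF-8 codec. The two function names are made up for illustration, and 
the resulting bytes are not standard-conforming UTF-8, so treat this as a 
demonstration rather than a recommendation:

```python
def strict_utf8_reason(s):
    """Return the codec's error reason if strict UTF-8 encoding fails, else None."""
    try:
        s.encode('utf-8')
    except UnicodeEncodeError as exc:
        return exc.reason
    return None


def utf8_with_surrogates(s):
    """Encode to UTF-8, letting surrogate code points through unchanged."""
    # 'surrogatepass' is a standard error handler name, not something defined here.
    return s.encode('utf-8', 'surrogatepass')


surr = '\udc80'
print(strict_utf8_reason(surr))
print(utf8_with_surrogates(surr))
```

Decoding those bytes needs the same handler: 
raw.decode('utf-8', 'surrogatepass').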

I *think* you are supposed to be able to encode surrogate *pairs* to 
UTF-8, if I'm reading the FAQ correctly, but it seems Python 3.3 doesn't 
support that. In any case, it is certainly legal to have Unicode strings 
containing surrogate code points, and you can encode them to UTF-16 
and -32.

However, it looks like surrogates won't round trip in UTF-16, but they 
will in UTF-32:

py> surr.encode('utf-16').decode('utf-16') == surr
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 2-3: 
unexpected end of data
py> surr.encode('utf-32').decode('utf-32') == surr
True
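
A sketch of the same round-trip check as a reusable helper (round_trips is 
a hypothetical name, not an existing function). Any codec error counts as 
a failed round trip, because interpreter versions other than 3.3 may 
reject the lone surrogate at the encode step rather than the decode step:

```python
def round_trips(s, encoding, errors='strict'):
    """Return True if s survives encode followed by decode unchanged."""
    try:
        return s.encode(encoding, errors).decode(encoding, errors) == s
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return False


for enc in ('utf-8', 'utf-16', 'utf-32'):
    print(enc, round_trips('\udc80', enc))
```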

So... I'm not sure why this will be useful. Presumably Unicode strings 
containing surrogate code points will be rare, and you can't encode them 
to UTF-8 at all, and you can't round trip them through UTF-16.
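
For comparison, without a cached flag the question can only be answered 
by scanning the string. A minimal pure-Python sketch (has_surrogates here 
is a hypothetical helper named after the subject line, not a real str 
method or C API function):

```python
def has_surrogates(s):
    """Return True if s contains any code point in U+D800..U+DFFF."""
    # Linear scan: O(len(s)) on every call, the cost a cached flag would avoid.
    return any('\ud800' <= ch <= '\udfff' for ch in s)


print(has_surrogates('\udc80\U00010000'))
print(has_surrogates('spam \U00010000'))
```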

-- 
Steven