[Python-ideas] Processing surrogates in

Steven D'Aprano steve at pearwood.info
Thu May 14 15:18:33 CEST 2015


On Thu, May 14, 2015 at 05:21:10AM -0700, Andrew Barnert via Python-ideas wrote:
> On May 14, 2015, at 03:15, Serhiy Storchaka <storchaka at gmail.com> wrote:

[...]
> > There is also the UTF-7 encoding that allows surrogates.
> 
> Encoding to UTF-7 requires first encoding to UTF-16 and then doing the 
> modified-base-64 thing. And decoding from UTF-7 requires reversing 
> both those steps. There's no way surrogates can escape into Unicode 
> from that. I suppose you could, instead of decoding from UTF-7, just 
> do the base 64 decode and then skip the UTF-16 decode and instead just 
> widen the code units, but that's not a valid thing to do, and I can't 
> see why anyone would do it.

I don't see how UTF-7 could include surrogates. It's a 7-bit encoding, 
which means it can only include bytes \x00 through \x7F, i.e. ASCII 
compatible. 

http://unicode.org/glossary/#UTF_7

For example, this passes:

for i in range(0x110000):
    c = chr(i)              # every code point, including lone surrogates
    b = c.encode('utf-7')
    m = max(b)              # largest byte value in the encoded output
    assert m <= 127         # i.e. the output is pure ASCII

so where are the surrogates coming from?



> > And yet one source of surrogates -- Python sources. eval(), etc.
> 
> If I type '\uD834\uDD1E' in Python 3.4 source, am I actually going to 
> get an illegal Unicode string made of 2 surrogate code points instead 
> of either an error or the single-character string '\U0001D11E'?

I certainly hope so :-)
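
That is, in fact, what CPython 3 does today; a quick check:

s = '\uD834\uDD1E'
assert len(s) == 2              # two lone surrogate code points
assert s != '\U0001D11E'        # the two escapes are not combined
assert len('\U0001D11E') == 1   # the \U escape gives a single character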

I think that we should understand Unicode strings as sequences of code 
points from U+0000 to U+10FFFF inclusive. I don't think we should try 
to enforce a rule that all Python strings are surrogate-free. That 
would make it awfully inconvenient to process the whole Unicode 
character set at once, like I did above. I'd need to write:

# skip the surrogate range U+D800 through U+DFFF
for i in list(range(0xD800)) + list(range(0xE000, 0x110000)):
    ...

instead, or catch the exception in chr(i), or something equally 
annoying.

The cost of that simplicity is that when you go to encode to bytes, you 
might get an exception. I think so long as we have tools for dealing 
with that (e.g. str->str transformations to remove or replace 
surrogates) that's a fair trade-off.
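
A minimal sketch of the sort of str -> str clean-up I have in mind 
(remove_surrogates is just a hypothetical helper, not an existing API):

def remove_surrogates(s, replacement='\N{REPLACEMENT CHARACTER}'):
    # replace each lone surrogate code point (U+D800..U+DFFF)
    return ''.join(
        replacement if '\ud800' <= c <= '\udfff' else c for c in s
    )

s = 'spam\udc80eggs'
try:
    data = s.encode('utf-8')   # raises UnicodeEncodeError on the surrogate
except UnicodeEncodeError:
    data = remove_surrogates(s).encode('utf-8')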

Another possibility would be to introduce a separate type, 
strict_unicode, which does enforce the rule that there are no surrogates 
in [strict unicode] strings. But having two unicode string types might 
be overkill/confusing. I think it might be better to have an is_strict() 
or is_surrogate() method that reports whether the string contains surrogates, 
and let the user remove or replace them as needed.
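
Such a method could amount to little more than this (a plain-function 
sketch of the check; the method itself is only a suggestion):

def has_surrogates(s):
    # True if any code point falls in the surrogate range U+D800..U+DFFF
    return any('\ud800' <= c <= '\udfff' for c in s)

assert has_surrogates('\uD834\uDD1E')
assert not has_surrogates('\U0001D11E')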

> If so, again, I think that's a bug that needs to be fixed, not worked 
> around. There's no legitimate reason for any source code to expect 
> that to be an illegal length-2 string.

Well, there's backwards compatibility.

There's also testing:

import unicodedata
assert unicodedata.category('\uD800') == 'Cs'

I'm sure there are others.



-- 
Steve

