[Python-ideas] Processing surrogates in
Steven D'Aprano
steve at pearwood.info
Thu May 14 15:18:33 CEST 2015
On Thu, May 14, 2015 at 05:21:10AM -0700, Andrew Barnert via Python-ideas wrote:
> On May 14, 2015, at 03:15, Serhiy Storchaka <storchaka at gmail.com> wrote:
[...]
> > There is also the UTF-7 encoding that allows surrogates.
>
> Encoding to UTF-7 requires first encoding to UTF-16 and then doing the
> modified-base-64 thing. And decoding from UTF-7 requires reversing
> both those steps. There's no way surrogates can escape into Unicode
> from that. I suppose you could, instead of decoding from UTF-7, just
> do the base 64 decode and then skip the UTF-16 decode and instead just
> widen the code units, but that's not a valid thing to do, and I can't
> see why anyone would do it.
I don't see how UTF-7 could include surrogates. It's a 7-bit encoding,
which means it can only include bytes \x00 through \x7F, i.e. it is
ASCII-compatible.
http://unicode.org/glossary/#UTF_7
For example, this passes:
    for i in range(0x110000):
        c = chr(i)
        b = c.encode('utf-7')
        m = max(b)
        assert m <= 127
so where are the surrogates coming from?
> > And yet one source of surrogates -- Python sources. eval(), etc.
>
> If I type '\uD834\uDD1E' in Python 3.4 source, am I actually going to
> get an illegal Unicode string made of 2 surrogate code points instead
> of either an error or the single-character string '\U0001D11E'?
I certainly hope so :-)
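A quick check of what such a literal gives, as a sketch only: the
variable names are mine, and the 'surrogatepass' round trip is just one
way I know of to join the pair, not something proposed in this thread.

```python
# Two escaped surrogates in a literal stay two separate code points.
pair = '\uD834\uDD1E'
assert len(pair) == 2
assert pair != '\U0001D11E'

# One way to combine them: write the two code units out as UTF-16 using
# the 'surrogatepass' error handler, then decode the bytes normally.
joined = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
assert len(joined) == 1
assert joined == '\U0001D11E'
```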
I think that we should understand Unicode strings as sequences of code
points from U+0000 to U+10FFFF inclusive. I don't think we should try to
enforce a rule that all Python strings are surrogate-free. That would
make it awfully inconvenient to process the whole Unicode character set
at once, like I did above. I'd need to write:
    for i in list(range(0xD800)) + list(range(0xE000, 0x110000)):
        ...
instead, or catch the exception in chr(i), or something equally
annoying.
The cost of that simplicity is that when you go to encode to bytes, you
might get an exception. I think so long as we have tools for dealing
with that (e.g. str->str transformations to remove or replace
surrogates) that's a fair trade-off.
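To make that trade-off concrete, here is a minimal sketch of the kind
of str->str transformation I mean. The name replace_surrogates and its
default replacement character are hypothetical, not an existing API.

```python
def replace_surrogates(s, replacement='\uFFFD'):
    """Return s with each surrogate code point swapped for replacement.

    Hypothetical helper; pass replacement='' to remove surrogates.
    """
    return ''.join(
        replacement if 0xD800 <= ord(c) <= 0xDFFF else c
        for c in s
    )

text = 'abc\uD800def'

# Encoding a string that holds a lone surrogate raises by default.
try:
    text.encode('utf-8')
except UnicodeEncodeError:
    encodable = False
else:
    encodable = True

# After the transformation the string encodes without trouble.
cleaned = replace_surrogates(text)
data = cleaned.encode('utf-8')
```

Whether to replace or remove is the caller's decision, which is why the
replacement is a parameter here.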
Another possibility would be to introduce a separate type,
strict_unicode, which does enforce the rule that there are no surrogates
in [strict unicode] strings. But having two unicode string types might
be overkill/confusing. I think it might be better to have an is_strict()
or is_surrogate() method that reports if the string contains surrogates,
and let the user remove or replace them as needed.
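As a rough sketch of such a check, written as a plain function since
str has no such method today (the name has_surrogates is made up for
illustration):

```python
def has_surrogates(s):
    """Return True if s contains any code point in U+D800..U+DFFF.

    Hypothetical helper, not an existing str method.
    """
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

# A lone surrogate is reported; a real astral character is not, since
# it is a single code point in the string.
lone = has_surrogates('abc\uD800')
astral = has_surrogates('\U0001D11E')
```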
> If so, again, I think that's a bug that needs to be fixed, not worked
> around. There's no legitimate reason for any source code to expect
> that to be an illegal length-2 string.
Well, there's backwards compatibility.
There's also testing:
    assert unicodedata.category('\uD800') == 'Cs'
I'm sure there are others.
--
Steve