[Python-ideas] Add "has_surrogates" flags to string object
Steven D'Aprano
steve at pearwood.info
Tue Oct 8 16:20:09 CEST 2013
On Tue, Oct 08, 2013 at 03:48:18PM +0200, Masklinn wrote:
> On 2013-10-08, at 15:02 , Steven D'Aprano wrote:
> > py> surr.encode('utf-8')
> > Traceback (most recent call last):
> > File "<stdin>", line 1, in <module>
> > UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
> > position 0: surrogates not allowed
> >
> > as per the standard:
> >
> > http://www.unicode.org/faq/utf_bom.html#utf8-5
> >
> > I *think* you are supposed to be able to encode surrogate *pairs* to
> > UTF-8, if I'm reading the FAQ correctly
>
> I'm reading the opposite, from http://www.unicode.org/faq/utf_bom.html#utf8-4:
>
> > there is a widespread practice of generating pairs of three byte
> > sequences in older software, especially software which pre-dates the
> > introduction of UTF-16 or that is interoperating with UTF-16
> > environments under particular constraints. Such an encoding is not
> > conformant to UTF-8 as defined.
>
> Pairs of 3-byte sequences would be encoding each surrogate directly to
> UTF-8, whereas a single 4-byte sequence would be decoding the surrogate
> pair to a codepoint and encoding that codepoint to UTF-8. My reading
> of the FAQ makes the second interpretation the only valid one.
It's not that clear to me. I fear the Unicode FAQs don't distinguish
between Unicode strings and bytes well enough for my liking :(
But for the record, my interpretation is that if you have a pair of code
points consisting of the same values as a valid surrogate pair, you
should be able to encode to UTF-8. To give a concrete example:
Given:
c = '\N{LINEAR B SYLLABLE B038 E}' # \U00010001
c.encode('utf-8')
=> b'\xf0\x90\x80\x81'
and:
c.encode('utf-16BE') # encodes as a surrogate pair
=> b'\xd8\x00\xdc\x01'
then those same surrogates, taken as codepoints, should be encodable as
UTF-8:
'\ud800\udc01'.encode('utf-8')
=> b'\xf0\x90\x80\x81'
I'd actually be disappointed if that were the case; I think that would
be a poor design. But if that's what the Unicode standard demands,
Python ought to support it.
But hopefully somebody will explain to me why my interpretation is wrong
:-)
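For comparison, here is a minimal sketch of what CPython does with that
pair of code points. It assumes Python 3.4 or later, where the UTF-16 codec
accepts the 'surrogatepass' error handler. The variable names are only
illustrative.

```python
# The surrogate pair from the example above, held as two code points.
pair = '\ud800\udc01'

# The strict UTF-8 codec rejects surrogate code points outright.
try:
    pair.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'surrogatepass' encodes each surrogate separately, giving a pair of
# 3-byte sequences: the form the FAQ calls non-conformant.
six_bytes = pair.encode('utf-8', 'surrogatepass')

# Round-tripping through UTF-16 merges the pair into a single code point,
# which then encodes to the 4-byte form.
merged = pair.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')
four_bytes = merged.encode('utf-8')

print(strict_ok, six_bytes, ascii(merged), four_bytes)
```

So getting the 4-byte form requires combining the surrogates into one
code point first; the UTF-8 codec will not do that implicitly.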
[...]
> The FAQ reads a bit strangely, I think because it's written from the
> viewpoint that the "internal encoding" will be UTF-16, and UTF-8 and
> UTF-32 are transcoding from that. Which does not apply to CPython and
> the FSR.
Hmmm... well, that might explain it. If it's written by Java programmers
for Java programmers, they may very well decide that having spent 20
years trying to convince people that string != ASCII, they're now
going to convince them that string == UTF-16 instead :/
> Parsing the FAQ with that viewpoint, I believe a CPython string (unicode)
> must not contain surrogate codes: a surrogate pair should have been
> decoded from UTF-16 to a codepoint (then identity-encoded to UCS4) and a
> single surrogate should have been caught by the UTF-16 decoder and
> should have triggered the error handler at that point. A surrogate code
> in a CPython string means the string is corrupted[1].
I think that interpretation is a bit strong. I think it would be fair to
say that CPython strings may contain surrogates, but you can't encode
them to bytes using the UTFs. Nor are there any byte sequences that can
be decoded to surrogates using the UTFs.
This essentially means that you can only get surrogates in a string
using (e.g.) chr() or \u escapes, and you can't then encode them to
bytes using UTF encodings.
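A short sketch of that behaviour, assuming Python 3.4 or later and the
default 'strict' error handler (non-default handlers such as
'surrogateescape' are another route to surrogates in a string):

```python
# A lone surrogate can be created with chr() or a \u escape.
lone = chr(0xDC80)
same = '\udc80'

# None of the UTF codecs will encode it under the strict handler.
rejected = []
for codec in ('utf-8', 'utf-16', 'utf-32'):
    try:
        lone.encode(codec)
    except UnicodeEncodeError:
        rejected.append(codec)

# The bytes a naive 3-byte encoding of U+DC80 would produce are
# refused by the strict UTF-8 decoder.
try:
    b'\xed\xb2\x80'.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# With 'surrogateescape', an undecodable byte becomes a lone surrogate.
escaped = b'\x80'.decode('utf-8', 'surrogateescape')

print(lone == same, rejected, decoded_ok, ascii(escaped))
```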
> Surrogates *may* appear in binary data, while building a UTF-16
> bytestream by hand.
But there you're talking about bytes, not text strings. Byte strings can
contain any bytes you like :-)
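To illustrate, a small sketch (the variable names are hypothetical): a
bytes object built by hand can hold the two bytes of a lone surrogate
code unit, but the strict UTF-16 decoder will not turn it into a string.

```python
import struct

# Pack a lone low-surrogate code unit as big-endian UTF-16 by hand.
raw = struct.pack('>H', 0xDC80)

# The bytes object holds it without complaint...
holds_it = (raw == b'\xdc\x80')

# ...but decoding it as UTF-16 fails under the strict handler.
try:
    raw.decode('utf-16-be')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(holds_it, decoded_ok)
```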
--
Steven