[I18n-sig] How does Python Unicode treat surrogates?
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Sat, 23 Jun 2001 14:20:38 +0200
> About surrogate support in Python: the UTF-8 codec has full
> surrogate support for encodings and decoding
I think there are a number of bugs lying around here. For example,
shouldn't
>>> u" \ud800 ".encode("utf-8")
' \xa0\x80 '
give an error, since this is a lone low surrogate word?
Likewise, but somewhat more troubling, surrogates that straddle write
invocations are not processed properly.
>>> s=StringIO.StringIO()
>>> _,_,r,w=codecs.lookup("utf-8")
>>> f=w(s)
>>> f.write(u"\ud800")
>>> f.write(u"\udc00")
>>> f.flush()
>>> s.getvalue()
'\xa0\x80\xed\xb0\x80'
whereas the correct answer would have been
>>> u"\ud800\udc00".encode("utf-8")
'\xf0\x90\x80\x80'
Regards,
Martin