[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sun, 24 Jun 2001 00:19:22 +0200


> > Likewise, but somewhat more troubling, surrogates that straddle write
> > invocations are not processed properly.
> > 
> > >>> s=StringIO.StringIO()
> > >>> _,_,r,w=codecs.lookup("utf-8")
> > >>> f=w(s)
> > >>> f.write(u"\ud800")
> > >>> f.write(u"\udc00")
> > >>> f.flush()
> > >>> s.getvalue()
> > '\xa0\x80\xed\xb0\x80'
> > 
> > whereas the correct answer would have been
> > 
> > >>> u"\ud800\udc00".encode("utf-8")
> > '\xf0\x90\x80\x80'
> 
> This is a special case of the above (since the encoder will
> see truncated surrogates and should raise an exception
> for these).

I don't think it should; the input is not truncated, since a later write
call will provide the missing word. If you have a Unicode stream, it should
be possible to read the stream contents in arbitrary chunks of words
and encode them with a stream encoder.

The stream encoder should produce the same output no matter how you
split the input. Under your proposed behaviour, this is not the case.
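
To illustrate the split-independence property, here is a minimal sketch of a
writer that holds back a trailing high surrogate until the next write call.
It is not the codecs module's implementation: the class name
PairBufferingUTF8Writer and its finish() method are hypothetical, the sketch
assumes a current Python 3 (where strings may contain lone surrogate code
points) rather than the interpreter used in the session quoted above, and
raising at end of input is only one possible policy.

```python
import io


class PairBufferingUTF8Writer:
    """Hypothetical stream encoder that is insensitive to chunk boundaries."""

    def __init__(self, stream):
        self.stream = stream
        self.pending = ""  # high surrogate still waiting for its low half

    def write(self, text):
        text = self.pending + text
        self.pending = ""
        # Hold back a trailing high surrogate; a later write may complete it.
        if text and "\ud800" <= text[-1] <= "\udbff":
            text, self.pending = text[:-1], text[-1]
        # A UTF-16 round trip joins each surrogate pair into one code point.
        raw = text.encode("utf-16-le", "surrogatepass")
        joined = raw.decode("utf-16-le", "surrogatepass")
        self.stream.write(joined.encode("utf-8", "surrogatepass"))

    def finish(self):
        # Assumption: only at end of input is a dangling high surrogate
        # really truncated, so this is where an error would be reported.
        if self.pending:
            raise ValueError("stream ended inside a surrogate pair")


def encode_in_chunks(chunks):
    """Feed the chunks to a fresh writer and return the bytes produced."""
    sink = io.BytesIO()
    writer = PairBufferingUTF8Writer(sink)
    for chunk in chunks:
        writer.write(chunk)
    writer.finish()
    return sink.getvalue()
```

With this sketch, encode_in_chunks(["\ud800", "\udc00"]) and
encode_in_chunks(["\ud800\udc00"]) should return the same four bytes, which
is the behaviour argued for above.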

Please note that

http://sourceforge.net/tracker/index.php?func=detail&aid=433882&group_id=5470&atid=105470

adds a few other aspects to the problem: It appears that Unicode 3.1
specifies that certain forms of UTF-8 encoded surrogates are merely
irregular, not illegal. There may be some misinterpretation of the
spec in this report, but I think all this needs careful checking.

Regards,
Martin