[I18n-sig] Unicode surrogates: just say no!

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 22:53:11 +0200


> That's a separate question.  On wide interpreters, surrogate pairs
> "shouldn't" exist if the app plays by the rules.  But they're easily
> created of course!  What should ord(u'\uD800\uDC00') mean on a wide
> interpreter?  I think it's nice if you support this.  Of course, if a
> length-two Unicode string is anything else than a high surrogate
> followed by a low surrogate, ord() should be illegal.

But then, you get unichr(ord(u'\uD800\uDC00')) <> u'\uD800\uDC00'.
Is that acceptable?
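
A minimal sketch of that round-trip mismatch, assuming present-day Python 3, whose str type behaves like a wide build here. ord_pair is a hypothetical helper that stands in for the proposed pair-accepting ord(); it is not an existing function.

```python
def ord_pair(s):
    """Hypothetical ord() that also accepts a high+low surrogate pair."""
    if len(s) == 1:
        return ord(s)
    if (len(s) == 2
            and 0xD800 <= ord(s[0]) <= 0xDBFF      # high surrogate
            and 0xDC00 <= ord(s[1]) <= 0xDFFF):    # low surrogate
        high = ord(s[0]) - 0xD800
        low = ord(s[1]) - 0xDC00
        return 0x10000 + (high << 10) + low
    raise TypeError("expected a character or a surrogate pair")

pair = '\ud800\udc00'          # two code points, both surrogates
code_point = ord_pair(pair)    # combined into one scalar value
round_trip = chr(code_point)   # a single code point, not the original pair

# The round trip does not give back the input string.
mismatch = (round_trip != pair)
```

With these definitions, mismatch is true because round_trip has length one while pair has length two.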

I'd prefer ord() not to work on surrogate pairs. That means code may
behave differently on narrow and wide interpreters, but that is no
surprise: len(u'\U00102030') already varies depending on the width of
the Unicode type.
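
As a rough illustration of that length difference, again assuming Python 3: len() on the string gives the wide-build answer, and counting UTF-16 code units approximates what a narrow build would report. The variable names are only for this example.

```python
s = '\U00102030'

# Wide view: one code point per character outside the BMP.
wide_len = len(s)

# Narrow view (approximation): count 16-bit code units, so the
# character is stored as a high surrogate plus a low surrogate.
narrow_len = len(s.encode('utf-16-le')) // 2
```

Here wide_len is 1 and narrow_len is 2.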

Regards,
Martin