[I18n-sig] Unicode surrogates: just say no!

Guido van Rossum guido@digicool.com
Tue, 26 Jun 2001 15:39:16 -0400


> guido wrote:
> 
> > - with 16-bit (narrow) Py_UNICODE:
> > 
> >   - unichr(i) for 0 <= i <= 0xffff always returns a size-one string
> >     where ord(u[0]) == i
> > 
> >   - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> >     and \U) generates a surrogate pair, where u[0] is the high
> >     surrogate value and u[1] the low surrogate value
> > 
> >   - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
> >     raises an exception at Python-to-bytecode compile-time
> 
> or in other words:
> 
> >>> unichr.__doc__
> 'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with
> ordinal i; 0 <= i < 1114112.'

I would write 0 <= i <= 0x10ffff, but otherwise, yes.  Check it in
already!
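
The narrow-build rule in the quoted list can be sketched roughly as
follows.  This is only an illustration written for a current Python 3;
narrow_unichr is a hypothetical helper name (not part of the Python
API), and it returns the 16-bit code units as a list of integers rather
than a real Unicode string.

```python
def narrow_unichr(i):
    """Return the 16-bit code units a narrow build would hold for ordinal i.

    Hypothetical helper for illustration only; not a Python built-in.
    """
    if not 0 <= i <= 0x10FFFF:
        # Out-of-range ordinals are rejected, as in the quoted proposal.
        raise ValueError("ordinal not in range(0x110000)")
    if i <= 0xFFFF:
        # Fits in one code unit: a size-one result with value i.
        return [i]
    # Otherwise split the 20-bit offset into a surrogate pair.
    offset = i - 0x10000
    high = 0xD800 + (offset >> 10)    # high surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # low surrogate: bottom 10 bits
    return [high, low]
```

With this sketch, narrow_unichr(0x10000) gives [0xD800, 0xDC00] and
narrow_unichr(0x10FFFF) gives [0xDBFF, 0xDFFF], high surrogate first.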

> note that unichr raises a ValueError, not a UnicodeError.  should this
> be changed?

I think not.  The input value is wrong; that's a ValueError.  There
are lots of ValueErrors in the Unicode implementation.  There are lots
of UnicodeErrors too; the distinction isn't always clear.  MAL?
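
The relationship between the two exception types can be checked
directly.  A small sketch, assuming a current Python 3 interpreter,
where chr() plays the role unichr() plays in this thread; the variable
names are made up for the example.

```python
# chr() is the Python 3 counterpart of unichr() (assumption: Python 3).
try:
    chr(0x110000)            # one past the largest valid ordinal
except ValueError as exc:
    caught = type(exc)       # the exact exception class that was raised

# UnicodeError is itself a subclass of ValueError, so code that catches
# ValueError handles both kinds of failure.
unicode_error_is_value_error = issubclass(UnicodeError, ValueError)
```

Here caught is ValueError itself, not UnicodeError, and the issubclass
check is true.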

--Guido van Rossum (home page: http://www.python.org/~guido/)