[I18n-sig] Re: Unicode surrogates: just say no!

Daniel Biddle deltab@osian.net
Mon, 2 Jul 2001 21:21:49 +0000


On Mon, Jul 02, 2001 at 03:05:13PM -0400, François Pinard wrote:
> [Guido van Rossum]
> 
> > When using UCS-4 mode, I was in favor of allowing unichr() and \U to
> > specify any value in range(0x100000000L) 
> 
> I did not check recently, but would think Unicode and 10646 are defined
> on 31 bits, not 32.  If you represent an UCS-4 code within a 32 bit int,
> it will never be negative.  It might be useful to rely on this.

Certainly ISO 10646 is defined as 31-bit. Unicode was originally 16-bit, but
its code space now runs from U+0000 to U+10FFFF, which takes just under
20.09 bits (log2 of 0x110000).
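
For anyone who wants to check the figure, here is a quick Python sketch. It
assumes only that the code space is the 0x110000 values from U+0000 through
U+10FFFF; the variable names are mine.

```python
import math

# Number of code points from U+0000 through U+10FFFF inclusive.
UNICODE_CODE_POINTS = 0x110000

# Fractional number of bits needed to distinguish that many values.
bits_needed = math.log2(UNICODE_CODE_POINTS)
print(bits_needed)
```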

> P.S. - Would not 32 bits also require one more byte in UTF-8?

Yes:

     bits  1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
  control     7       2         2         2         2         2        = 17
     data         1       6         6         6         6         6    = 31

UTF-8 allows at most 6 bytes, which can encode 31 bits, so a full 32-bit
value would need a seventh byte, which UTF-8 does not define.
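
The layout above can be sketched in Python roughly as follows. This is only
an illustration of the original 1-to-6-byte scheme: utf8_encode_31bit and
_MULTIBYTE_FORMS are hypothetical names of my own, not library calls, and the
sketch does no validation beyond the 31-bit range check (it does not reject
surrogates, for instance).

```python
# Sketch of the original UTF-8 scheme (1 to 6 bytes, up to 31 data bits).
# utf8_encode_31bit is a hypothetical helper, not a standard-library call.

# (exclusive upper bound, lead-byte marker) for 2- to 6-byte sequences
_MULTIBYTE_FORMS = [
    (0x800, 0xC0),        # 110xxxxx: 11 data bits in total
    (0x10000, 0xE0),      # 1110xxxx: 16 data bits in total
    (0x200000, 0xF0),     # 11110xxx: 21 data bits in total
    (0x4000000, 0xF8),    # 111110xx: 26 data bits in total
    (0x80000000, 0xFC),   # 1111110x: 31 data bits in total
]


def utf8_encode_31bit(value):
    if not 0 <= value < 2**31:
        raise ValueError("value does not fit in 31 bits")
    if value < 0x80:
        return bytes([value])  # single byte, 0xxxxxxx
    # Pick the shortest form whose data bits can hold the value.
    for length, (limit, lead) in enumerate(_MULTIBYTE_FORMS, start=2):
        if value < limit:
            break
    # Peel off 6 bits per continuation byte, lowest bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever is left fits in the data bits of the lead byte.
    return bytes([lead | value] + tail[::-1])


print(utf8_encode_31bit(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

The largest 31-bit value comes out as FD BF BF BF BF BF: one data bit in the
lead byte and six in each of the five continuation bytes.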

It's been proposed that UTF-8 and UTF-32 be limited to values up to U+10FFFF,
which is the limit of UTF-16.
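
As a sketch of where that limit comes from, assuming the usual surrogate
ranges (lead units D800 to DBFF, trail units DC00 to DFFF); the variable
names are mine:

```python
# Each surrogate range holds 1024 values, so a pair addresses 1024 * 1024
# code points above the 16-bit range, starting at U+10000.
LEAD_SURROGATES = 0xDC00 - 0xD800    # 1024 lead values
TRAIL_SURROGATES = 0xE000 - 0xDC00   # 1024 trail values

utf16_max = 0x10000 + LEAD_SURROGATES * TRAIL_SURROGATES - 1
print(hex(utf16_max))  # 0x10ffff
```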

-- 
Daniel Biddle <deltab@osian.net>