[I18n-sig] Re: Unicode surrogates: just say no!
Daniel Biddle
deltab@osian.net
Mon, 2 Jul 2001 21:21:49 +0000
On Mon, Jul 02, 2001 at 03:05:13PM -0400, François Pinard wrote:
> [Guido van Rossum]
>
> > When using UCS-4 mode, I was in favor of allowing unichr() and \U to
> > specify any value in range(0x100000000L)
>
> I did not check recently, but would think Unicode and 10646 are defined
> on 31 bits, not 32. If you represent a UCS-4 code within a 32-bit int,
> it will never be negative. It might be useful to rely on this.
Certainly ISO 10646 is defined as 31-bit. Unicode was originally 16-bit, but
its code space now runs from U+0000 to U+10FFFF, which takes just under
20.09 bits (log2 of 0x110000).
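The "just under 20.09 bits" figure can be checked with a couple of lines.
This is only a sketch; the names UNICODE_CODE_POINTS and bits_needed are
made up for illustration.

```python
import math

# Code points U+0000..U+10FFFF inclusive: 17 planes of 65,536 each.
UNICODE_CODE_POINTS = 0x10FFFF + 1

# Bits of information needed to tell that many values apart.
bits_needed = math.log2(UNICODE_CODE_POINTS)

# A whole number of bits must round up, so 21 bits are needed in practice.
whole_bits = math.ceil(bits_needed)

print(bits_needed, whole_bits)
```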
> P.S. - Would not 32 bits also require one more byte in UTF-8?
Yes:
bits     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
control         7        2        2        2        2        2  = 17
data            1        6        6        6        6        6  = 31
UTF-8 allows at most 6 bytes, which can encode 31 bits.
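The same counting works for every sequence length in the original 1-to-6-byte
UTF-8 scheme. Below is a rough sketch, assuming that scheme; the function
names utf8_payload_bits and encode_utf8_original are hypothetical and not
part of any Python API.

```python
def utf8_payload_bits(length):
    """Data bits carried by a sequence of `length` bytes (1..6)."""
    if length == 1:
        return 7  # 0xxxxxxx
    if not 2 <= length <= 6:
        raise ValueError("original UTF-8 sequences are 1 to 6 bytes long")
    # Lead byte: `length` one-bits, then a zero, then data bits.
    lead_data_bits = 8 - (length + 1)
    # Each continuation byte is 10xxxxxx: 6 data bits.
    return lead_data_bits + 6 * (length - 1)


def encode_utf8_original(value):
    """Encode a 31-bit value using the original (up to 6-byte) scheme."""
    if not 0 <= value < (1 << 31):
        raise ValueError("value does not fit in 31 bits")
    if value < 0x80:
        return bytes([value])
    # Pick the shortest length whose payload can hold the value.
    for length in range(2, 7):
        if value < (1 << utf8_payload_bits(length)):
            break
    # Peel off 6 bits at a time, lowest first, for the continuation bytes.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # Lead marker: `length` one-bits followed by a zero bit.
    lead_marker = (0xFF << (8 - length)) & 0xFF
    return bytes([lead_marker | value] + tail[::-1])


for n in range(1, 7):
    print(n, "bytes:", utf8_payload_bits(n), "data bits")
```

For values within Unicode's range the sketch should agree with Python's own
encoder, while the largest 31-bit value takes the full six bytes.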
It's been proposed that UTF-8 and UTF-32 be limited to values up to U+10FFFF,
which is the limit of UTF-16.
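The U+10FFFF limit falls out of how UTF-16 surrogate pairs are built: 1024
high surrogates times 1024 low surrogates, stacked above the 16-bit range.
A small sketch of that arithmetic follows; the names are illustrative only.

```python
# Surrogate code units reserved for UTF-16 pairs.
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 values
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 values

# Each (high, low) pair addresses one code point above U+FFFF.
pair_count = len(HIGH_SURROGATES) * len(LOW_SURROGATES)  # 2**20

# Pairs start at U+10000, so the last reachable code point is:
utf16_max = 0x10000 + pair_count - 1


def decode_surrogate_pair(high, low):
    """Combine a high and a low surrogate into one code point."""
    if high not in HIGH_SURROGATES or low not in LOW_SURROGATES:
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(utf16_max))
```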
--
Daniel Biddle <deltab@osian.net>