[XML-SIG] Re: Issues with Unicode type

Mike Brown mike@skew.org
Thu, 26 Sep 2002 10:04:55 -0600 (MDT)


Lars Marius Garshol wrote:
> 
> * Martin v. Loewis
> | 
> | In addition, UTF-32 is a transfer form, UCS-4 is a code set. 
> 
> That's interesting. I wasn't aware of that distinction. I assume the
> same applies to UTF-16/UCS-2, then?

Sorta. UCS-4 is more than just a "code set" though. And IIRC there was some
debate over whether UTF-32 fit the definition of being a true UTF. If you're
going to get into that level of understanding, carefully read the following:

  http://www.unicode.org/unicode/reports/tr17/

and then reconcile its terminology and examples with this (from Unicode 3.0,
section 3.8):

  D29  A Unicode (or UCS) transformation format (UTF) transforms each
       Unicode scalar value into a sequence of code values. A UTF may
       also specify a byte order for the serialization of the code
       values into bytes. A UTF may also specify the use of a byte
       order mark.
  
and this (from Unicode 3.0 appendix C.2):

  ISO/IEC 10646 defines two alternative forms of encoding:

    -  A four-octet (32-bit) encoding containing 2^31 code positions.
       These code positions are conceptually divided into 128 groups
       of 256 planes, each plane containing 256 rows of 256 cells.

    -  A two-octet (16-bit) encoding consisting of plane zero, the
       Basic Multilingual Plane.

  The 32-bit form is referred to as UCS-4 (Universal Character Set
  coded in 4 octets) and the 16-bit form is referred to as UCS-2
  (Universal Character Set coded in 2 octets).
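
To make the two excerpts concrete, here is a minimal Python sketch, not taken
from either document. The helper names (utf16_code_values, utf32_code_values,
serialize, fits_ucs2) are made up for illustration. It assumes the input is a
valid scalar value (not a surrogate) and separates the two steps D29
describes: scalar value to code values, then code values to bytes with a
chosen byte order and optional byte order mark.

```python
import struct

def utf32_code_values(scalar):
    # UTF-32: exactly one 32-bit code value per scalar value.
    return [scalar]

def utf16_code_values(scalar):
    # UTF-16: one 16-bit code value inside the BMP, a surrogate pair above it.
    # Assumes scalar is a valid scalar value (no surrogate check here).
    if scalar < 0x10000:
        return [scalar]
    v = scalar - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

def serialize(code_values, width, big_endian=True, bom=False):
    # Serialize 16- or 32-bit code values into bytes, optionally
    # prefixed with U+FEFF as a byte order mark.
    fmt = (">" if big_endian else "<") + ("H" if width == 16 else "I")
    values = ([0xFEFF] if bom else []) + list(code_values)
    return b"".join(struct.pack(fmt, v) for v in values)

def fits_ucs2(scalar):
    # UCS-2 covers plane zero only, so anything above U+FFFF needs UCS-4.
    return scalar <= 0xFFFF

scalar = 0x1D11E  # MUSICAL SYMBOL G CLEF, outside the BMP
utf16 = utf16_code_values(scalar)

print([hex(v) for v in utf16])                          # ['0xd834', '0xdd1e']
print(serialize(utf16, 16).hex())                       # d834dd1e
print(serialize(utf16, 16, big_endian=False, bom=True).hex())  # fffe34d81edd
print(serialize(utf32_code_values(scalar), 32).hex())   # 0001d11e
print(fits_ucs2(0x00E9), fits_ucs2(scalar))             # True False

# The code-position count from appendix C.2: groups * planes * rows * cells.
print(128 * 256 * 256 * 256 == 2 ** 31)                 # True
```

The same code values come out as different byte sequences depending on byte
order and BOM, which is roughly the split between the encoding form and the
serialization that TR17 spends its time on.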

Have fun :)

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/