[XML-SIG] Re: Issues with Unicode type
Mike Brown
mike@skew.org
Thu, 26 Sep 2002 10:04:55 -0600 (MDT)
Lars Marius Garshol wrote:
>
> * Martin v. Loewis
> |
> | In addition, UTF-32 is a transfer form, UCS-4 is a code set.
>
> That's interesting. I wasn't aware of that distinction. I assume the
> same applies to UTF-16/UCS-2, then?
Sorta. UCS-4 is more than just a "code set", though. And IIRC there was some
debate over whether UTF-32 fits the definition of a true UTF. If you want to
understand it at that level, carefully read the following:
http://www.unicode.org/unicode/reports/tr17/
and then reconcile its terminology and examples with this (from Unicode 3.0,
section 3.8):
    D29 A Unicode (or UCS) transformation format (UTF) transforms each
        Unicode scalar value into a sequence of code values. A UTF may
        also specify a byte order for the serialization of the code
        values into bytes. A UTF may also specify the use of a byte
        order mark.
and this (from Unicode 3.0 appendix C.2):
    ISO/IEC 10646 defines two alternative forms of encoding:

    - A four-octet (32-bit) encoding containing 2^31 code positions.
      These code positions are conceptually divided into 128 groups
      of 256 planes, each plane containing 256 rows of 256 cells.
    - A two-octet (16-bit) encoding consisting of plane zero, the
      Basic Multilingual Plane.

    The 32-bit form is referred to as UCS-4 (Universal Character Set
    coded in 4 octets) and the 16-bit form is referred to as UCS-2
    (Universal Character Set coded in 2 octets).
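
If it helps to see the two quotes side by side, here is a rough Python sketch
of my own (not taken from either standard). The helper names ucs4_position,
in_ucs2 and code_values are made up for illustration, and it assumes Python 3
and its built-in "utf-8", "utf-16-be", "utf-16-le", "utf-16" and "utf-32-be"
codecs. The first two functions treat a number as a UCS-4 code position
(group, plane, row, cell); the third shows what D29 describes, i.e. different
UTFs turning the same scalar value into different sequences of code values.

```python
import codecs

def ucs4_position(code):
    """Split a UCS-4 code position into (group, plane, row, cell)."""
    if not 0 <= code < 2**31:
        raise ValueError("UCS-4 has 2^31 code positions")
    # 7 bits of group, then 8 bits each of plane, row and cell.
    return (code >> 24, (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF)

def in_ucs2(code):
    """True if the code position lies in group 0, plane 0 (the BMP)."""
    group, plane, _row, _cell = ucs4_position(code)
    return group == 0 and plane == 0

def code_values(char, utf, width):
    """Code values that the codec `utf` produces for `char`.

    `width` is the size of one code value in bytes; the codec is
    assumed to serialize big-endian without a byte order mark.
    """
    data = char.encode(utf)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]

ch = "\U00010400"  # a character outside the BMP

print(ucs4_position(ord(ch)))  # (0, 1, 4, 0)
print(in_ucs2(ord(ch)))        # False: not reachable as a UCS-2 code position

# Same scalar value, three transformation formats, three code value sequences.
print([hex(v) for v in code_values(ch, "utf-8", 1)])      # ['0xf0', '0x90', '0x90', '0x80']
print([hex(v) for v in code_values(ch, "utf-16-be", 2)])  # ['0xd801', '0xdc00']
print([hex(v) for v in code_values(ch, "utf-32-be", 4)])  # ['0x10400']

# Byte order only matters once code values are serialized into bytes.
print(ch.encode("utf-16-le").hex())  # 01d800dc

# The plain "utf-16" codec prepends a byte order mark in native order.
print(ch.encode("utf-16").startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))  # True
```

Note that the UTF-32 code value is numerically the same as the UCS-4 code
position, which is probably why the two names get used interchangeably.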
Have fun :)
- Mike
____________________________________________________________________________
mike j. brown | xml/xslt: http://skew.org/xml/
denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/