[Python-Dev] UCS2/UCS4 default
M.-A. Lemburg
mal at egenix.com
Thu Jul 3 21:16:03 CEST 2008
On 2008-07-03 19:35, Jeroen Ruigrok van der Werven wrote:
> -On [20080703 19:21], Adam Olsen (rhamph at gmail.com) wrote:
>> On Thu, Jul 3, 2008 at 7:57 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>>> Please remember that lone surrogate pair code points are perfectly
>>> valid Unicode code points, nevertheless. Just as a lone combining
>>> code point is valid on its own.
>> That is a big part of these problems. For all practical purposes, a
>> surrogate is like a UTF-8 code unit, and must be handled the same way,
>> so why the heck do they confuse everybody by saying "oh, it's a code
>> point too!"?
>
> Because surrogate code points are not Unicode scalar values, isolated UTF-16
> code units in the range 0xd800-0xdfff are ill-formed. (D91 from Unicode
> 5.0/5.1, section 3.9)
True. They are not valid UTF-16 code units, but a code unit is
just a storage byte representation of a Unicode transformation...
"""
Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The
Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and
32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)
"""
That's not the same thing as a code point which is an assignment
of a slot in the Unicode character set...
"""
Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10
in Section 3.4, Characters and Encoding.)
"""
Reference: http://www.unicode.org/glossary/
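
To make the code unit / code point distinction concrete, here is a minimal
sketch, assuming a current Python 3 interpreter. The helper name code_units
is made up for illustration; it counts how many code units a single code
point needs in each of the three encoding forms named above.

```python
def code_units(ch):
    """Count the code units needed to encode one code point in each form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# One code point each: U+00E9 (inside the BMP) and U+1D11E (outside the BMP).
bmp = code_units("\u00e9")
astral = code_units("\U0001D11E")

print(bmp)     # {'utf-8': 2, 'utf-16': 1, 'utf-32': 1}
print(astral)  # {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}
```

The code point is a single integer in both cases; only the number of code
units changes with the encoding form.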
Also see Chapter 3.4 (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G2212):
"""
Surrogate code points and noncharacters are considered assigned code points,
but not assigned characters.
"""
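
A small sketch of what that means in practice, assuming a current Python 3
interpreter (3.4 or later for the UTF-16 behaviour). The helper name
encodes_strictly is hypothetical. A lone surrogate is a code point with its
own general category, and a str can hold it, but the strict codecs refuse to
turn it into code units.

```python
import unicodedata

lone = "\ud800"  # a lone high surrogate, stored as a single code point

code_point = ord(lone)                  # 0xD800, inside the codespace
category = unicodedata.category(lone)   # "Cs" = Other, surrogate

def encodes_strictly(s, encoding):
    """Return True if s can be encoded with the default 'strict' handler."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(hex(code_point), category)
print(encodes_strictly(lone, "utf-8"))      # False: not a scalar value
print(encodes_strictly(lone, "utf-16-le"))  # False
# The 'surrogatepass' error handler writes the code unit anyway.
print(lone.encode("utf-16-le", "surrogatepass"))  # b'\x00\xd8'
```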
--
Marc-Andre Lemburg
eGenix.com