[Python-Dev] New Py_UNICODE doc

Sat May 7 20:41:31 CEST 2005

Shane Hathaway wrote:
> Martin v. Löwis wrote:
> 
>>Shane Hathaway wrote:
>>
>>
>>>I agree that UCS4 is needed.  There is a balancing act here; UTF-16 is
>>>widely used and takes less space, while UCS4 is easier to treat as an
>>>array of characters.  Maybe we can have both: unicode objects start with
>>>an internal representation in UTF-16, but get promoted automatically to
>>>UCS4 when you index or slice them.  The difference will not be visible
>>>to Python code.  A compile-time switch will not be necessary.  What do
>>>you think?
>>
>>
>>This breaks backwards compatibility with existing extension modules.
>>Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and
>>can use that to directly access the characters.
> 
> 
> Py_UNICODE would always be 32 bits wide.  PyUnicode_AsUnicode would
> cause the unicode object to be promoted automatically.  Extensions that
> break as a result are technically broken already, aren't they?  They're
> not supposed to depend on the size of Py_UNICODE.

-1.

You are free to compile Python with --enable-unicode=ucs4
if you prefer this setting.

I don't see any reason why we should force users to invest 4 bytes
of storage for each Unicode code point - 2 bytes work just fine
and can represent all Unicode characters that are currently
defined (using surrogates if necessary). As more and more
Unicode objects are used in a process, choosing UCS2 vs. UCS4
does make a huge difference in terms of used memory.
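
A rough sketch of the storage arithmetic, run on a current Python 3
interpreter (an assumption; it is not the build under discussion here).
The sample string and variable names are made up for illustration. It
encodes the same text with 2-byte and 4-byte code units and compares the
byte counts:

```python
# Hypothetical sample text consisting only of BMP characters.
text = "Hello, world"

# UTF-16 uses one 2-byte code unit per BMP code point.
utf16_size = len(text.encode("utf-16-le"))
# UTF-32 (UCS-4) uses one 4-byte code unit per code point.
utf32_size = len(text.encode("utf-32-le"))

# A code point outside the BMP needs a surrogate pair in UTF-16,
# so in that case both encodings take 4 bytes.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
clef_utf16_size = len(clef.encode("utf-16-le"))
clef_utf32_size = len(clef.encode("utf-32-le"))

print(utf16_size, utf32_size, clef_utf16_size, clef_utf32_size)
```

For text that stays within the BMP the 4-byte form is twice as large;
the two only cost the same for code points that need surrogates.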

All this talk about UTF-16 vs. UCS-2 is not very useful
and strikes me as purely academic.

The reference to possible breakage by slicing a Unicode object and
breaking a surrogate pair is valid, but the idea of UCS-4 being
less prone to breakage is a myth:

Unicode has many code points that are meant only for composition
and don't have any standalone meaning, e.g. a combining acute
accent (U+0301), yet they are perfectly valid code points -
regardless of UCS-2 or UCS-4. It is easily possible to break
such a combining sequence using slicing, so the most
often presented argument for using UCS-4 instead of UCS-2
(+ surrogates) is rather weak if seen by daylight.
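
To make the point concrete, here is a small sketch, assuming a current
Python 3 interpreter where strings are indexed by code point (so no
surrogates are involved at all). The variable names are only for
illustration. Slicing still splits the base letter from its combining
accent:

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points,
# displayed as a single accented character.
s = "e\u0301"

# Slicing by code point separates the base letter from the accent.
head = s[:1]
tail = s[1:]

# The second piece is a lone combining mark: a valid code point
# with no standalone meaning.
tail_name = unicodedata.name(tail)
tail_is_combining = unicodedata.combining(tail) != 0

print(len(s), repr(head), repr(tail), tail_name, tail_is_combining)
```

Both halves are valid strings, yet the second one is just a dangling
accent, which is the same kind of breakage as a split surrogate pair.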

Some may now say that combining sequences are not used
all that often. However, they play a central role in Unicode
normalization (http://www.unicode.org/reports/tr15/),
which is needed whenever you want to semantically
compare Unicode objects and are

-- 
Marc-Andre Lemburg
eGenix.com
