[Python-Dev] Divorcing str and unicode (no more implicit conversions).

M.-A. Lemburg mal at egenix.com
Mon Oct 24 10:40:28 CEST 2005

Neil Hodgson wrote:
> Guido van Rossum:
>>Folks, please focus on what Python 3000 should do.
>>I'm thinking about making all character strings Unicode (possibly with
>>different internal representations a la NSString in Apple's Objective
>>C) and introduce a separate mutable bytes array data type. But I could
>>use some validation or feedback on this idea from actual
>    I'd like to more tightly define Unicode strings for Python 3000.
> Currently, Unicode strings may be implemented with either 2 byte
> (UCS-2) or 4 byte (UTF-32) elements. Python should allow strings to
> contain any Unicode character and should be indexable yielding
> characters rather than half characters. Therefore Python strings
> should appear to be UTF-32. There could still be multiple
> implementations (using UTF-16 or UTF-8) to preserve space but all
> implementations should appear to be the same apart from speed and
> memory use.

There seems to be a general misunderstanding here: even if you
have UCS4 storage, it is still possible to slice a Unicode
string in a way that makes it impossible to render correctly.

Unicode has the concept of combining code points, e.g. you can
store an "é" (e with an acute accent) as "e" followed by
U+0301 (COMBINING ACUTE ACCENT). Now if you slice off the
accent, you'll break the character that you encoded using
combining code points.
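
To make the slicing problem concrete, here is a minimal sketch. It
assumes Python 3 syntax (where str is already Unicode) rather than the
u"..." literals of current Python 2, and the variable names are only
illustrative:

```python
import unicodedata

# "é" stored as a base letter plus COMBINING ACUTE ACCENT (U+0301)
decomposed = "e\u0301"

# The same character as a single precomposed code point (U+00E9)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # two code points
print(len(composed))    # one code point

# Slicing by code point index cuts the accent off the base letter,
# regardless of how wide the internal storage is.
broken = decomposed[:1]
print(broken == "e")

# A non-zero canonical combining class marks a combining code point.
print(unicodedata.combining(decomposed[1]) != 0)
```

The slice is perfectly valid at the code point level, yet the result
no longer represents the character that was stored.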

Note that combining code points are rather common in encodings
of Asian scripts, so this is not an artificial example.

Some time ago I proposed a new module called unicodeindex
to help with indexing. It would solve most of the indexing
issues you run into when dealing with Unicode. I've attached
it to this email for reference.
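
As a rough idea of what indexing by character rather than by code
point could look like, here is a sketch in Python 3 syntax. It is not
the API of the attached proposal; combining_clusters() is a
hypothetical helper, and it only handles code points with a non-zero
canonical combining class, which is much less than the full grapheme
cluster rules of the Unicode standard:

```python
import unicodedata

def combining_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration only; it ignores cases such as
    Hangul jamo sequences and marks with combining class 0.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Attach the combining mark to the preceding base character.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

word = "cafe\u0301s"            # "cafés" with a decomposed "é"
clusters = combining_clusters(word)

print(len(word))                # number of code points
print(len(clusters))            # number of base+mark clusters
print(clusters[3] == "e\u0301") # the accent stays with its base letter
```

Slicing the list of clusters and joining the result keeps each accent
together with its base letter.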

More on the terms used:


Marc-Andre Lemburg

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pep-unicodeindex.txt
Url: http://mail.python.org/pipermail/python-dev/attachments/20051024/dacea951/pep-unicodeindex.txt
