[Python-Dev] Divorcing str and unicode (no more implicit conversions).

M.-A. Lemburg mal at egenix.com
Tue Oct 25 10:38:14 CEST 2005


Neil Hodgson wrote:
> M.-A. Lemburg:
> 
> 
>>Unicode has the concept of combining code points, e.g. you can
>>store an "é" (e with an accent) as "e" + "'". Now if you slice
>>off the accent, you'll break the character that you encoded
>>using combining code points.
>>...
>>    next_<indextype>(u, index) -> integer
>>
>>        Returns the Unicode object index for the start of the next
>>        <indextype> found after u[index] or -1 in case no next element
>>        of this type exists.
> 
> 
>    Should entity breakage be further discouraged by returning a slice
> here rather than an object index?

You mean a slice that slices out the next <indextype>?

>    Something like:
> 
> i = first_grapheme(u)
> x = 0
> while x < width and u[i] != "\n":
>    x, _ = draw(u[i], (x, y))
>    i = next_grapheme(u, i)

This sounds a lot like you'd want iterators for the various
index types. Should be possible to implement on top of the
proposed APIs, e.g. itergraphemes(u), itercodepoints(u), etc.
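
A rough sketch of how such an iterator could sit on top of
next_grapheme(). This is only an illustration: it treats a
grapheme as a base code point plus any following combining marks
(as reported by unicodedata.combining()), which is much simpler
than full grapheme cluster segmentation, and both function bodies
are assumptions, not part of any existing API.

```python
import unicodedata


def next_grapheme(u, index):
    """Return the index where the next grapheme after u[index] starts, or -1.

    Simplifying assumption: a grapheme is one base code point followed
    by zero or more combining marks (nonzero canonical combining class).
    """
    i = index + 1
    # Skip the combining marks that belong to the grapheme at u[index].
    while i < len(u) and unicodedata.combining(u[i]) != 0:
        i += 1
    return i if i < len(u) else -1


def itergraphemes(u):
    """Yield successive grapheme slices of u, built on next_grapheme()."""
    i = 0 if u else -1
    while i != -1:
        j = next_grapheme(u, i)
        # The last grapheme runs to the end of the string.
        yield u[i:j] if j != -1 else u[i:]
        i = j


graphemes = list(itergraphemes("e\u0301a"))  # "e" + combining acute, then "a"
```

Because the iterator hands out whole slices, a drawing loop like
the one above never sees a bare accent or a bare base letter.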

Note that what most people refer to as a "character" is a
grapheme in Unicode speak. Given that interpretation,
"breaking" Unicode "characters" is something you can't
work around by using larger code units such as UCS4
compatible ones.
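
To illustrate, here is a small example, assuming an interpreter
where each string index addresses a whole code point. The slice
still cuts the accent off the letter. The variable names are
only for illustration.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme, two code points.
decomposed = "e\u0301"
code_point_count = len(decomposed)

# Slicing by code point index separates the base letter from its accent.
sliced = decomposed[:1]

# NFC normalization composes this particular pair into the single code point
# U+00E9, but not every combining sequence has a precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
```

Normalizing first helps for sequences like this one, but it is
not a general answer, since many combining sequences stay
multi-code-point graphemes even after normalization.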

Furthermore, you should also note that surrogates (two
code units encoding one code point) are part of Unicode
life. While you don't need them when storing Unicode
in UCS4 code units, they can still be part of the
Unicode data and the programmer has to be aware of
these.
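
A short sketch of both points, assuming a Python where strings
index by code point. The helper utf16_code_units() is a
hypothetical name used only for this example.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+10400 lies outside the BMP: one code point, but two UTF-16 code units
# (a high surrogate followed by a low surrogate).
units = utf16_code_units("\U00010400")

# A lone surrogate can still sit in the data even when each index is a full
# code point; it only fails once you try to encode it strictly.
lone = "\ud801"
try:
    lone.encode("utf-8")
    lone_encodes = True
except UnicodeEncodeError:
    lone_encodes = False
```

Here units comes out as the pair 0xD801, 0xDC00, and the lone
surrogate is accepted as string data but rejected by the strict
UTF-8 encoder.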

Personally, I don't think that slicing Unicode is
such a big issue. If you know what you are doing,
things tend not to break - which is true for pretty
much everything you do in programming ;-)

-- 
Marc-Andre Lemburg
eGenix.com


