[Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum
guido at python.org
Thu Aug 25 05:11:20 CEST 2011
On Wed, Aug 24, 2011 at 7:47 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On Thu, Aug 25, 2011 at 12:29 PM, Guido van Rossum <guido at python.org> wrote:
>> Now I am happy to admit that for many Unicode issues the level at
>> which we have currently defined things (code units, I think -- the
>> thingies that encodings are made of) is confusing, and it would be
>> better to switch to the others (code points, I think). But characters
>> are right out.
>
> Indeed, code points are the abstract concept and code units are the
> specific byte sequences that are used for serialisation (FWIW, I'm
> going to try to keep this straight in the future by remembering that
> the Unicode character set is defined as abstract points on planes,
> just like geometry).

Hm, code points still look pretty concrete to me (integers in the
range 0 .. 2**21) and code units don't feel like byte sequences to me
(at least not UTF-16 code units -- in Python at least you can think of
them as integers in the range 0 .. 2**16).
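
A minimal sketch of the distinction, assuming a Python 3.3+ interpreter
(that is, one where PEP 393 has landed) and an arbitrarily chosen sample
string. It treats code points as plain integers (the actual upper bound is
0x10FFFF, a little under 2**21) and UTF-16 code units as integers below
2**16. The variable names are illustrative only.

```python
import sys

# 'a' followed by one code point outside the Basic Multilingual Plane.
s = "a\U0001F600"

# Code points: one integer per element of the string.
code_points = [ord(ch) for ch in s]

# UTF-16 code units: 16-bit integers. The non-BMP code point is
# represented by two of them (a surrogate pair).
data = s.encode("utf-16-le")
code_units = [
    int.from_bytes(data[i:i + 2], "little")
    for i in range(0, len(data), 2)
]

print([hex(cp) for cp in code_points])  # two code points
print([hex(cu) for cu in code_units])   # three code units
print(hex(sys.maxunicode))              # largest code point Python accepts
```

Here len(s) counts code points, while len(code_units) is what a UTF-16
based narrow build would have reported.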
> With narrow builds, code units can currently come into play
> internally, but with PEP 393 everything internal will be working
> directly with code points. Normalisation, combining characters and
> bidi issues may still affect the correctness of unicode comparison and
> slicing (and other text manipulation), but there are limits to how
> much of the underlying complexity we can effectively hide without
> being misleading.

Let's just define a Unicode string to be a sequence of code points and
let libraries deal with the rest. Ok, methods like lower() should
consider characters, but indexing/slicing should refer to code points.
Same for '=='; we can have a library that compares by applying (or
assuming?) certain normalizations. Tom C tells me that case-less
comparison cannot use a.lower() == b.lower(); fine, we can add that
operation to the library too. But this exceeds the scope of PEP 393,
right?
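
A rough sketch of what such library-level comparisons could look like,
assuming Python 3.3+ (str.casefold() did not exist when this message was
written). The helper names nfc_equal and caseless_equal are hypothetical,
not an existing API. A fuller caseless match would probably also normalize
its inputs; this version does not.

```python
import unicodedata


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a, b):
    """Compare two strings using full case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


composed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

# Indexing, len() and '==' all operate on code points.
print(len(composed), len(decomposed))
print(composed == decomposed)
print(nfc_equal(composed, decomposed))

# lower() leaves U+00DF (sharp s) alone, so the lower() comparison fails
# where case folding succeeds.
print("Stra\u00dfe".lower() == "STRASSE".lower())
print(caseless_equal("Stra\u00dfe", "STRASSE"))
```

Plain '==' stays a cheap code point comparison, and the normalizing and
case-folding variants are opt-in functions layered on top.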
--
--Guido van Rossum (python.org/~guido)